On 8/1/2024 12:10 PM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> Some amount of the cases where consistency issues have come up in my
>> case have to do with RAM-backed hardware devices, like the rasterizer
>> module. It has its own internal caches that need to be flushed, and not
>> flushing caches (between this module and CPU) when trying to "transfer"
>> control over things like the framebuffer or Z-buffer can result in
>> obvious graphical issues (and texture corruption doesn't necessarily
>> look good either).
> The approach taken on AMD64 CPUs is to have different memory types
> (and associated memory type range registers). Plain DRAM is
> write-back cached, but there is also write-through and uncacheable
> memory. For a frame buffer that is read by some hardware that can
> access the memory independently, write-through seems to be the way to
> go.
In this case, both ends tend to do write-back caching.
The rasterizer has a texture-cache, which reads from memory and needs to be flushed if a recently used texture is updated.
The module's front-end interface consists of a collection of MMIO registers. Setting a bit in one of the registers loads their contents into some FIFO buffers. Behind the FIFO buffers, the module removes a command and performs it. There is also a status flag in one of the MMIO registers which indicates whether the module is still busy or if the FIFO is full.
The MMIO registers are used to set things like the address of the framebuffer, the address of the bound texture, etc., which are arbitrary locations in the physical RAM space (there is no separate area for display memory).
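The submission handshake described above can be sketched roughly as follows. This is a minimal software model for illustration only: the register names, bit positions, and FIFO depth are all made up, not the real module's.

```python
# Sketch of the MMIO/FIFO command-submission handshake.
# All register names and status bits here are hypothetical.

STAT_BUSY = 1 << 0    # module is still working through commands
STAT_FULL = 1 << 1    # command FIFO cannot accept another entry
CTRL_SUBMIT = 1 << 0  # writing this bit latches the registers into the FIFO

class RasterizerModel:
    """Software stand-in for the MMIO front end."""
    def __init__(self, fifo_depth=8):
        self.fifo = []
        self.fifo_depth = fifo_depth
        self.regs = {}                     # staged parameter registers

    def read_status(self):
        s = 0
        if self.fifo:
            s |= STAT_BUSY
        if len(self.fifo) >= self.fifo_depth:
            s |= STAT_FULL
        return s

    def write_reg(self, name, value):
        self.regs[name] = value

    def write_ctrl(self, value):
        if value & CTRL_SUBMIT:
            self.fifo.append(dict(self.regs))  # snapshot current registers

    def step(self):
        if self.fifo:
            return self.fifo.pop(0)            # module consumes one command

def submit_span(dev, params):
    # Spin until the FIFO can take another entry (a real driver would be
    # waiting on actual hardware rather than stepping a model).
    while dev.read_status() & STAT_FULL:
        dev.step()
    for k, v in params.items():
        dev.write_reg(k, v)
    dev.write_ctrl(CTRL_SUBMIT)
```

The point of the FIFO is visible in `submit_span`: the CPU only blocks when the queue is full, and otherwise returns immediately after latching the registers.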
Because of the need for physical addresses and MMIO to use the module, the GL backend needs to run in kernel mode to use hardware rasterization (which is a drawback, as it means a bunch of system-call context-switch overhead is added to the mix for each "draw call").
Generally, a lot of calls which merely alter parameters can run in usermode; a call is made into the backend mostly to draw the current set of vertex arrays (if glBegin/glEnd is used, the geometry is built into a temporary set of vertex arrays, which are then drawn).
Note that the module doesn't work at the level of primitives; rather, it only does edge-walking, so it is given parameters for the left and right edges (something like a triangle will typically result in 1 or 2 requests, a quad in 1 to 3). It doesn't (itself) understand mipmapping, so mipmaps are treated as different textures as far as the rasterizer is concerned (in TKRA-GL, mipmaps are currently per-primitive in the backend).
Though, larger on-screen primitives are subdivided (for sake of limiting affine warping) so a large input primitive may still use multiple mipmap levels.
Though, subdividing does not seem to be the main source of rendered primitives, as the bulk of primitives tend to be below the size limit (though there is a certain amount of overhead needed to figure out whether a primitive is above or below the size limit).
There was an attempt to do "poor man's perspective correct", namely interpolating in Z rather than 1/Z, but this didn't work very well and could result in worse warping issues in some cases than the affine warping it was meant to prevent.
Main issue is that there is currently not really a viable way to calculate the needed reciprocal. The tricks used for dividing by the span length do not work for turning 1/Z into Z.
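The difference between the two interpolation schemes can be shown numerically: at the screen-space midpoint of a span, the perspective-correct depth is the harmonic mean of the endpoint depths (lerp 1/Z, then take the reciprocal), while lerping Z directly gives the arithmetic mean.

```python
# Why interpolating Z directly (the "poor man's" option) differs from
# interpolating 1/Z: compare the midpoint depth of a span under both schemes.

def affine_mid_z(z0, z1):
    return 0.5 * (z0 + z1)                # lerp Z directly

def correct_mid_z(z0, z1):
    inv = 0.5 * (1.0 / z0 + 1.0 / z1)     # lerp 1/Z ...
    return 1.0 / inv                      # ... then take the reciprocal

z0, z1 = 1.0, 10.0
print(affine_mid_z(z0, z1))    # 5.5
print(correct_mid_z(z0, z1))   # ~1.818: correct depth is much nearer
```

The large gap between the two values for a deep span is exactly the warping being discussed; and the reciprocal in `correct_mid_z` is the per-span operation that is hard to compute cheaply.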
As can be noted, the module only deals with square or rectangular power-of-2 textures, with pixels or texture blocks expressed in Morton order (which imposes a constraint on texture sizes).
A Morton-order texture can express sizes of, say:
128x128, 256x128, 128x64, ...
But not:
512x128, 256x64
And, only indirectly:
128x256, 64x128, ...
With these cases needing to be handled by switching the S and T coordinates for the texture.
Say, if one sees the texture coordinate as a mask of bits:
tstststststststs
Then any mask will either give an even number of bits (square texture) or odd (2x the width).
Luckily, this can deal with nearly all textures (or, can be enforced, by resampling textures if needed).
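The texel indexing above can be sketched as follows. This is one plausible packing consistent with the "tsts...ts" mask description (square textures interleave S and T evenly, a 2:1 texture gets one extra S bit, and taller-than-wide sizes swap S and T first); the module's actual bit order may differ.

```python
# Sketch of Morton (Z-order) texel indexing for 1:1 and 2:1 textures.

def part1by1(v):
    # Spread the low 16 bits of v so there is a 0 bit between each bit.
    v &= 0xFFFF
    v = (v | (v << 8)) & 0x00FF00FF
    v = (v | (v << 4)) & 0x0F0F0F0F
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    return v

def morton_index(s, t, w, h):
    if h > w:                     # tall texture: store transposed (swap S/T)
        s, t, w, h = t, s, h, w
    assert w == h or w == 2 * h, "only 1:1 and 2:1 aspects are expressible"
    if w == h:
        return (part1by1(t) << 1) | part1by1(s)
    # 2:1: interleave against the square h x h part; the extra (highest)
    # S bit selects the left/right half.
    hbits = h.bit_length() - 1
    ss, top = s & (h - 1), s >> hbits
    return (top << (2 * hbits)) | (part1by1(t) << 1) | part1by1(ss)
```

Note that 4:1 sizes like 512x128 would need two extra S bits, which the single-mask scheme cannot express; hence the resampling fallback.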
Generally, pixels are repacked on texture upload, as is the conversion from raster-order DXT1 or DXT5 to Morton-order UTX2. Can note that UTX2 exists mostly because DXT1 and DXT5 would have been more expensive.
Where, compressed texture formats were:
UTX1: 32-bit RGBD with 4x4x1 selectors, mostly defunct.
Would give a 12-bit RGB, with D giving the luminance delta;
The delta is basically added to the RGB to give a pair of endpoints;
No transparent pixels, image quality not great either.
UTX2: 64-bit, 2x RGB555, 4x4x2 selectors, multiple modes.
00: Behaves like opaque DXT1 (2bpp interpolated selectors)
01: Each selector functions as a 1-bit RGB and A selector.
There are two 3 bit alpha levels per block.
10: Mimics DXT1 behavior for blocks with transparent pixels.
11: Single RGBA gradient (with 2bpp interpolated selectors)
Color Endpoints Selectors:
00=ColorB
01=2/3 ColorB + 1/3 ColorA
10=1/3 ColorB + 2/3 ColorA
11=ColorA
Except in the DXT1 mimic mode:
00=ColorB
01=Behaves like prior 01 or 10 based on a bayer pattern
10=Transparent
11=ColorA
UTX3: 128-bit, 2x BGBA_FP8U | RGBA32, 2x 4x4x2
Interpolation similar to UTX2, just separate RGB and A;
Interpolates FP8U as raw 8-bit values, before conversion to FP16;
The 'BLKUTX3L' instruction interprets it as RGBA32/RGBA8888;
The 'BLKUTX3H' instruction interprets it as RGBA_FP8U;
Limited use thus far.
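The color-endpoint selector table (non-mimic modes) decodes roughly as below. The 5-to-8-bit expansion and the rounding of the 1/3 and 2/3 blends are guesses here; the hardware may round differently.

```python
# Sketch of UTX2 color-endpoint selector decode (non-mimic modes):
# expand RGB555 endpoints to 8 bits, then apply the 2/3 + 1/3 blends.

def expand555(c):
    r = (c >> 10) & 31
    g = (c >> 5) & 31
    b = c & 31
    # Replicate the high bits to expand 5-bit channels to 8 bits.
    return tuple((v << 3) | (v >> 2) for v in (r, g, b))

def decode_selector(sel, color_a, color_b):
    a, b = expand555(color_a), expand555(color_b)
    if sel == 0b00:
        return b
    if sel == 0b01:                       # 2/3 ColorB + 1/3 ColorA
        return tuple((2 * bb + aa + 1) // 3 for aa, bb in zip(a, b))
    if sel == 0b10:                       # 1/3 ColorB + 2/3 ColorA
        return tuple((bb + 2 * aa + 1) // 3 for aa, bb in zip(a, b))
    return a                              # 0b11
```

This matches opaque-DXT1-style interpolation, which is presumably the point of mode 00 behaving like DXT1.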
Technically, UTX3 could be a better stand-in for DXT5 (in terms of quality), as well as being able to take over the role of BC6H and BC7 (though, in most cases, would first require decoding these to RGBA and then repacking). Unlike BC6H or BC7, UTX3 has no concept of color partitions.
Thus far, only UTX2 is used by TKRA-GL (mostly for "effort reasons").
Uncompressed textures thus far are generally in the RGB555A format (or, would be RGBA_FP8U if HDR support were implemented).
TBD how this would be pulled off with the HW rasterizer, though the "poor man's" option would be to do HDR by mostly just treating the FP8U values like normal LDR values (and mostly having special case handling in the modulation/blending operators).
Though tacky, interpolating floating point values as integer values has usually worked "slightly better than one might otherwise expect" (though tends to be S-curved rather than linear).
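This effect is easy to demonstrate. The sketch below uses float32 bit patterns as a stand-in for FP8U (the principle is the same for any unsigned float format): lerping the raw bits hits the endpoints exactly and stays monotone, but follows a curve (roughly geometric for positive values) rather than a straight line.

```python
# Interpolating floats via their raw bit patterns, using float32 as a
# stand-in for FP8U.
import struct

def f2bits(f):
    return struct.unpack('<I', struct.pack('<f', f))[0]

def bits2f(b):
    return struct.unpack('<f', struct.pack('<I', b))[0]

def lerp_bits(a, b, t):
    # Lerp the integer bit patterns, then reinterpret as a float.
    ba, bb = f2bits(a), f2bits(b)
    return bits2f(int(ba + (bb - ba) * t))

def lerp_true(a, b, t):
    return a + (b - a) * t

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, lerp_bits(1.0, 4.0, t), lerp_true(1.0, 4.0, t))
# At t=0.5: bit-lerp gives 2.0 (the geometric mean), true lerp gives 2.5.
```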
The FIFO allows the requests to be submitted quickly so that the CPU can get on to the next primitive. Though, in general, this rasterizer module seems to be faster than the CPU's ability to keep it fed.
Had noted that, with this module, it was viable to do multitexturing. In this case two sets of parameters are calculated for each primitive, and then the requests for both texture layers are submitted to the module.
This can allow running GLQuake in lightmap mode.
Though, ran into a different problem here:
GLQuake's strategy for dealing with animated lights involves repeatedly recalculating and re-uploading the lightmap textures, which is too slow.
A faster option might be to modify the lightmap handling to build two copies of the lightmap:
One representing the maximum contribution from each non-static light source;
One representing the minimum contribution from each non-static light source.
Then, calculate the current relative contributions from the light sources, and select the texture based on comparing it to the median.
This would not deal with slow strobe effects, but would avoid burning the CPU on updating the dynamic lightmaps.
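The proposed selection step could look something like this. The per-style contribution representation and the decision rule (comparing the current total against the midpoint of the precomputed min/max totals) are an interpretation of the idea, not existing code.

```python
# Sketch of the min/max lightmap selection idea: two lightmaps are built
# ahead of time (minimum and maximum contribution of each animated light
# style), and each frame one of them is selected instead of re-uploading.

def select_lightmap(current, lo, hi):
    """current/lo/hi: per-style light contributions in [0, 1]."""
    # Compare the current total against the midpoint between the
    # precomputed minimum and maximum totals.
    mid = 0.5 * (sum(lo) + sum(hi))
    return "max_map" if sum(current) >= mid else "min_map"
```

The CPU cost per frame drops to a few additions and a compare, versus recomputing and re-uploading the lightmap texture.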
Ironically, had noted that Quake 3 has a cvar to disable the dynamic lightmap updates, implying that these were potentially still a performance issue in Quake 3's day.
Though, Quake 3 is generally slower than GLQuake (probably surprising pretty much no one); it might be faster if I can figure out how to convince it to use vertex arrays rather than the glBegin/glEnd interface.
In the past, trying to optimize the non-subdivided primitive case had mixed results. It has also seemed like I may need to rework how this part of the operation works at some point.
Say, after receiving a vertex array:
Extract the various coordinates from the various arrays depending on the types of the array elements, as appropriate for the type of primitive being drawn.
Currently, this separates primitives into triangles and quads, with larger primitives (such as polygons) being decomposed into triangles and quads (though, TRIANGLE_FAN is always decomposed to triangles; POLYGON may potentially also produce quads).
Then, for the primitive types, their vertices are pushed onto stacks.
A loop runs which pops off the vertices, projects them, checks against various properties (outside frustum, backface, needs to be split up, ...).
If everything is good, and the primitive is not discarded, the primitive is drawn. If it is over the size limit, it is broken apart and the parts are pushed back onto the stack (currently with the edge midpoints and subdivision all happening in world space).
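The pop/check/split loop above can be sketched as below, for triangles only, with the size check simplified to screen-space area (the actual criteria, and the subdivision happening in world space, are more involved).

```python
# Sketch of the stack-driven loop: pop a primitive, discard it if too
# small, draw it if within the size limit, otherwise split it at its
# edge midpoints and push the parts back onto the stack.

def midpoint(a, b):
    return tuple((x + y) / 2 for x, y in zip(a, b))

def tri_area(a, b, c):
    return abs((b[0]-a[0])*(c[1]-a[1]) - (c[0]-a[0])*(b[1]-a[1])) / 2

def process(tris, max_area, draw):
    stack = list(tris)
    while stack:
        a, b, c = stack.pop()
        area = tri_area(a, b, c)
        if area < 1.0:
            continue                       # below minimum size: discard
        if area > max_area:
            ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
            # Split into four sub-triangles and revisit them.
            stack += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
        else:
            draw((a, b, c))
```

One oversized triangle yields four draw requests after a single split; each split quadruples the count, which is why most primitives being under the limit matters for throughput.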
A likely redesign would be: rather than handling triangles and quads with vertices in parallel stacks, handle them using a single stack of primitive structs, with each primitive potentially having 3-6 vertices.
So, vertex-array handling splits things up into triangles and quads, as before, but then emits primitive structs onto a stack, with each primitive's vertices transformed into homogeneous coordinates in the process.
In this case, the primitive-stack loop would start in XYZW space, and do any clipping or decomposition in this space. Handling 6-sided primitives would also allow for primitives to be clipped against the frustum (rather than either kept or discarded; say, a quad being clipped at the edge or corner of the frustum potentially resulting in a hexagon, though cases resulting in pentagons are more likely).
The path split between projected triangles and quads would then happen after this stage. Though, could potentially merge the paths until later, treating triangles as a 3-vertex subset of a quad.
There is a little bit of ambiguity, as ideally one would want to delay the XYZ/W step (turns the homogeneous coordinates into XYZ screen-space) until actually emitting the primitive; but one needs to calculate this earlier to detect whether to subdivide a primitive or to discard small primitives.
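The "quad clipped at a frustum corner becomes a hexagon" case can be shown with a standard Sutherland-Hodgman clip. The sketch below clips in 2D against two screen edges as a stand-in for plane clipping in homogeneous XYZW space; each plane can add at most one vertex.

```python
# Sutherland-Hodgman clipping of a quad against two half-planes,
# illustrating how vertex count grows from 4 to 6 at a corner.

def clip_halfplane(poly, inside, intersect):
    out = []
    for i, cur in enumerate(poly):
        prev = poly[i - 1]
        if inside(cur):
            if not inside(prev):
                out.append(intersect(prev, cur))   # entering edge
            out.append(cur)
        elif inside(prev):
            out.append(intersect(prev, cur))       # leaving edge
    return out

def isect_x(a, b, x):   # intersection with the vertical line x = const
    t = (x - a[0]) / (b[0] - a[0])
    return (x, a[1] + t * (b[1] - a[1]))

def isect_y(a, b, y):   # intersection with the horizontal line y = const
    t = (y - a[1]) / (b[1] - a[1])
    return (a[0] + t * (b[0] - a[0]), y)

# A diamond-shaped quad poking out past the left and bottom screen edges:
quad = [(-1.0, 2.0), (2.0, 5.0), (5.0, 2.0), (2.0, -1.0)]
poly = clip_halfplane(quad, lambda p: p[0] >= 0, lambda a, b: isect_x(a, b, 0.0))
poly = clip_halfplane(poly, lambda p: p[1] >= 0, lambda a, b: isect_y(a, b, 0.0))
print(len(poly))   # 6: the quad became a hexagon
```

The first clip cuts one corner off (pentagon), the second cuts another (hexagon); a 6-vertex primitive struct is exactly enough to hold the worst common case.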
At present, all this seems to be the main "high traffic" area; there are generally fewer primitives to deal with after things like backface and frustum culling (or discarding primitives which fall below a minimum size limit, *, ...).
*: Say, we can assume that a primitive needs to have an area larger than 1 pixel, and any primitive that only covers a fraction of a pixel is discarded.
...