Subject : Re: "Mini" tags to reduce the number of op codes
From : bohannonindustriesllc (at) *nospam* gmail.com (BGB-Alt)
Groups : comp.arch
Date : 11. Apr 2024, 22:25:54
Organisation : A noiseless patient Spider
Message-ID : <uv9kh2$1tcks$1@dont-email.me>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13
User-Agent : Mozilla Thunderbird
On 4/11/2024 9:30 AM, Scott Lurndal wrote:
"Paul A. Clayton" <paaronclayton@gmail.com> writes:
On 4/9/24 8:28 PM, MitchAlsup1 wrote:
BGB-Alt wrote:
[snip]
Things like memcpy/memmove/memset/etc. are function calls in cases where they are not directly transformed into register load/store sequences.
>
My 66000 does not convert them into LD-ST sequences; MM is a single instruction.
>
I wonder if it would be useful to have an immediate count form of
memory move. Copying fixed-size structures would be able to use an
immediate. Aside from not having to load an immediate for such
cases, there might be microarchitectural benefits to using a
constant. Since fixed-sized copies would likely be limited to
smaller regions (with the possible exception of 8 MiB page copies)
and the overhead of loading a constant for large sizes would be
tiny, only providing a 16-bit immediate form might be reasonable.
It seems to me that an offloaded DMA engine would be a far
better way to do memmove (over some threshold, perhaps a
cache line) without trashing the caches. Likewise memset.
Probably.
One could argue that setting up a DMA'ed memmove would likely be expensive enough to make it impractical for small copies (the category where I am using inline Ld/St sequences or slides).
And larger copies (where a DMA engine is most likely to bring a benefit) at present seem to be mostly bus/memory bound.
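As a rough illustration of the split being described, a sketch in plain C; the 64-byte cutoff, the alignment check, and the fallback call are illustrative assumptions rather than the actual implementation, and the fallback is where a DMA hand-off would slot in once its setup cost can be amortized:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hedged sketch: small/medium aligned copies branch into an unrolled
 * "slide" so only the needed tail executes (Duff's-device style in C);
 * larger or unaligned copies call out instead (which is where an
 * offloaded DMA engine would be invoked). Non-overlapping buffers
 * are assumed. */
static void copy_slide(uint64_t *d, const uint64_t *s, size_t n_words)
{
    switch (n_words) {          /* branch into the middle of the slide */
    case 8: d[7] = s[7]; /* fall through */
    case 7: d[6] = s[6]; /* fall through */
    case 6: d[5] = s[5]; /* fall through */
    case 5: d[4] = s[4]; /* fall through */
    case 4: d[3] = s[3]; /* fall through */
    case 3: d[2] = s[2]; /* fall through */
    case 2: d[1] = s[1]; /* fall through */
    case 1: d[0] = s[0]; /* fall through */
    case 0: break;
    }
}

void copy_dispatch(void *dst, const void *src, size_t n)
{
    if (n <= 64 && (((uintptr_t)dst | (uintptr_t)src | n) & 7) == 0) {
        copy_slide(dst, src, n / 8);   /* small/medium, 8-byte aligned */
    } else {
        memcpy(dst, src, n);           /* large or unaligned: call out */
    }
}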
Sort of reminds me of the situation with the external rasterizer module:
The module itself draws quickly, but setting it up has, so far, been expensive enough to limit its benefit. So the main benefit it could bring is seemingly using it to pull off multi-textured lightmap rendering, which can then run at speeds similar to vertex lighting (lightmapped rendering being a somewhat slower option for the software rasterizer).
Well, that and my recently realizing a trick to mimic the look of trilinear filtering without increasing the number of texture fetches (mostly by distorting the interpolation coords, *). This trick could potentially be added to the rasterizer module.
*: Traditional bilinear needs 4 texel fetches and 3 lerps (or, for a poor man's approximation, 3 fetches and 2 lerps). Traditional trilinear needs 8 fetches and 7 lerps. The "cheap trick" version needs only the same as bilinear.
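For reference, the bilinear cost being counted above looks roughly like this in plain C (single-channel float texels, u/v already in texel space, and wrap-around addressing are simplifying assumptions; the "cheap trick" itself is not shown, since it only perturbs the coordinates fed into this same path):

/* Plain-C reference for the fetch/lerp counts above: bilinear is
 * 4 texel fetches and 3 lerps. Trilinear repeats this on a second
 * mip level and adds one more lerp (8 fetches, 7 lerps total).
 * Assumes 0 <= u < w and 0 <= v < h. */
static inline float lerp(float a, float b, float t)
{
    return a + (b - a) * t;
}

float sample_bilinear(const float *tex, int w, int h, float u, float v)
{
    int   x0 = (int)u,  y0 = (int)v;              /* integer texel coords */
    float fx = u - x0,  fy = v - y0;              /* fractional parts */
    int   x1 = (x0 + 1) % w, y1 = (y0 + 1) % h;   /* wrap to next texel */

    float t00 = tex[y0 * w + x0];                 /* fetch 1 */
    float t10 = tex[y0 * w + x1];                 /* fetch 2 */
    float t01 = tex[y1 * w + x0];                 /* fetch 3 */
    float t11 = tex[y1 * w + x1];                 /* fetch 4 */

    float top = lerp(t00, t10, fx);               /* lerp 1 */
    float bot = lerp(t01, t11, fx);               /* lerp 2 */
    return lerp(top, bot, fy);                    /* lerp 3 */
}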
One thing still needed is a good, fast, and semi-accurate way to compute Z=1.0/Z', as required for perspective-correct rasterization (affine requires subdivision, which adds cost to the front end, and interpolating Z directly adds significant distortion for geometry near the near plane).
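For what it is worth, one common software-side approach (a sketch only, not what the core actually does) is to start from a crude bit-level estimate of 1/Z and refine it with a few Newton-Raphson steps; each step roughly squares the relative error, so accuracy can be traded against iteration count:

#include <stdint.h>
#include <string.h>

/* Assumes z is a normal, positive IEEE-754 single in a sensible range
 * (e.g. depth values). The crude estimate flips the exponent and uses
 * a mantissa of 1.5, i.e. y0 ~= 0.75 * 2^(127-E), which is within
 * roughly +/-50% of 1/z. */
static float approx_recip(float z)
{
    uint32_t bits, guess;
    float y;

    memcpy(&bits, &z, sizeof bits);
    guess = ((253u << 23) - (bits & 0x7F800000u)) | 0x00400000u;
    memcpy(&y, &guess, sizeof y);

    /* Newton-Raphson for 1/z: y = y*(2 - z*y). Relative error goes
     * roughly 0.5 -> 0.25 -> 0.06 -> 0.004 (about 0.4% worst case
     * after three steps). */
    y = y * (2.0f - z * y);
    y = y * (2.0f - z * y);
    y = y * (2.0f - z * y);
    return y;
}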
Granted, this would almost seem to create a need for an OpenGL implementation designed around the assumption of a hardware rasterizer module rather than software span drawing.
The rasterizer module also has its own caching, so it sometimes needs to be signaled to perform a cache flush (such as when updating the contents of a texture, or when the framebuffer needs to be accessed for some other reason, ...).
Potentially, the module could be used to copy/transform images in a framebuffer (such as for GUI rendering), but it would need to be somewhat generalized for this (such as supporting the use of non-power-of-2 raster images as textures).
Though, another possibility could be adding a dedicated DMA module, or a DMA+Image module, or gluing dedicated DMA and Raster-Copy functionality onto the rasterizer module (as a separate thing from its normal "walk edges and blend pixels" functionality).
>
Did end up with an intermediate "memcpy slide", which can handle
medium size memcpy and memset style operations by branching into
a slide.
>
MMs and MSs that do not cross page boundaries are ATOMIC. The
entire system
sees only the before or only the after state and nothing in
between.
One might wonder how that atomicity is guaranteed in an SMP processor...
Dunno there.
My stuff doesn't guarantee atomicity in general.
The only way to ensure that both parties agree on the contents of memory is for both to flush their L1 caches, or similar.
Or, use "No Cache" memory accesses, which are basically implemented as the L1 cache auto-flushing the line as soon as the request finishes; for good effect, one also needs to add a few NOPs after the memory access to make sure the L1 has a chance to auto-flush it. Though, another possibility could be to add dedicated non-caching memory-access instructions.