On 7/26/2024 12:00 PM, Anton Ertl wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 7/25/2024 1:09 PM, BGB wrote:
At least with a weak model, software knows that if it doesn't go through
the rituals, the memory will be stale.
There is no guarantee of staleness, only a lack of stronger ordering
guarantees.
Probably true enough.
The idea here was more about needing to go through the rituals to get not-stale data, rather than about any ability to ensure that data is stale (not usually a thing).
Granted, if one had deterministically known stale memory, it is probable that software would find some way to become dependent on it.
>> The weak model is ideal for me. I know how to program for it
> And the fact that this model is so hard to use that few others know
> how to program for it makes it ideal for you.
It is not "that" bad, but one needs to know when to evict stuff from the caches.
In my case, mostly for sake of sanity, I ended up switching to modulo indexing for everything: while hashed indexing is generally more efficient (being less negatively affected by the tendency to align things), it is less well behaved.
And, for example, hashed indexing in the L1 cache effectively breaks the ability to double-map pages and have changes to one mapping be visible in the others.
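Roughly, the difference can be sketched in C (the line count and the hash function here are made-up for illustration, not the actual BJX2 parameters):

  #include <stdint.h>

  /* Illustrative index functions for a direct-mapped L1 with 16-byte
   * lines and 2048 lines (sizes and hash are assumptions). */
  #define NLINES 2048

  static inline uint32_t l1_index_modulo(uint32_t addr)
  {
      /* Uses only low address bits: aliased mappings that agree in
       * these bits select the same line. */
      return (addr >> 4) & (NLINES - 1);
  }

  static inline uint32_t l1_index_hashed(uint32_t addr)
  {
      /* Mixes in high address bits: two virtual mappings of the same
       * physical page generally select different lines, so the two
       * copies can silently go out of sync. */
      return ((addr >> 4) ^ (addr >> 15)) & (NLINES - 1);
  }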
Though, it does lead to a questionable practice (when using direct-mapped caches) of "knocking" or "sniping" cache lines:
One can calculate another address based on the first address, where accessing this other address will knock the first out of the caches.
If one uses modulo for both, then adding 512K or 1MB to an address, and accessing this location, is enough to knock a line out of both the L1 and L2 caches (with MMU disabled).
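In C, the knock could be sketched like so (assuming the MMU is disabled or an identity mapping, per the above; the +1MB offset is a multiple of both cache sizes):

  #include <stdint.h>

  /* Knock the line holding 'addr' out of both L1 and L2 by touching an
   * address that maps to the same lines under modulo indexing. */
  static inline void knock_line(volatile uint8_t *addr)
  {
      volatile uint8_t sink;
      sink = addr[1 << 20];  /* +1MB: same index in both L1 and L2 */
      (void)sink;
  }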
I have added an experimental set-associative cache between the L1 and L2 caches, but initially had to figure out ways to work around its natural tendency to break these access patterns (essentially adding some more rules to the mix).
Though, one possible option (in software) is simply to try to knock more addresses in an attempt to defeat the cache.
Say, try to knock something out of cache with the set-associative thing:
  MOV.Q (R4, 0x20000), R18   //knock out of L1
  MOV.Q (R4, 0x40000), R16
  MOV.Q (R4, 0x50000), R17
  MOV.Q (R4, 0x60000), R18
  MOV.Q (R4, 0x70000), R19
  //at this point, it is knocked out to L2
  MOV.Q (R4, 0x100000), R16  //knock out of L2
But... this doesn't work if the area at R4 is in virtual-memory pages (it only works with physical addresses or direct mapping).
Though, better is to use an "INVDC" instruction followed by a load:
  INVDC R4
  MOV.Q (R4), R5   //causes line to be evicted
Which signals to the cache hierarchy the intention to flush this address (and also stops the victim-cache module from trying to cache it).
Though, a subsequent load might still see whatever was in the cache at the time R5 was loaded; if this is undesirable:
  INVDC R4
  MOV.Q (R4), R5   //causes dirty line to be evicted
  INVDC R4         //L1 cache discards non-dirty line
  //next attempt to access address at R4 pulls it from L2 cache.
Currently, there is no way to express the desire to invalidate a specific cache line in the L2 cache (at present, L2 invalidation applies only to the entire L2 cache). L2 invalidation isn't generally needed apart from the RAM-checker and RAM counting (all of the other hardware in my current "SoC" exists on the inside edge of the L2 cache).
Thinking about it, did just go and add intrinsics for these:
  void __mem_invdc(void *ptr);
  void __mem_invic(void *ptr);
Partly as a possible way to reduce temptation to use cache-knocking.
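For example, a sketch of how the intrinsic might be used in place of knocking, following the INVDC+load sequence above (the flag variable is illustrative):

  #include <stdint.h>

  void __mem_invdc(void *ptr);

  /* Illustrative: a flag that some other bus master may have updated. */
  extern volatile uint32_t *shared_flag;

  uint32_t poll_flag_fresh(void)
  {
      __mem_invdc((void *)shared_flag);  /* request eviction of the line */
      (void)*shared_flag;                /* load forces dirty write-back */
      __mem_invdc((void *)shared_flag);  /* discard the now-clean line   */
      return *shared_flag;               /* this access refills from L2  */
  }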
There is the more naive option of picking a random unrelated chunk of memory and then doing a series of sequential memory loads across the entire chunk, but this is less efficient (though it generally makes more sense when the intention is to flush the entire cache).
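Roughly (the buffer size and 16-byte line step are assumptions):

  #include <stddef.h>
  #include <stdint.h>

  /* Naive whole-cache eviction: sweep loads across an unrelated buffer
   * at least as large as the cache; each load simply displaces whatever
   * was cached before. */
  void cache_sweep(volatile uint8_t *buf, size_t size)
  {
      volatile uint8_t sink;
      for (size_t i = 0; i < size; i += 16)
          sink = buf[i];
      (void)sink;
  }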
There is currently also the option of using "no cache" addresses, which also invoke special behavior in the caches; but these require being in supervisor mode or similar (and currently exist only for physical addresses).
Or, one can set a virtual-memory page to no-cache, in which case every access to it is non-caching.
There is a possible need for "Volatile Load" and "Volatile Store" instructions, but these still don't exist.
>> and it's more efficient
> That depends on the hardware.
> Yes, the Alpha 21164 with its imprecise exceptions was "more
> efficient" than other hardware for a while, then the Pentium Pro came
> along and gave us precise exceptions and more efficiency. And
> eventually the Alpha people learned the trick, too, and 21264 provided
> precise exceptions (although they did not admit this) and more
> efficiency.
> Similarly, I expect that hardware that is designed for good TSO or
> sequential consistency performance will run faster on code written for
> this model than code written for weakly consistent hardware will run
> on that hardware. That's because software written for weakly
> consistent hardware often has to insert barriers or atomic operations
> just in case, and these operations are slow on hardware optimized for
> weak consistency.
TSO requires more significant hardware complexity though.
Seems like it would be harder to debug the hardware, since:
  There is more that has to go on in the hardware for TSO to work;
  Software will have higher expectations that it actually works.
Though, if it did work, one could potentially use "stronger" caching in some areas, since the caching would not interfere with the ability of software to maintain memory consistency.
> By contrast, one can design hardware for strong ordering such that the
> slowness occurs only in those cases when actual (not potential)
> communication between the cores happens, i.e., much less frequently.
>> and sometimes use cases do not care if they encounter "stale" data.
> Great. Unless these "sometimes" cases are more often than the cases
> where you perform some atomic operation or barrier because of
> potential, but not actual communication between cores, the weak model
> is still slower than a well-implemented strong model.
Atomic operations are still a "needs work" area.
Barrier: No proper barrier instruction exists as of yet in BJX2, but there are some (weaker) cache-invalidation instructions.
There is the RISC-V FENCE.I instruction, but as-is this gets turned into an exception (which seems to be allowed, based on what I have read).
The exception handler will then generally proceed to flush the entire cache (so is not a high performance option).
So, in BJX2, there is:
  INVIC Rn   //invalidate I$, address/tag given in Rn
  INVDC Rn   //invalidate D$, address/tag given in Rn
Immediate forms are faked by the assembler.
Where, if given a memory address, they will invalidate whatever cache lines are associated with that address (more selective); or, if given an address of -1 or similar, they will invalidate the entire cache. INVDC can also be used to invalidate the L2 cache.
However, this does not mean that any dirty cache lines are actually written out to memory; this requires a sweep over the cache.
Say, some chunk of code was modified and one wants to make sure the I$ sees it:
  INVDC -1
  //read 64K or so of unrelated memory
  INVIC -1
  //I$ should now see the up-to-date cache lines.
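With the intrinsics above, a rough C equivalent (assuming they accept (void *)-1 for "entire cache" like the instructions do; the 64K scratch buffer is illustrative):

  #include <stddef.h>
  #include <stdint.h>

  void __mem_invdc(void *ptr);
  void __mem_invic(void *ptr);

  /* Illustrative 64K scratch region, used only to displace dirty lines. */
  static volatile uint8_t sweep_buf[65536];

  void icache_sync_all(void)
  {
      volatile uint8_t sink;
      __mem_invdc((void *)-1);     /* invalidate the entire D$ (dirty   */
                                   /* lines still need the sweep below) */
      for (size_t i = 0; i < sizeof(sweep_buf); i += 16)
          sink = sweep_buf[i];     /* sweep out any dirty lines         */
      (void)sink;
      __mem_invic((void *)-1);     /* I$ now refetches up-to-date lines */
  }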
Or, more precisely, for a single cache line (at the address in R4):
  INVDC R4         //invalidate line at R4
  MOV.Q (R4), R5   //causes R4 to be written back to L2
  INVIC R4         //invalidate line in I$
  ...
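The same can be generalized to a code range in C (a sketch; the 16-byte line size and the function name are assumptions):

  #include <stddef.h>
  #include <stdint.h>

  void __mem_invdc(void *ptr);
  void __mem_invic(void *ptr);

  /* Make the I$ see a freshly written code range, one line at a time. */
  void icache_sync_range(void *base, size_t len)
  {
      volatile uint8_t *p = (volatile uint8_t *)base;
      for (size_t i = 0; i < len; i += 16) {
          __mem_invdc((void *)(p + i));  /* mark the D$ line for eviction */
          (void)p[i];                    /* load forces write-back to L2  */
          __mem_invic((void *)(p + i));  /* drop the stale I$ line        */
      }
  }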
Or, one can try to knock the line, say:
  MOV.Q (R4, 262144), R5   //knock line out of L1
  INVIC R4                 //invalidate line in I$
  ...
But, this only works correctly with a direct-mapped cache.
One current annoyance: when using the hardware rasterizer module, cache flushing is needed when switching between HW rasterization operations and accessing the framebuffer from the CPU side; otherwise, graphical issues may result.
This ends up eating a not-too-small chunk of CPU time, and switching between HW and SW rasterization needs to be kept to a minimum.
For some operations, it is annoying. Currently my implementation lacks any efficient way to deal with GL_POINTS or GL_LINES in this case (though it is not as if they are fast in pure SW mode either).
From the way GL rasterization works, in the general case one needs to render points and lines using small/narrow edge walking (where the left and right edges are nearly identical, just with the right edge offset by 1 pixel; except for the special case of lines where dx>dy, ...).
While it may seem like one could use pixel-plotting or Bresenham line drawing, it is not so simple. Though, generally, these primitives are used infrequently enough that performance doesn't matter too much.
Pixel plotting could be used if GL_TEXTURE_2D is disabled and no fancy blending is used, but then needing to flush the caches when transferring control over the framebuffer would likely eat any advantage over drawing via edge walking.