On 11/15/2024 11:27 AM, Anton Ertl wrote:
> jseigh <jseigh_es00@xemaps.com> writes:
>> Anybody doing that sort of programming, i.e. lock-free or distributed
>> algorithms, who can't handle weakly consistent memory models, shouldn't
>> be doing that sort of programming in the first place.
> Do you have any argument that supports this claim.
>> Strongly consistent memory won't help incompetence.
> Strong words to hide lack of arguments?
In my case, as I see it:
The tradeoff is more about implementation cost, performance, etc.
Weak model:
Cheaper (and simpler) to implement;
Performs better when there is no need to synchronize memory;
Performs worse when there is need to synchronize memory;
...
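Say, as a minimal C11 sketch of where that tradeoff tends to land in code (illustrative only, not from any particular implementation): the relaxed/plain accesses stay cheap on a weak model, and the cost concentrates in the release/acquire pair; on a strong model the pair is nearly free, but every access pays for the stronger ordering in the memory system.

  #include <stdatomic.h>

  static int payload;              /* plain data, no ordering of its own */
  static atomic_int ready;         /* synchronization flag */

  void producer(void)
  {
      payload = 42;                                /* plain store */
      atomic_store_explicit(&ready, 1,
          memory_order_release);                   /* ordering paid here */
  }

  int consumer(void)
  {
      while (!atomic_load_explicit(&ready,
          memory_order_acquire))                   /* ordering paid here */
          ;                                        /* spin until flagged */
      return payload;                              /* guaranteed to see 42 */
  }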
However, local to the CPU core:
Not respecting things like RAW hazards does not seem well-advised.
Like, if we store to a location and then immediately read it back, one expects to see the most recently written value, not the previous value. Or, if one stores to two adjacent memory locations, one expects both stores to write their data correctly.
Granted, it is a tradeoff:
  Not bothering:
    Fast, cheap, but may break expected behavior;
    Could naively use NOPs if aliasing is possible, but this is bad.
  Add an interlock check, stall the pipeline if it happens:
    Works, but can add a noticeable performance penalty;
    My attempts at 75 and 100 MHz cores had often done this;
    Sadly, memory RAW and WAW hazards are not exactly rare.
  Use internal forwarding, so written data is used directly the next cycle
  (a rough C sketch of the idea follows the footnote below):
    Better performance;
    But, has a fairly high cost for the FPGA (*1).
*1: This factor (along with L1 cache sizes) weighs heavily in why I continue to use 50 MHz. Otherwise, I could use 75 MHz, but this internal forwarding logic, and L1 caches with 32K of BRAM (excluding metadata) and 1-cycle access, are not really viable at 75 MHz.
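Roughly, the forwarding idea, sketched conceptually in C (not the actual Verilog; all names are made up): the load checks a pending-store buffer before going to the memory array, which is exactly the case where a non-forwarding design would have to stall on the RAW hazard instead.

  /* Conceptual model of a 1-entry store buffer with store-to-load
     forwarding; a real core does this per byte lane inside the L1
     pipeline. Names are made up for illustration. */
  #include <stdint.h>
  #include <stdbool.h>

  #define RAM_WORDS (1 << 20)

  typedef struct {
      bool     valid;   /* pending store present? */
      uint32_t addr;    /* word index of the pending store */
      uint32_t data;    /* data of the pending store */
  } store_buf_t;

  static store_buf_t sb;
  static uint32_t    ram[RAM_WORDS];   /* stand-in for the memory array */

  void do_store(uint32_t addr, uint32_t data)
  {
      addr &= RAM_WORDS - 1;
      if (sb.valid)                    /* retire the older pending store */
          ram[sb.addr] = sb.data;
      sb.valid = true;                 /* hold the new store for a cycle */
      sb.addr  = addr;
      sb.data  = data;
  }

  uint32_t do_load(uint32_t addr)
  {
      addr &= RAM_WORDS - 1;
      if (sb.valid && sb.addr == addr) /* RAW hit: forward from the buffer */
          return sb.data;              /* (a no-forwarding core stalls here) */
      return ram[addr];                /* otherwise read the array */
  }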
For the L2 cache, which is much bigger, one can use a few extra pad-cycles to access the Block-RAM array. Though, 5-cycle latency for Load/Store operations would not be good.
Can note that with Block-RAM, the usual behavior seems to be that if one reads from one port while writing to another port on the same clock edge, and both are at the same location, the prior contents will be returned. This may be a general Verilog behavior though, rather than a Block-RAM thing (it also seems to apply to LUTRAM if accessed in the same pattern; though LUTRAM also allows reading the value via combinatorial logic rather than on a clock edge, which seems to always return the value from the most recent clock edge).
As I can note, a 4K or 8K L1 cache with stall on RAW or WAW, at 75 MHz, tends IME to perform worse than a 32K cache running at 50 MHz with no RAW/WAW stall.
Also, trying to increase MHz by increasing instruction latency was, in many cases, not ideal for performance either.
Granted, if I were to do things the "DEC Alpha" way, I probably could run stuff at 75 MHz, but then would likely need the compiler to insert a bunch of strategic NOPs so that the program doesn't break.
For memory ordering, a case could possibly be made (in my case) for an "order-respecting DRAM cache" via the MMIO interface, say:
F000_01000000..F000_3FFFFFFF
Could be defined to alias with the main RAM map, but with strictly sequential ordering for every memory access across all cores (at the expense of performance).
Where:
0000_00000000..7FFF_FFFFFFFF: Virtual Address Space
8000_00000000..BFFF_FFFFFFFF: Supervisor-Only Virtual Address Space
C000_00000000..CFFF_FFFFFFFF: Physical Address Space, Default Caching
D000_00000000..DFFF_FFFFFFFF: Physical Address Space, Volatile/NoCache
E000_00000000..EFFF_FFFFFFFF: Reserved
F000_00000000..FFFF_FFFFFFFF: MMIO Space
MMIO space is currently fully independent of RAM space.
However, at present:
FFFF_F0000000..FFFF_FFFFFFFF: MMIO Space, as Used for MMIO devices.
So, in theory, remerging RAM IO space into MMIO Space would be possible (well, except that trying to access HW MMIO address ranges via RAM-space access would likely be disallowed).
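For illustration, the relation between the different views of the same physical offset could be expressed something like this (a sketch; the helper names are mine, only the base values come from the map above):

  #include <stdint.h>

  #define PA_CACHED_BASE   0xC00000000000ULL  /* C000_...: default caching  */
  #define PA_NOCACHE_BASE  0xD00000000000ULL  /* D000_...: volatile/nocache */
  #define PA_MMIO_BASE     0xF00000000000ULL  /* F000_...: MMIO space       */

  /* Same physical offset, viewed through the default-cached window. */
  static inline uint64_t pa_cached(uint64_t pa_off)
  {
      return PA_CACHED_BASE + pa_off;
  }

  /* Same offset, viewed through the no-cache window. */
  static inline uint64_t pa_nocache(uint64_t pa_off)
  {
      return PA_NOCACHE_BASE + pa_off;
  }

  /* Same offset, through the proposed strictly ordered alias in MMIO
     space (so RAM at offset 0x01000000 would appear at F000_01000000). */
  static inline uint64_t pa_ordered(uint64_t pa_off)
  {
      return PA_MMIO_BASE + pa_off;
  }

So, in principle, a lock-free structure that really wants sequential ordering could just be accessed through the F000_01xxxxxx view, at the cost of every such access taking the slow path.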
Can note, MMU disabled:
0000_00000000..0FFF_FFFFFFFF: Same as C000..CFFF space.
1000_00000000..7FFF_FFFFFFFF: Invalid
...
Granted, current scheme does set a limit of 16TB of RAM.
But, biggest FPGA boards I have only have 256MB, so, ...
And, current VA map within TestKern (from memory):
0000_00000000..0000_00FFFFFF: NULL Space
0000_01000000..0000_3FFFFFFF: RAM Range (Identity Mapped)
0000_40000000..0000_BFFFFFFF: Direct Page Mapping (no swap)
0001_00000000..3FFF_FFFFFFFF: Mapped to swapfile, Global
4000_00000000..7FFF_FFFFFFFF: Process Local
Note that, within the RAM-range, the RAM will wrap around. The specifics of the wraparound are used to detect RAM size (this would set an effective limit at 512MB, after which no wraparound would be detected).
Specifics here would need to change if larger RAM sizes were supported.
Not sure how RAM size is detected with DIMM modules. IIRC, with PCs, it was more a matter of probing along linearly until one finds an address that no longer returns valid data (say, if one hits the 1GB mark and gets back 00000000 or FFFFFFFF or similar, assume end of RAM at 1GB).
One does need to make sure the caches (including the L2 cache) are flushed during all this, as the caches doing their usual cache thing may otherwise cause one to detect more RAM than actually exists.
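As a sketch of the probe-with-wraparound idea (illustrative only; flush_caches() and the constants are placeholders, not the actual TestKern code):

  #include <stdint.h>

  #define RAM_BASE   0x01000000UL   /* start of the RAM range             */
  #define RAM_LIMIT  0x20000000UL   /* 512MB: past this, no wraparound
                                       is visible within the RAM window   */
  #define MARKER     0x55AA1234UL

  extern void flush_caches(void);   /* must flush L1 and L2 (placeholder) */

  uint32_t detect_ram_size(void)
  {
      volatile uint32_t *base = (volatile uint32_t *)RAM_BASE;
      uint32_t step;

      for (step = 0x00100000; step <= RAM_LIMIT; step <<= 1) {
          volatile uint32_t *probe =
              (volatile uint32_t *)(RAM_BASE + step);

          *probe = 0;               /* clear the would-be alias           */
          *base  = MARKER;          /* write the marker at the bottom     */
          flush_caches();           /* make sure reads hit actual DRAM    */

          if (*probe == MARKER)     /* marker visible at base+step:       */
              return step;          /* RAM wrapped, so size is 'step'     */
      }
      return RAM_LIMIT;             /* no wrap seen: at least 512MB       */
  }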
...