On 11/12/2024 6:14 AM, aph@littlepinkcloud.invalid wrote:
> Humm... It makes me think of, well... does an atomic RMW have implied
> membars, or are they completely separated akin to the SPARC membar
> instruction? LOCK'ed RMW on Intel, XCHG instruction aside wrt its
> implied LOCK prefix, well, they are StoreLoad! Shit.
>
> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>> Any idea what is the advantage for them having all these various
>> LDxxx and STxxx instructions that only seem to combine a LD or ST
>> with a fence instruction? Why have
>>   LDAPR  Load-Acquire RCpc Register
>>   LDAR   Load-Acquire Register
>>   LDLAR  LoadLOAcquire Register
>> plus all the variations for byte, half, word, and pair,
>> instead of just the standard LDx and a general data fence instruction?
>
> All this, and much more, can be discovered by reading the AMBA
> specifications. However, the main point is that the content of the
> target address does not have to be transferred to the local cache:
> these are remote atomic operations. Quite nice for things like
> fire-and-forget counters, for example.

I ended up mostly with a simpler model, IMO:
Normal / RAM-like: Fetch cache line, write back when evicting;
Operations: LoadTile, StoreTile, SwapTile,
LoadPrefetch, StorePrefetch
Volatile (RAM like): Fetch, operate, write-back;
MMIO: Remote Load/Store/Swap request;
Operation is performed on target;
Currently only supports DWORD and QWORD access;
Operations are strictly sequential.
In theory, MMIO access could be added to RAM, but unclear if worth the added cost and complexity of doing so. Could more easily enforce strict consistency.
The LoadPrefetch and StorePrefetch operations:
LoadPrefetch, try to perform a load from RAM
Always responds immediately
Signals whether it was an L2 hit or L2 Miss.
StorePrefetch
Basically like LoadPrefetch
Signals that the intention is to write to memory.
In my cache and bus design, I sometimes refer to cache lines as "tiles" partly because of how I viewed them as operating, which didn't exactly match the online descriptions of cache lines.
Say:
Tile:
16 bytes in the current implementation.
Accessed in even and odd rows
A memory access may span an even tile and an odd tile;
The L1 caches need to have a matched pair of tiles for an access.
Cache Line:
Usually described as always 32 bytes;
Descriptions seemed to assume only a single row of lines in caches.
Generally no mention of allowing for an even/odd scheme.
Seemingly, a cache that operated with cache lines would use a single row of 32-byte cache lines, with misaligned accesses presumably spanning a pair of adjacent cache lines. To fit with BRAM access patterns, it would likely be necessary to split lines in half and then mirror the relevant tag bits (to allow detecting hit/miss).
However, online descriptions generally made no mention of how misaligned accesses were intended to be handled within the limits of a dual-ported RAM (1R1W).
My L2 cache operates in a way more like that of traditional descriptions of cache lines, except that they are currently 64 bytes in my L2 cache (and internally subdivided into four 16-byte parts).
The use of 64 bytes was mostly because this size got the most bandwidth with my DDR interface (with 16 or 32 byte transfers, more cycles are spent overhead; however latency was lower).
In this case, the L2<->RAM interface:
512 bit Load Data
512 bit Store Data
Load Address
Store Address
Request Code (IDLE/LOAD/STORE/SWAP)
Request Sequence Number
Response Code (READY/OK/HOLD/FAIL)
Response Sequence Number
Originally, there were no sequence numbers, and IDLE/READY signaling was used between each request (needed to return to this state before starting a new request). The sequence numbers avoided needing to return to an IDLE/READY state, allowing the bandwidth over this interface to be nearly doubled.
In a SWAP request, the Load and Store are performed end to end.
General bandwidth for a 16-bit DDR2 chip running at 50MHz (DLL disabled, effectively a low-power / standby mode) is ~ 90 MB/sec (or 47 MB/s each direction for SWAP), which is fairly close to the theoretical limit (internally, the logic for the DDR controller runs at 100MHz, driving IO as 100MHz SDR, albeit using both posedge and negedge for sampling responses from the DDR chip, so ~ 200 MHz if seen as SDR).
Theoretically, would be faster to access the chip using the SERDES interface, but:
Hadn't gone up the learning curve for this;
Unclear if I could really effectively utilize the bandwidth with a 50MHz CPU and my current bus;
Actual bandwidth gains would be smaller, as then CAS and RAS latency would dominate.
Could in theory have used Vivado MIG, but then I would have needed to deal with AXI, and never crossed the threshold of wanting to deal with AXI.
Between CPU, L2, and various other devices, I am using a ringbus:
Connections:
128 bits data;
48 bits address (96 bits between L1 caches and TLB);
16 bits: request/response code and flags;
16 bits: source/dest node and request sequence number;
Each node has a set of input and output connections;
Each node may modify a request/response,
or simply forward from input to output.
Messages move along at one position per clock cycle.
Generally also 50 MHz at present (*1).
*1: Pretty much everything (apart from some hardware interfaces) runs on the same clock. Some devices needed faster clocks. Any slower clocks were generally faked using accumulator dividers (add a fraction every clock-cycle and use the MSB of the accumulator as the virtual clock).
Comparably, the per-node logic cost isn't too high, nor is the logic complexity. However, performance of the ring is very sensitive to ring latency (and there are a number of hacks to try to reduce the overall latency of the ring in common paths).
At present, the highest resolution video modes that can be managed semi-effectively are 640x400 and 640x480 256-color (60Hz), or ~ 20 MB/sec.
Can do 800x600 or similar in RGBI or color-cell modes (640x400 or 640x480 CC also being an option). Theoretically, there is a 1024x768 monochrome mode, but this is mostly untested. The 4-color and monochrome modes had optional Bayer-pattern sub-modes to mimic full color.
Main modes I have ended up using:
80x25 and 80x50 text/color-cell modes;
Text and color cell graphics exist in the same mode.
320x200 hi-color (RGB555);
640x400 indexed 256 color.
Trying to go much higher than this, the combination of ringbus latency and L2 misses turns the display into a broken mess (with a DRAM-backed framebuffer). Originally, I had the framebuffer in Block-RAM, but this set a hard limit based on framebuffer size (moving the framebuffer to DRAM also allowed for a bigger L2 cache).
Theoretically, could allow higher resolution modes by adding a fast path between the display output and DDR RAM interface (with access then being multiplexed with the L2 cache). Have not done so.
Or, possible but more radical:
Bolt the VGA output module directly to the L2 cache;
Could theoretically do 800x600 high-color
Would eat around 2/3 of total RAM bandwidth.
Major concern here is that setting resolutions too high would starve the CPU of the ability to access memory (vs the current situation where trying to set higher resolutions mostly results in progressively worse display glitches).
Logic would need to be in place so that display can't totally hog the RAM interface. If doing so, may also make sense to move from color-cell and block-organized memory to fully raster oriented frame-buffers.
Though, despite being more wonky / non-standard, the block-oriented framebuffer layout has tended to play along better with memory fetch. A raster-oriented framebuffer is much more sensitive to timing and access-latency issues compared with 4x4 or 8x8 pixel blocks, with the display working on an internal cache of around 2 .. 4 rows of blocks.
Raster generally needs results to be streamed in-order and at a consistent latency, whereas blocks can use hit/miss handling, with a hit/miss probe running ahead of the current raster position (and hopefully able to get the block fetched before it is time to display it). Though, I did add logic to the display to avoid sending new prefetch requests for a block if it is still waiting for a response on that block (mostly as otherwise the VRAM cache was spamming the ringbus with excessive prefetch requests). Where, in this case, the VRAM was using exclusively prefetch requests during screen refresh.
In practice, it may not matter that much if the hardware framebuffer is block-ordered rather than raster. The OS's display driver is the only thing that really needs to care. Main case where it could arguably "actually matter" being full-screen programs using a DirectDraw style interface, but likely doesn't matter that much if the program is given an off-screen framebuffer to draw into rather than the actual hardware framebuffer (with the contents being copied over during a "buffer swap" event).
But, as noted, I was mostly using a partly GDI+VfW inspired interface, which seems "mostly OK". The difference in overhead isn't that large; and "Draw this here bitmap onto this HDC" offers a certain level of hardware abstraction, as there is no implicit assumption that the pixel format in one's bitmap object needs to match the format and layout of the display device.
Never mind that, for GUI-like operation, programs/windows were mostly operating in hi-color, with stuff being awkwardly converted to 256 color during window-stack redraw. Granted, I am pretty sure old-style Windows didn't work this way, and per-window framebuffers eat a lot of RAM (note that the shell had tabs, but all the tabs share a single window framebuffer; each instead has a separate character-cell buffer, and the cells are redrawn to the window buffer either when switching tabs or when more text is printed).
Had considered option for 256-color or 16 color window buffers (to save RAM), but haven't done so yet (for now, if drawing a 16 or 256 color bitmap, it is internally converted to hi-color). More likely, would switch to 256 color window buffers if using a 256 color output mode (so conversion to 256 color would be handled when drawing the bitmap to the window, rather than later in the process).
Well, I guess sort of similar wonk that the internal audio mixing is using 16-bit PCM, whereas the output is A-Law (for the hardware loop buffer). But, a case could be made for doing the OS level audio mixing as Binary16.
Either way, longer term future of my project is uncertain...
And, unclear if a success or failure.
It mostly did about as well as I could expect.
Never did achieve my original goal of "fast enough to run Quake at decent framerates", but partly because younger self didn't realize everything would be stuck at 50 MHz (or that 75 or 100 MHz core would end up needing to be comparably anemic scalar RISC cores; which still can't really get good framerates in Quake, *).
*: A 100 MHz RV64G core still will not make Quake fast. Extra not helped if one needs to use smaller L1 caches and generate stalls on memory RAW hazards, ...
Sadly, the closest I have gotten thus far involves GLQuake and a hardware rasterizer module. And then the problem that performance is more limited by trying to get geometry processed and fed into the module than by its ability to walk edges and rasterize stuff. Would seemingly ideally need something that could also perform transforms and deal with perspective-correct texture filtering (vs CPU-side transforms, and dynamic subdivision + affine texture filtering).
...
>> The execution time of each is the same, and the main cost is the fence
>> synchronizing the Load Store Queue with the cache, flushing the cache
>> comms queue and waiting for all outstanding cache ops to finish.
>
> One other thing to be aware of is that the StoreLoad barrier needed
> for sequential consistency is logically part of an LDAR, not part of a
> STLR. This is an optimization, because the purpose of a StoreLoad in
> that situation is to prevent you from seeing your own stores to a
> location before everyone else sees them.
>
> Andrew.