Re: Arm ldaxr / stxr loop question

Subject: Re: Arm ldaxr / stxr loop question
From: chris.m.thomasson.1 (at) *nospam* gmail.com (Chris M. Thomasson)
Newsgroups: comp.arch
Date: 13 Nov 2024, 02:53:58
Organization: A noiseless patient Spider
Message-ID: <vh10rl$1rohh$10@dont-email.me>
References: 1 2 3 4 5 6 7 8 9 10 11 12
User-Agent: Mozilla Thunderbird
On 11/12/2024 2:55 PM, BGB wrote:
On 11/12/2024 6:14 AM, aph@littlepinkcloud.invalid wrote:
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Any idea what is the advantage for them having all these various
LDxxx and STxxx instructions that only seem to combine a LD or ST
with a fence instruction? Why have
LDAPR Load-Acquire RCpc Register
LDAR Load-Acquire Register
LDLAR LoadLOAcquire Register
>
plus all the variations for byte, half, word, and pair,
instead of just the standard LDx and a general data fence instruction?
>
All this, and much more can be discovered by reading the AMBA
specifications. However, the main point is that the content of the
target address does not have to be transferred to the local cache:
these are remote atomic operations. Quite nice for things like
fire-and-forget counters, for example.
>
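 As an aside, a fire-and-forget counter of that sort can be expressed as a relaxed atomic increment; whether it actually executes as a remote/far atomic is up to the interconnect, so this is just a C11 sketch of the software side:

  #include <stdatomic.h>

  static _Atomic unsigned long hit_count;

  void count_hit(void)
  {
      /* No return value and no ordering needed, so a relaxed fetch-add
         is enough; with ARMv8.1 LSE this can compile down to a single
         atomic-add instruction whose result is discarded, which the
         fabric may then perform near memory instead of pulling the
         line into the local cache. */
      atomic_fetch_add_explicit(&hit_count, 1, memory_order_relaxed);
  }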
 I ended up mostly with a simpler model, IMO:
   Normal / RAM-like: Fetch cache line, write back when evicting;
     Operations: LoadTile, StoreTile, SwapTile,
       LoadPrefetch, StorePrefetch
   Volatile (RAM like): Fetch, operate, write-back;
   MMIO: Remote Load/Store/Swap request;
     Operation is performed on target;
     Currently only supports DWORD and QWORD access;
     Operations are strictly sequential.
 In theory, MMIO-style access could also be added for RAM, but unclear if worth the added cost and complexity of doing so; it would more easily enforce strict consistency.
 The LoadPrefetch and StorePrefetch operations:
   LoadPrefetch, try to perform a load from RAM
     Always responds immediately
     Signals whether it was an L2 hit or L2 Miss.
   StorePrefetch
     Basically like LoadPrefetch
     Signals that the intention is to write to memory.
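 Loosely analogous, at the compiler level, to software prefetch hints with read/write intent (not the same mechanism as the bus-level LoadPrefetch/StorePrefetch above, just an analogy using the GCC/Clang builtin):

  /* rw=0: prefetch for reading; rw=1: prefetch with intent to write,
     so the line can be brought in in a writable state.
     The last argument is a temporal-locality hint (0..3). */
  void touch_for_read (const void *p) { __builtin_prefetch(p, 0, 3); }
  void touch_for_write(void *p)       { __builtin_prefetch(p, 1, 3); }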
  In my cache and bus design, I sometimes refer to cache lines as "tiles" partly because of how I viewed them as operating, which didn't exactly match the online descriptions of cache lines.
 Say:
   Tile:
     16 bytes in the current implementation.
     Accessed in even and odd rows
       A memory access may span an even tile and an odd tile;
       The L1 caches need to have a matched pair of tiles for an access.
   Cache Line:
     Usually described as always 32 bytes;
     Descriptions seemed to assume only a single row of lines in caches.
       Generally no mention of allowing for an even/odd scheme.
 Seemingly, a cache that operated with cache lines would use a single row of 32-byte cache lines, with misaligned accesses presumably spanning a pair of adjacent cache lines. To fit with BRAM access patterns, would likely need to split lines in half, and then mirror the relevant tag bits (to allow detecting hit/miss).
 However, online descriptions generally made no mention of how misaligned accesses were intended to be handled within the limits of a dual-ported RAM (1R1W).
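 Roughly, the even/odd split means any (up to) 16-byte access touches one even tile and one odd tile, so both can be fetched from a pair of 1R1W block RAMs in the same cycle. An illustrative C model of the indexing (16-byte tiles and a direct-mapped arrangement assumed; set count made up):

  #include <stdint.h>

  #define TILE_SHIFT 4            /* 16-byte tiles                     */
  #define SETS       256          /* sets per even/odd bank (made up)  */

  void tile_pair_index(uint64_t addr, uint32_t *ix_even, uint32_t *ix_odd)
  {
      uint64_t t0 = addr >> TILE_SHIFT;   /* tile holding the first byte  */
      uint64_t t1 = t0 + 1;               /* next tile, in case the
                                             access spills over into it   */
      uint64_t te = (t0 & 1) ? t1 : t0;   /* whichever of the two is even */
      uint64_t to = (t0 & 1) ? t0 : t1;   /* whichever of the two is odd  */
      *ix_even = (uint32_t)((te >> 1) % SETS);
      *ix_odd  = (uint32_t)((to >> 1) % SETS);
  }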
  My L2 cache operates in a way more like that of traditional descriptions of cache lines, except that they are currently 64 bytes in my L2 cache (and internally subdivided into four 16-byte parts).
 The use of 64 bytes was mostly because this size got the most bandwidth with my DDR interface (with 16 or 32 byte transfers, more cycles are spent on overhead; however, latency was lower).
 In this case, the L2<->RAM interface:
   512 bit Load Data
   512 bit Store Data
   Load Address
   Store Address
   Request Code (IDLE/LOAD/STORE/SWAP)
   Request Sequence Number
   Response Code (READY/OK/HOLD/FAIL)
   Response Sequence Number
 Originally, there were no sequence numbers, and IDLE/READY signaling was used between each request (needed to return to this state before starting a new request). The sequence numbers avoided needing to return to an IDLE/READY state, allowing the bandwidth over this interface to be nearly doubled.
 In a SWAP request, the Load and Store are performed end to end.
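 In rough C terms, the interface amounts to something like the following (field names and widths here are just illustrative, not the actual signal definitions):

  #include <stdint.h>

  enum { REQ_IDLE, REQ_LOAD, REQ_STORE, REQ_SWAP };
  enum { RSP_READY, RSP_OK, RSP_HOLD, RSP_FAIL };

  typedef struct {
      uint8_t  load_data [64];   /* 512-bit load data path           */
      uint8_t  store_data[64];   /* 512-bit store data path          */
      uint64_t load_addr;
      uint64_t store_addr;
      uint8_t  req_code;         /* IDLE / LOAD / STORE / SWAP       */
      uint8_t  req_seq;          /* request sequence number          */
      uint8_t  rsp_code;         /* READY / OK / HOLD / FAIL         */
      uint8_t  rsp_seq;          /* response sequence number         */
  } L2RamIf;

  /* With sequence numbers, a new request can be raised as soon as the
     response sequence number matches the previous request, without
     first dropping back to IDLE/READY; a SWAP uses both address/data
     pairs so the load and store run back to back. */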
 General bandwidth for a 16-bit DDR2 chip running at 50MHz (DLL disabled, effectively a low-power / standby mode) is ~ 90 MB/sec (or 47 MB/s each direction for SWAP), which is fairly close to the theoretical limit (internally, the logic for the DDR controller runs at 100MHz, driving IO as 100MHz SDR, albeit using both posedge and negedge for sampling responses from the DDR chip, so ~ 200 MHz if seen as SDR).
 Theoretically, would be faster to access the chip using the SERDES interface, but:
   Hadn't gone up the learning curve for this;
   Unclear if I could really effectively utilize the bandwidth with a 50MHz CPU and my current bus;
   Actual bandwidth gains would be smaller, as then CAS and RAS latency would dominate.
 Could in theory have used Vivado MIG, but then I would have needed to deal with AXI, and never crossed the threshold of wanting to deal with AXI.
  Between CPU, L2, and various other devices, I am using a ringbus:
   Connections:
     128 bits data;
     48 bits address (96 bits between L1 caches and TLB);
     16 bits: request/response code and flags;
     16 bits: source/dest node and request sequence number;
   Each node has a set of input and output connections;
     Each node may modify a request/response,
       or simply forward from input to output.
     Messages move along at one position per clock cycle.
       Generally also 50 MHz at present (*1).
 *1: Pretty much everything (apart from some hardware interfaces) runs on the same clock. Some devices needed faster clocks. Any slower clocks were generally faked using accumulator dividers (add a fraction every clock-cycle and use the MSB of the accumulator as the virtual clock).
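 The accumulator-divider trick, as a small C model (the real thing being the same addition done in logic every cycle of the 50 MHz clock):

  #include <stdint.h>

  /* The MSB toggles at roughly (inc / 2^32) * 50 MHz; e.g. with
     inc = 1u<<31 the "virtual clock" runs at ~25 MHz. */
  static uint32_t acc, inc = 1u << 31;

  int virtual_clock(void)            /* evaluated once per real cycle */
  {
      acc += inc;
      return (acc >> 31) & 1;        /* MSB of the accumulator        */
  }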
  Comparably, the per-node logic cost isn't too high, nor is the logic complexity. However, performance of the ring is very sensitive to ring latency (and there are a number of hacks to try to reduce the overall latency of the ring in common paths).
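 Per node, the behavior is basically "consume or forward"; a rough C model of one node per clock edge (how the 16 routing bits are split between dest/source/sequence is a guess here):

  #include <stdint.h>

  typedef struct {
      uint64_t data[2];     /* 128-bit payload                          */
      uint64_t addr;        /* 48-bit address (96-bit L1<->TLB case)    */
      uint16_t op;          /* request/response code and flags          */
      uint16_t route;       /* source/dest node + request sequence no.  */
  } RingMsg;

  RingMsg handle_request(RingMsg m);     /* node-specific handling      */

  RingMsg ring_node_step(RingMsg in, unsigned my_node)
  {
      if (((in.route >> 8) & 0xFF) == my_node)  /* assumed dest field   */
          return handle_request(in);            /* modify/respond       */
      return in;                                /* else just forward    */
  }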
  At present, the highest-resolution video modes that can be managed semi-effectively are 640x400 and 640x480 256-color (60Hz), or ~ 20 MB/sec (640x480 x 1 byte/pixel x 60Hz ~= 18.4 MB/sec of refresh bandwidth).
 Can do 800x600 or similar in RGBI or color-cell modes (640x400 or 640x480 CC also being an option). Theoretically, there is a 1024x768 monochrome mode, but this is mostly untested. The 4-color and monochrome modes had optional Bayer-pattern sub-modes to mimic full color.
 Main modes I have ended up using:
   80x25 and 80x50 text/color-cell modes;
     Text and color cell graphics exist in the same mode.
   320x200 hi-color (RGB555);
   640x400 indexed 256 color.
  Trying to go much higher than this, the combination of ringbus latency and L2 misses turns the display into a broken mess (with a DRAM-backed framebuffer). Originally, I had the framebuffer in Block-RAM, but this in turn set a hard limit on framebuffer size (putting the framebuffer in DRAM also allowed for a bigger L2 cache).
 Theoretically, could allow higher resolution modes by adding a fast path between the display output and DDR RAM interface (with access then being multiplexed with the L2 cache). Have not done so.
 Or, possible but more radical:
   Bolt the VGA output module directly to the L2 cache;
   Could theoretically do 800x600 high-color
      Would eat around 2/3 of total RAM bandwidth (800x600 x 2 bytes x 60Hz ~= 58 MB/sec, vs ~90 MB/sec total).
 Major concern here is that setting resolutions too high would starve the CPU of the ability to access memory (vs the current situation where trying to set higher resolutions mostly results in progressively worse display glitches).
 Logic would need to be in place so that display can't totally hog the RAM interface. If doing so, may also make sense to move from color-cell and block-organized memory to fully raster oriented frame-buffers.
 Though, despite being more wonky / non-standard, the block-oriented framebuffer layout has tended to play along better with memory fetch. A raster-oriented framebuffer is much more sensitive to timing and access-latency issues compared with 4x4 or 8x8 pixel blocks, with the display working on an internal cache of around 2 .. 4 rows of blocks.
 Raster generally needs results to be streamed in-order and at a consistent latency, whereas blocks can use hit/miss handling, with a hit/miss probe running ahead of the current raster position (and hopefully able to get the block fetched before it is time to display it). Though, did add logic to the display to avoid sending new prefetch requests for a block if it is still waiting for a response on that block (mostly as otherwise the VRAM cache was spamming the ringbus with excessive prefetch requests). Where, in this case, the VRAM was using exclusively prefetch requests during screen refresh.
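 For the block-ordered layout, addressing a pixel means picking the block first and then the pixel within it; an illustrative sketch for 8x8 blocks of 8-bit pixels (the exact layout used may differ), where each block is then 64 bytes, i.e. one fetchable unit rather than a raster span that has to arrive in order:

  #include <stdint.h>

  #define SCR_W        640
  #define BLK_W        8
  #define BLK_H        8
  #define BLKS_PER_ROW (SCR_W / BLK_W)

  uint32_t block_fb_offset(uint32_t x, uint32_t y)
  {
      uint32_t bx = x / BLK_W, by = y / BLK_H;           /* which block   */
      uint32_t px = x % BLK_W, py = y % BLK_H;           /* where in it   */
      return (by * BLKS_PER_ROW + bx) * (BLK_W * BLK_H)  /* block base    */
           + (py * BLK_W + px);                          /* pixel offset  */
  }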
 In practice, it may not matter that much if the hardware framebuffer is block-ordered rather than raster. The OS's display driver is the only thing that really needs to care. Main case where it could arguably "actually matter" being full-screen programs using a DirectDraw style interface, but likely doesn't matter that much if the program is given an off-screen framebuffer to draw into rather than the actual hardware framebuffer (with the contents being copied over during a "buffer swap" event).
 But, as noted, I was mostly using a partly GDI+VfW-inspired interface, which seems "mostly OK". The difference in overhead isn't that large, and "draw this here bitmap onto this HDC" offers a certain level of hardware abstraction, as there is no implicit assumption that the pixel format in one's bitmap object needs to match the format and layout of the display device.
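 In other words, roughly an interface of this shape (the names and signature here are hypothetical, just to show that the draw call owns the pixel-format conversion):

  /* The caller hands over a bitmap in whatever format it likes; the
     driver converts to the window/display format while copying, so
     the application never assumes the hardware pixel layout. */
  typedef struct {
      int   width, height;
      int   format;           /* e.g. FMT_RGB555, FMT_INDEX8, ...       */
      void *pixels;
  } Bitmap;

  typedef struct DrawContext DrawContext;   /* analogous to a GDI HDC   */

  int DrawBitmapToDC(DrawContext *dc, const Bitmap *bmp, int dst_x, int dst_y);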
  Never mind that, for GUI-like operation, programs/windows were mostly operating in hi-color, with stuff being awkwardly converted to 256 color during window-stack redraw. Granted, I'm pretty sure old-style Windows didn't work this way, and per-window framebuffers eat a lot of RAM (note that the shell had tabs, but all the tabs share a single window framebuffer; rather, each has a separate character-cell buffer, and the cells are redrawn to the window buffer either when switching tabs or when more text is printed).
 Had considered option for 256-color or 16 color window buffers (to save RAM), but haven't done so yet (for now, if drawing a 16 or 256 color bitmap, it is internally converted to hi-color). More likely, would switch to 256 color window buffers if using a 256 color output mode (so conversion to 256 color would be handled when drawing the bitmap to the window, rather than later in the process).
  Well, I guess it is a similar sort of wonk that the internal audio mixing uses 16-bit PCM, whereas the output is A-law (for the hardware loop buffer). But, a case could be made for doing the OS-level audio mixing as Binary16.
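 The 16-bit PCM to A-law step is just standard G.711 companding, along the lines of the classic reference encoder:

  #include <stdint.h>

  /* 16-bit linear PCM -> 8-bit G.711 A-law. */
  uint8_t pcm16_to_alaw(int16_t pcm)
  {
      static const int16_t seg_end[8] =
          { 0x1F, 0x3F, 0x7F, 0xFF, 0x1FF, 0x3FF, 0x7FF, 0xFFF };
      int16_t val  = (int16_t)(pcm >> 3);   /* A-law uses 13-bit samples  */
      uint8_t mask = 0xD5;                  /* sign/toggle bits, positive */
      int     seg;
      uint8_t aval;

      if (val < 0) { mask = 0x55; val = (int16_t)(-val - 1); }

      for (seg = 0; seg < 8; seg++)         /* find the segment (exponent) */
          if (val <= seg_end[seg]) break;
      if (seg >= 8) return (uint8_t)(0x7F ^ mask);

      aval = (uint8_t)(seg << 4);
      aval |= (seg < 2) ? ((val >> 1) & 0x0F) : ((val >> seg) & 0x0F);
      return (uint8_t)(aval ^ mask);
  }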
  Either way, longer term future of my project is uncertain...
 And, unclear if a success or failure.
 It mostly did about as well as I could expect.
 Never did achieve my original goal of "fast enough to run Quake at decent framerates", partly because my younger self didn't realize everything would be stuck at 50 MHz (or that 75 or 100 MHz cores would end up needing to be comparably anemic scalar RISC cores, which still can't really get good framerates in Quake, *).
 *: A 100 MHz RV64G core still will not make Quake fast. It helps even less if one also needs to use smaller L1 caches and generate stalls on memory RAW hazards, ...
  Sadly, the closest I have gotten thus far involves GLQuake and a hardware rasterizer module. And then there is the problem that performance is limited more by trying to get geometry processed and fed into the module than by its ability to walk edges and rasterize stuff. Would seemingly ideally need something that could also perform transforms and deal with perspective-correct texture filtering (vs CPU-side transforms and dynamic subdivision + affine texture filtering).
 ...
 
The execution time of each is the same, and the main cost is the fence
synchronizing the Load Store Queue with the cache, flushing the cache
comms queue and waiting for all outstanding cache ops to finish.
>
One other thing to be aware of is that the StoreLoad barrier needed
for sequential consistency is logically part of an LDAR, not part of a
STLR. This is an optimization, because the purpose of a StoreLoad in
that situation is to prevent you from seeing your own stores to a
location before everyone else sees them.
>
Andrew.
 
Humm... It makes me think: does an atomic RMW have implied membars, or are they completely separate, akin to the SPARC membar instruction? LOCK'ed RMWs on Intel (the XCHG instruction aside, wrt its implied LOCK prefix), well, they are StoreLoad barriers! Shit.
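In C11 terms, at least, the ordering is a per-operation parameter on the RMW rather than something always implied, so one can go either way:

  #include <stdatomic.h>

  _Atomic int word;

  void rmw_ordering_examples(void)
  {
      /* Ordering kept separate, SPARC-membar style: a relaxed RMW
         followed by an explicit full fence. */
      atomic_exchange_explicit(&word, 1, memory_order_relaxed);
      atomic_thread_fence(memory_order_seq_cst);

      /* Or baked into the RMW itself. On x86 the seq_cst form maps to a
         plain LOCK'ed instruction (or XCHG, via its implied LOCK), which
         is already a full barrier, StoreLoad included, so no separate
         fence instruction needs to be emitted. */
      atomic_exchange_explicit(&word, 1, memory_order_seq_cst);
  }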

