Re: Cost of handling misaligned access

Subject: Re: Cost of handling misaligned access
From: cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups: comp.arch
Date: 03 Feb 2025, 08:31:06
Organization: A noiseless patient Spider
Message-ID: <vnprbu$156vd$1@dont-email.me>
User-Agent: Mozilla Thunderbird
On 2/2/2025 7:51 PM, MitchAlsup1 wrote:
On Sun, 2 Feb 2025 22:45:53 +0000, BGB wrote:
 
On 2/2/2025 10:45 AM, EricP wrote:
As you can see in the article below, the cost of NOT handling misaligned
accesses in hardware is quite high in cpu clocks.
>
To my eye, the incremental cost of adding hardware support for misaligned
to the AGU and cache data path should be quite low. The alignment shifter
is basically the same: assuming a 64-byte cache line, LD still has to
shift any of the 64 bytes into position 0, and reverse for ST.
>
The incremental cost is in a sequencer in the AGU for handling cache
line and possibly virtual page straddles, and a small byte shifter to
left shift the high order bytes. The AGU sequencer needs to know if the
line straddles a page boundary, if not then increment the 6-bit physical
line number within the 4 kB physical frame number, if yes then increment
virtual page number and TLB lookup again and access the first line.
(Slightly more if multiple page sizes are supported, but same idea.)
For a load AGU merges the low and high fragments and forwards.
>
I don't think there are line straddle consequences for coherence because
there is no ordering guarantees for misaligned accesses.
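
The straddle checks the sequencer has to make can be illustrated with a minimal C sketch. It assumes the 64-byte line and 4 kB page sizes mentioned above; the function names are hypothetical and not taken from any real design.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64u    /* assumed cache line size */
#define PAGE_BYTES 4096u  /* assumed page size */

/* True if an access of `size` bytes starting at `addr` spans two cache lines,
   i.e. a second line has to be fetched and merged. */
static bool straddles_line(uint64_t addr, unsigned size)
{
    assert(size >= 1 && size <= LINE_BYTES);
    return (addr % LINE_BYTES) + size > LINE_BYTES;
}

/* True if the access also crosses into the next page, so the high fragment
   needs a second TLB lookup instead of a simple line-number increment. */
static bool straddles_page(uint64_t addr, unsigned size)
{
    assert(size >= 1 && size <= LINE_BYTES);
    return (addr % PAGE_BYTES) + size > PAGE_BYTES;
}
```

An 8-byte access at offset 60 within a line straddles the line but usually not the page; only the last line of a page can trigger the second case.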
>
>
IMO, the main costs of unaligned access in hardware:
   The cache may need two banks of cache lines
     (let's call them "even" and "odd");
   An access crossing a line boundary may need both an even and an odd line;
   Slightly more expensive extract and insert logic.
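
As a rough software model of the even/odd idea (a sketch only, not a description of any particular cache): lines N and N+1 always have opposite parity, so they sit in different banks and can be read together, after which the value is extracted from the combined window. The 16-byte line size and the function name are assumptions made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 16u  /* small line size, chosen only for illustration */

/* Model of a 64-bit load through two banks. `mem` must have at least one
   full line readable after the line containing `addr`. */
static uint64_t load_u64_two_banks(const uint8_t *mem, size_t addr)
{
    size_t line = addr / LINE_BYTES;
    size_t off  = addr % LINE_BYTES;
    uint8_t window[2 * LINE_BYTES];
    uint64_t v;

    assert(off < LINE_BYTES);
    /* One of these two lines is even and the other odd, so in hardware
       both reads could be issued in the same cycle. */
    memcpy(window, mem + line * LINE_BYTES, LINE_BYTES);
    memcpy(window + LINE_BYTES, mem + (line + 1) * LINE_BYTES, LINE_BYTES);

    /* The "extract" step: pick 8 bytes starting at the in-line offset. */
    memcpy(&v, window + off, sizeof v);
    return v;
}
```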
>
The main costs of not having unaligned access in hardware:
   Code either faults or performs like dog crap;
   Some pieces of code need convoluted workarounds;
   Some algorithms have no choice other than to perform like crap.
>
>
Even if most of the code doesn't need unaligned access, the parts that
do need it, significantly need it to perform well.
>
Well, at least excluding wonk in the ISA, say:
A load/store pair that discards the low-order bits;
An extract/insert instruction that operates on a register pair using the
low-order bits of the pointer.
>
In effect, something vaguely akin (AFAIK) to what existed on the DEC Alpha.
>
>
The hardware cost appears trivial, especially within an OoO core.
So there doesn't appear to be any reason to not handle this.
Am I missing something?
>
>
For an OoO core, any cost difference in the L1 cache here is likely to
be negligible.
>
>
For anything much bigger than a small microcontroller, I would assume
designing a core that handles unaligned access effectively.
>
>
https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/
>
[about half way down]
>
"Before accessing cache, load addresses have to be checked against
older stores (and vice versa) to ensure proper ordering. If there is a
dependency, P550 can only do fast store forwarding if the load and store
addresses match exactly and both accesses are naturally aligned.
Any unaligned access, dependent or not, confuses P550 for hundreds of
cycles. Worse, the unaligned loads and stores don’t proceed in parallel.
An unaligned load takes 1062 cycles, an unaligned store takes
741 cycles, and the two together take over 1800 cycles.
>
This terrible unaligned access behavior is atypical even for low power
cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of
dependent accesses that are both misaligned.
>
Digging deeper with performance counters reveals executing each unaligned
load instruction results in ~505 executed instructions. P550 almost
certainly doesn’t have hardware support for unaligned accesses.
Rather, it’s likely raising a fault and letting an operating system
handler emulate it in software."
>
>
An emulation fault, or something similarly nasty...
>
>
At that point, even turning any potentially unaligned load or store into
a runtime call is likely to be a lot cheaper.
>
Say:
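   __mem_ld_unaligned:
     ANDI  X15, X10, 7        # byte offset within the aligned 8-byte word
     BEQ   X15, X0, .aligned  # offset 0: plain aligned load
     SUB   X14, X10, X15      # round the address down to 8 bytes
     LD    X12, 0(X14)        # low word
     LD    X13, 8(X14)        # high word
     SLLI  X14, X15, 3        # shift = offset * 8 bits
     LI    X17, 64
     SUB   X16, X17, X14      # 64 - shift
     SRL   X12, X12, X14      # low bytes down to bit 0
     SLL   X13, X13, X16      # high bytes up above them
     OR    X10, X12, X13
     RET
     .aligned:
     LD    X10, 0(X10)
     RET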
>
The aligned case is handled separately because, with an offset of 0, the
SLL by 64 would simply return its input (since (64&63)==0), so the OR
would merge in the whole high word and give the wrong result.
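
For comparison, roughly the same helper in C, as a minimal sketch. It works on an array of aligned 64-bit words plus a byte offset (a hypothetical interface chosen to sidestep pointer-aliasing issues) and produces the value a little-endian machine would load. The aligned special case matters here too, since shifting a 64-bit value by 64 is undefined behavior in C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load 64 bits starting `byte_off` bytes into `words`, using only aligned
   word accesses. When misaligned, words[idx + 1] must be readable. */
static uint64_t ld_unaligned_le(const uint64_t *words, size_t byte_off)
{
    size_t   idx   = byte_off / 8;                     /* aligned word index */
    unsigned shift = (unsigned)(byte_off % 8) * 8;     /* misalignment, in bits */

    assert(shift < 64);
    if (shift == 0)
        return words[idx];  /* aligned: avoids the shift by 64 below */

    uint64_t lo = words[idx];
    uint64_t hi = words[idx + 1];
    return (lo >> shift) | (hi << (64 - shift));
}
```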
>
>
Though not supported by GCC or similar, dedicated __aligned and
__unaligned keywords could help here, to specify which pointers are
aligned (no function call), unaligned (needs function call) and default
(probably aligned).
 All of which vanish when the HW does misaligned accesses.
{{It makes the job of the programmer easier}}
 
Yeah, hence why I do it in hardware...
I like being able to do semi-fast Huffman or Rice bitstreams, or fast LZ77 decompressors...
Otherwise, the Huffman and Rice bitstream code is invariably slower as it needs conditionals to load and shift-in bytes.
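
To illustrate what unaligned loads buy a bitstream reader, here is a minimal C sketch of an LSB-first reader that refetches its window with one unaligned 64-bit load per read, with no per-byte conditionals. It assumes a little-endian host and at least 8 readable bytes of padding after the end of the stream; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const uint8_t *buf;  /* start of the bitstream */
    size_t bitpos;       /* absolute position in bits, LSB-first */
} BitReader;

/* Return the next n bits (1..32) without consuming them. */
static uint32_t peek_bits(const BitReader *br, unsigned n)
{
    uint64_t w;
    assert(n >= 1 && n <= 32);
    /* Unaligned load at the current byte; memcpy is the portable C spelling. */
    memcpy(&w, br->buf + (br->bitpos >> 3), sizeof w);
    w >>= br->bitpos & 7;
    return (uint32_t)(w & ((1ull << n) - 1));
}

/* Return the next n bits and advance past them. */
static uint32_t read_bits(BitReader *br, unsigned n)
{
    uint32_t v = peek_bits(br, n);
    br->bitpos += n;
    return v;
}
```

Without unaligned access, the load has to be replaced by a loop that conditionally shifts in one byte at a time, which is the slowdown described above.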
And LZ is slower because all the copying needs to be done one byte at a time, rather than 8 or 16 bytes at a time.
Well, and then there is part of the "magic" of my RP2 decompressor (which can beat LZ4 decode performance on my ISA): in more cases it allows a "copy N bytes and ask questions later" strategy. LZ4, by contrast, has annoyances: lengths may be "unary-FF" coded; the lengths must be decoded before one can do the literal copy (with the match distance coming *after* the literal bytes); and the nybbles must be checked for 15 to know whether extra length bytes need to be read in (which essentially requires a loop) before one can copy the bytes.
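
The LZ4 length decoding in question looks roughly like this in C. This is a minimal sketch of the block-format rule (a nibble of 15 means "keep adding following bytes until one is not 255"), with no bounds checking; the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode an LZ4 length field. `nibble` is the 4-bit value from the token;
   *src points at the bytes following it and is advanced past any
   extension bytes that get consumed. */
static size_t lz4_read_length(const uint8_t **src, unsigned nibble)
{
    size_t len = nibble;
    assert(nibble <= 15);
    if (nibble == 15) {
        uint8_t b;
        do {
            b = *(*src)++;   /* each extension byte adds 0..255 */
            len += b;
        } while (b == 255);  /* 255 means another byte follows */
    }
    return len;
}
```

The data-dependent loop has to finish before the decoder even knows where the literals end and the match offset begins.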
The unary-coded match format could pose an issue, but while one could in theory write (in XG1 and XG2):
   //R4=src, R5=dst, R6=read tag (QW)
   NOT    R6, R7
   CTZ.Q  R7, R7  //count trailing zeroes
   BRA.L  R7
   BRA32  .Case0  //xxxx0
   BRA32  .Case1  //xxx01
   BRA32  .Case2  //xx011
   ...
It is not really any better than the more naive:
   TST  1, R6
   BT   .Case0
   TST  2, R6
   BT   .Case1
   TST  4, R6
   BT   .Case2
   ...
Here the first 2 cases are far more common than the later cases. Reaching Case2 will exceed the cost of the former strategy, though, as by that point we will have taken 2 branch misses, whereas a BRA.L will always take one branch miss.
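
In C terms, both dispatch strategies compute the same thing: the case index is the number of consecutive 1 bits at the bottom of the tag. A minimal portable sketch (the function name is hypothetical; on GCC or Clang the loop could be replaced with `__builtin_ctzll(~tag)`):

```c
#include <assert.h>
#include <stdint.h>

/* Case index for a tag: xxxx0 -> 0, xxx01 -> 1, xx011 -> 2, and so on. */
static unsigned tag_case(uint64_t tag)
{
    uint64_t inv = ~tag;   /* trailing ones become trailing zeroes */
    unsigned n = 0;

    assert(inv != 0);      /* an all-ones tag has no terminating 0 bit */
    while ((inv & 1) == 0) {
        inv >>= 1;
        n++;
    }
    return n;
}
```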
In Case0 (most common, raw=0..7, match=3..10, dist=1..511) one can do something like:
   .Case0:
   //unpack fields and calculate working addresses (mostly)
   SHAD.Q R6, -1, R16  |  ADD  R4, 2, R19
   SHAD.Q R6, -4, R17  |  SHAD.Q R6, -7, R18
   AND    R16, 7, R16  |  MOV.Q  (R19, 0), R20
   AND    R17, 7, R17  |  AND    R18, 511, R18
   ADD    R5, R16, R22 |  ADD    R19, R16, R4
   SUB    R22, R18, R23 |  ADD    R17, 3, R17
   //deal with self-overlap
   CMPGT  15, R18  //is dist>15 ?
   BF     .Case0_SloCopy  //copy as bytes
   //copy magic
   MOV.Q  R20, (R5, 0)   //store raw bytes before R20 is reused
   MOV.Q  (R23, 0), R20  //match source is relative to the end of the raw bytes
   MOV.Q  (R23, 8), R21
   ADD    R22, R17, R5
   MOV.Q  R20, (R22, 0)
   MOV.Q  R21, (R22, 8)
   BRA    .nextmatch  //match copy done
Where:
   Case 1 was raw=0..7, match=4..67, dist=1..8191;
   Case 2 was raw=0..7, match=4..515, dist=1..131071;
   Case 3 was 8..128 raw bytes;
   Case 4 was for longer/deeper matches, but mostly unused;
   Case 5 was for 1..3 raw bytes and EOB;
   Case 6 was for up to 4K of raw bytes.
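
The Case 0 path can be sketched in C as follows. The field layout (raw in bits 1..3, match length minus 3 in bits 4..6, distance in bits 7..15 of a 16-bit little-endian tag) is inferred from the shift amounts in the assembly above and should be treated as an assumption, as should the function name. Both buffers are assumed to have at least 16 bytes of slack past the current position.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode one Case 0 token: 0..7 raw bytes, then a match of 3..10 bytes
   at distance 1..511. Advances *src and *dst. */
static void decode_case0(const uint8_t **src, uint8_t **dst)
{
    const uint8_t *s = *src;
    uint8_t *d = *dst;
    unsigned tag  = s[0] | (s[1] << 8);
    unsigned raw  = (tag >> 1) & 7;
    unsigned len  = ((tag >> 4) & 7) + 3;
    unsigned dist = (tag >> 7) & 511;

    assert((tag & 1) == 0 && dist >= 1);

    /* "Copy N bytes and ask questions later": always copy 8 raw bytes,
       then advance by the number actually wanted. */
    memcpy(d, s + 2, 8);
    s += 2 + raw;
    d += raw;

    const uint8_t *m = d - dist;
    if (dist > 15) {
        memcpy(d, m, 16);  /* no overlap possible: over-copy 16 bytes */
    } else {
        for (unsigned i = 0; i < len; i++)  /* self-overlap: byte copy */
            d[i] = m[i];
    }
    *src = s;
    *dst = d + len;
}
```

The fast path has no loop and no length-dependent branch; the over-copied bytes are simply overwritten by the next token.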
My existing implementations mostly set Len=515, Dist=128K, as the final limits. The latter cases were increasingly infrequent (I had put them in more or less probability order).
In all of the in-use schemes, the values are packed into bit-fields within the tag word (where the exact size of the tag word depends on which case it is).
For the shorter matches, the fastest case is to copy as bytes, whereas for longer matches it is better to try to turn it into a pattern fill.
Not really going to explain it, but... I will assert that this is faster than the equivalent fastest-case paths in an LZ4 decoder, and that the shortest case path is also the most common path (in this case, it is also 2 bytes rather than 3, which is part of how RP2 beats LZ4's compression in many cases).
That said, LZ4 is generally faster on x86-64 (though both can generally still operate in GB/sec territory), at least with a plain C implementation.
I mostly stuck with LZ4 for binaries, though, as it seemed to compress them better, whereas RP2 generally did better at general-purpose data compression (but not as well with binaries).

....
