Newsportal USENET - Re: Cost of handling misaligned access

On 2/2/2025 7:51 PM, MitchAlsup1 wrote:

On Sun, 2 Feb 2025 22:45:53 +0000, BGB wrote:

On 2/2/2025 10:45 AM, EricP wrote:
As you can see in the article below, the cost of NOT handling misaligned
accesses in hardware is quite high in cpu clocks.
>
To my eye, the incremental cost of adding hardware support for misaligned
accesses to the AGU and cache data path should be quite low. The alignment
shifter is basically the same: assuming a 64-byte cache line, LD still has to
shift any of the 64 bytes into position 0, and reverse for ST.
>
The incremental cost is in a sequencer in the AGU for handling cache
line and possibly virtual page straddles, and a small byte shifter to
left shift the high order bytes. The AGU sequencer needs to know if the
access straddles a page boundary. If not, it increments the 6-bit physical
line number within the 4 kB physical frame; if yes, it increments the
virtual page number, does the TLB lookup again, and accesses the first line
of that page.
(Slightly more if multiple page sizes are supported, but same idea.)
For a load, the AGU merges the low and high fragments and forwards.
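
As a rough software model of the checks such a sequencer would make, here is a minimal C sketch. It assumes the 64-byte line and 4 kB page mentioned above; the function names are made up for illustration and do not come from any real design.

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES 64u    /* cache line size assumed in the post */
#define PAGE_BYTES 4096u  /* page size assumed in the post */

/* True if the access [addr, addr+size) touches two cache lines (size >= 1). */
static bool straddles_line(uint64_t addr, unsigned size)
{
    return (addr / LINE_BYTES) != ((addr + size - 1) / LINE_BYTES);
}

/* True if the access also crosses a page, so a second TLB lookup is needed. */
static bool straddles_page(uint64_t addr, unsigned size)
{
    return (addr / PAGE_BYTES) != ((addr + size - 1) / PAGE_BYTES);
}

/* Number of bytes that come from the first (low) line fragment. */
static unsigned low_fragment_bytes(uint64_t addr, unsigned size)
{
    unsigned room = LINE_BYTES - (unsigned)(addr % LINE_BYTES);
    return size < room ? size : room;
}
```

For example, an 8-byte access at address 60 straddles a line and takes 4 bytes from the low fragment, while the same access at address 56 stays within one line.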
>
I don't think there are line straddle consequences for coherence because
there are no ordering guarantees for misaligned accesses.
>
>
IMO, the main costs of unaligned access in hardware:
   Cache may need two banks of cache lines
     (let's call them "even" and "odd");
   an access crossing a line boundary may need both an even and an odd line;
   slightly more expensive extract and insert logic.
>
The main costs of not having unaligned access in hardware:
   Code either faults or performs like dog crap;
   Some pieces of code need convoluted workarounds;
   Some algorithms have no choice other than to perform like crap.
>
>
Even if most of the code doesn't need unaligned access, the parts that
do need it need it badly in order to perform well.
>
Well, at least excluding wonk in the ISA, say:
   a load/store pair that discards the low-order bits;
   an extract/insert instruction that operates on a register pair using the
   low-order bits of the pointer.
>
In effect, something vaguely akin (AFAIK) to what existed on the DEC Alpha.
>
>
The hardware cost appears trivial, especially within an OoO core.
So there doesn't appear to be any reason to not handle this.
Am I missing something?
>
>
For an OoO core, any cost difference in the L1 cache here is likely to
be negligible.
>
>
For anything much bigger than a small microcontroller, I would assume
the core should be designed to handle unaligned access efficiently.
>
>
https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/
>
[about half way down]
>
"Before accessing cache, load addresses have to be checked against
older stores (and vice versa) to ensure proper ordering. If there is a
dependency, P550 can only do fast store forwarding if the load and store
addresses match exactly and both accesses are naturally aligned.
Any unaligned access, dependent or not, confuses P550 for hundreds of
cycles. Worse, the unaligned loads and stores don’t proceed in parallel.
An unaligned load takes 1062 cycles, an unaligned store takes
741 cycles, and the two together take over 1800 cycles.
>
This terrible unaligned access behavior is atypical even for low power
cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of
dependent accesses that are both misaligned.
>
Digging deeper with performance counters reveals executing each unaligned
load instruction results in ~505 executed instructions. P550 almost
certainly doesn’t have hardware support for unaligned accesses.
Rather, it’s likely raising a fault and letting an operating system
handler emulate it in software."
>
>
An emulation fault, or something similarly nasty...
>
>
At that point, even turning any potentially unaligned load or store into
a runtime call is likely to be a lot cheaper.
>
Say:
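   __mem_ld_unaligned:
     ANDI X15, X10, 7        # X15 = byte offset within the 8-byte word
     BEQ  X15, X0, .aligned  # offset 0: plain aligned load
     SUB  X14, X10, X15      # X14 = address rounded down to 8 bytes
     LD   X12, 0(X14)        # low aligned doubleword
     LD   X13, 8(X14)        # high aligned doubleword
     SLLI X14, X15, 3        # X14 = offset in bits
     LI   X17, 64
     SUB  X16, X17, X14      # X16 = 64 - offset in bits
     SRL  X12, X12, X14      # low fragment down to bit 0
     SLL  X13, X13, X16      # high fragment up into place
     OR   X10, X12, X13      # merge fragments
     RET
     .aligned:
     LD   X10, 0(X10)
     RET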
>
The separate aligned case is needed because, with a zero offset, the SLL by 64
would simply give back its input (since (64&63)==0), causing the merge to break.
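
The same idea can be sketched in C. This is only an illustration of the two-aligned-loads-plus-shifts technique, assuming a little-endian machine; the function name is hypothetical, and like the assembly it reads the following aligned word, so up to 7 bytes past the value must be readable.

```c
#include <stdint.h>

/* Load 64 bits from a possibly misaligned address using only aligned
   64-bit loads (little-endian assumed; name is illustrative only). */
static uint64_t load_u64_via_aligned(const void *p)
{
    uintptr_t addr = (uintptr_t)p;
    unsigned  off  = (unsigned)(addr & 7);           /* byte offset in the word */
    const uint64_t *base = (const uint64_t *)(addr - off);

    if (off == 0)            /* avoids the undefined shift by 64 below */
        return base[0];

    unsigned sh = off * 8;   /* offset in bits, 8..56 */
    return (base[0] >> sh) | (base[1] << (64 - sh));
}
```

In C the aligned special case is not optional: shifting a 64-bit value by 64 is undefined behaviour, which mirrors the (64&63)==0 problem in the assembly.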
>
>
Though not supported by GCC or similar, dedicated __aligned and
__unaligned keywords could help here, to specify which pointers are
aligned (no function call), unaligned (needs function call) and default
(probably aligned).
All of which vanish when the HW does misaligned accesses.
{{It makes the job of the programmer easier}}

Yeah, hence why I do it in hardware...
I like being able to do semi-fast Huffman or Rice bitstreams, or fast LZ77 decompressors...
Otherwise, the Huffman and Rice bitstream code is invariably slower as it needs conditionals to load and shift-in bytes.
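
To illustrate why unaligned loads help here, below is a minimal C sketch of a branch-free bit-buffer refill. It is a generic, well-known technique and not the code from my decoders; it assumes little-endian byte order, LSB-first bit packing, and at least 8 readable padding bytes after the input. All names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    const uint8_t *ptr;   /* next unread input byte */
    uint64_t       bits;  /* bit buffer, next bit to consume is bit 0 */
    unsigned       count; /* number of valid bits in 'bits' */
} BitReader;

/* Top up the buffer to 56..63 valid bits with one unaligned 8-byte load,
   with no per-byte conditionals. */
static void br_refill(BitReader *br)
{
    uint64_t next;
    memcpy(&next, br->ptr, 8);            /* unaligned load */
    br->bits |= next << br->count;        /* append above the valid bits */
    br->ptr  += (63 - br->count) >> 3;    /* whole bytes now accounted for */
    br->count |= 56;                      /* count becomes 56 + (count & 7) */
}

/* Take n bits (1..32); the caller must have refilled so that n <= count. */
static uint32_t br_get(BitReader *br, unsigned n)
{
    uint32_t v = (uint32_t)(br->bits & ((1ULL << n) - 1));
    br->bits >>= n;
    br->count -= n;
    return v;
}
```

Without unaligned loads, the refill instead becomes a loop or a chain of conditionals that shifts in one byte at a time.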
And LZ is slower because all the copying needs to be done one byte at a time, rather than 8 or 16 bytes at a time.
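
A minimal C sketch of the 8-bytes-at-a-time copy, for illustration only. It assumes the output buffer has at least 7 bytes of slack past the end of the copy (and the source 7 readable bytes past its end), and that source and destination are at least 8 bytes apart; the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy len bytes in 8-byte steps. May read and write up to 7 bytes past
   the requested length, so the caller must provide slack. For an LZ match
   copy (dst > src) this is only valid when dst - src >= 8. */
static void copy8_wild(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i += 8) {
        uint64_t chunk;
        memcpy(&chunk, src + i, 8);   /* unaligned 8-byte load */
        memcpy(dst + i, &chunk, 8);   /* unaligned 8-byte store */
    }
}
```

Compilers typically turn each fixed-size memcpy into a single load or store on targets that allow misaligned access; on targets that do not, the same source falls back to byte operations.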
Well, and then there is part of the "magic" of my RP2 decompressor (which can beat LZ4 decode performance on my ISA): in more cases it allows a "copy N bytes and ask questions later" strategy. LZ4, by contrast, has annoyances such as "read a unary-FF coded length": the lengths need to be decoded before one can do the literal copy (with the match distance coming *after* the literal bytes), and the nybbles need to be checked for 15 to know whether unary-coded lengths need to be read in (which essentially requires a loop), all before one can copy the bytes.
The unary-coded match format could pose an issue, but while one could in theory write (in XG1 and XG2):
   //R4=src, R5=dst, R6=read tag (QW)
   NOT R6, R7
   CTZ.Q R7, R7 //count trailing zeroes
   BRA.L R7
   BRA32 .Case0 //xxxx0
   BRA32 .Case1 //xxx01
   BRA32 .Case2 //xx011
   ...
It is not really any better than the more naive:
   TST 1, R6
   BT   .Case0
   TST 2, R6
   BT   .Case1
   TST 4, R6
   BT   .Case2
   ...
Where the first 2 cases are far more common than the later cases (though, reaching Case2 will exceed the cost of the former strategy, as by this point we will have 2 branch misses; using a BRA.L will always have a cache miss).
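
In C terms, the NOT + CTZ pair just counts the trailing 1 bits of the tag, and that count is the case number. A minimal portable sketch (the function name is made up; a real decoder would more likely use a compiler intrinsic for the count):

```c
#include <stdint.h>

/* Case number = number of trailing 1 bits in the tag word:
   xxxx0 -> 0, xxx01 -> 1, xx011 -> 2, ... */
static unsigned tag_case(uint64_t tag)
{
    unsigned n = 0;
    while (n < 64 && (tag & 1)) {
        tag >>= 1;
        n++;
    }
    return n;
}
```

A `switch (tag_case(tag))` then corresponds to the jump-table form, while an if/else chain testing bit 0, bit 1, bit 2 in turn corresponds to the TST/BT form.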
In Case0 (most common, raw=0..7, match=3..10, dist=1..511) one can do something like:
   .Case0:
   //unpack fields and calculate working addresses (mostly)
   SHAD.Q R6, -1, R16 | ADD R4, 2, R19
   SHAD.Q R6, -4, R17 | SHAD.Q R6, -7, R18
   AND R16, 7, R16 | MOV.Q (R19, 0), R20
   AND R17, 7, R17 | AND R18, 511, R18
   SUB R5, R18, R23 | ADD R19, R16, R4
   ADD R5, R16, R22 | ADD R17, 3, R17
   //deal with self-overlap
   CMPGT 15, R18 //is dist>15 ?
   BF    .Case0_SloCopy //copy as bytes
   //copy magic
   MOV.Q (R23, 0), R20
   MOV.Q (R23, 8), R21
   MOV.Q R20, (R5, 0)
   ADD R22, R17, R5
   MOV.Q R20, (R22, 0)
   MOV.Q R21, (R22, 8)
   BRA .nextmatch //match copy done
Where:
   Case 1 was raw=0..7, match=4..67, dist=1..8191;
   Case 2 was raw=0..7, match=4..515, dist=1..131071;
   Case 3 was 8..128 raw bytes;
   Case 4 was for longer/deeper matches, but mostly unused;
   Case 5 was for 1..3 raw bytes and EOB;
   Case 6 was for up to 4K of raw bytes.
My existing implementations mostly set Len=515, Dist=128K, as the final limits. The latter cases were increasingly infrequent (I had put them in more or less probability order).
In all of the in-use schemes, the values are packed into bit-fields within the tag word (where the exact size of the tag word depends on which case it is).
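
As an illustration of the bit-field packing, here is a C sketch of the field unpacking for the most common case. The layout is read off the shifts and masks in the assembly above (1 case bit, 3 bits raw length, 3 bits match length, 9 bits distance, 16 bits total); the struct and function names are hypothetical, and this is not taken from the actual RP2 source.

```c
#include <stdint.h>

typedef struct {
    unsigned raw_len;    /* literal bytes to copy, 0..7 */
    unsigned match_len;  /* match length, 3..10 */
    unsigned dist;       /* match distance field, 9 bits */
} Case0Fields;

/* Unpack the 16-bit tag for the short-match case (bit 0 is the case bit). */
static Case0Fields decode_case0(uint16_t tag)
{
    Case0Fields f;
    f.raw_len   = (tag >> 1) & 7;
    f.match_len = ((tag >> 4) & 7) + 3;
    f.dist      = (tag >> 7) & 511;
    return f;
}
```

Because all three fields come out of a single 2-byte tag with fixed shifts, there is no loop or data-dependent branch before the copies can start.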
For the shorter matches, the fastest case is to copy as bytes, whereas for longer matches it is better to try to turn it into a pattern fill.
Not really going to explain it, but... I will assert that this is faster than the equivalent fastest-case paths in an LZ4 decoder, and that the shortest case path is also the most common path (in this case, it is also 2 bytes rather than 3, which is part of how RP2 beats LZ4's compression in many cases).
However, I can note that LZ4 is generally faster on x86-64 (though both can still operate in GB/sec territory); this is with a plain C implementation.
I mostly stuck with LZ4 for binaries, as it seemed to compress them better, whereas RP2 generally did better with general-purpose data.

....
