On 2/3/2025 2:34 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 2/3/2025 12:55 AM, Anton Ertl wrote:
Rather, have something like an explicit "__unaligned" keyword or
similar, and then use the runtime call for these pointers.
There are people who think that it is ok to compile *p to anything if
p is not aligned, even on architectures that support unaligned
accesses. At least one of those people recommended the use of
memcpy(..., ..., sizeof(...)). Let's see what gcc produces on
rv64gc (where unaligned accesses are guaranteed to work):
[fedora-starfive:/tmp:111378] cat x.c
#include <string.h>
long uload(long *p)
{
  long x;
  memcpy(&x,p,sizeof(long));
  return x;
}
[fedora-starfive:/tmp:111379] gcc -O -S x.c
[fedora-starfive:/tmp:111380] cat x.s
.file "x.c"
.option nopic
.text
.align 1
.globl uload
.type uload, @function
uload:
addi sp,sp,-16
lbu t1,0(a0)
lbu a7,1(a0)
lbu a6,2(a0)
lbu a1,3(a0)
lbu a2,4(a0)
lbu a3,5(a0)
lbu a4,6(a0)
lbu a5,7(a0)
sb t1,8(sp)
sb a7,9(sp)
sb a6,10(sp)
sb a1,11(sp)
sb a2,12(sp)
sb a3,13(sp)
sb a4,14(sp)
sb a5,15(sp)
ld a0,8(sp)
addi sp,sp,16
jr ra
.size uload, .-uload
.ident "GCC: (GNU) 10.3.1 20210422 (Red Hat 10.3.1-1)"
.section .note.GNU-stack,"",@progbits
Oh boy. Godbolt tells me that gcc-14.2.0 still does it the same way,
This isn't really the way I would do it, but, granted, it is the way GCC does it...
I guess one can at least be happy it isn't a call into a copy-slide (here x10 is the destination and x11 the source), say:
__memcpy_8:
    lb x13, 7(x11)
    sb x13, 7(x10)
__memcpy_7:
    lb x13, 6(x11)
    sb x13, 6(x10)
__memcpy_6:
    lb x13, 5(x11)
    sb x13, 5(x10)
__memcpy_5:
    lb x13, 4(x11)
    sb x13, 4(x10)
__memcpy_4:
    lb x13, 3(x11)
    sb x13, 3(x10)
__memcpy_3:
    lb x13, 2(x11)
    sb x13, 2(x10)
__memcpy_2:
    lb x13, 1(x11)
    sb x13, 1(x10)
__memcpy_1:
    lb x13, 0(x11)
    sb x13, 0(x10)
__memcpy_0:
    jr ra
But... then again... In BGBCC for fixed-size "memcpy()":
  memcpy 0..64: will often be generated inline,
    using direct loads/stores, up to 64 bits at a time,
    with smaller accesses for any tail bytes.
  memcpy 96..512: will call into an auto-generated slide
    (for multiples of 32 bytes).
    For non-multiples of 32 bytes, a tail copy is auto-generated for that specific size, which then branches into the slide.
So: __memcpy_512 and __memcpy_480 will go directly into the slide, whereas, say, __memcpy_488 will get a more specialized function that copies 8 bytes and then branches into the slide.
The reason for using multiples of 32 bytes is that this is the minimum copy size that does not suffer interlock penalties.
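Roughly, the size-based selection amounts to something like this (purely illustrative C, not the actual BGBCC code; the printouts stand in for the code generator, and the unspecified 64..96 range is simply lumped in with the slide path here):

#include <stdio.h>
#include <stddef.h>

/* Illustrative only: print which form an n-byte fixed-size memcpy()
   would be lowered to, per the strategy described above. */
static void describe_fixed_memcpy(size_t n)
{
    if (n <= 64) {
        printf("%4zu: inline, 64-bit loads/stores, smaller ops for tail bytes\n", n);
    } else if (n <= 512) {
        if ((n & 31) == 0)
            printf("%4zu: call __memcpy_%zu (enters the slide directly)\n", n, n);
        else
            printf("%4zu: call __memcpy_%zu (per-size tail copy, then branch into the slide)\n", n, n);
    } else {
        printf("%4zu: call the generic memcpy()\n", n);
    }
}

int main(void)
{
    size_t sizes[] = { 16, 64, 480, 488, 492, 512, 1000 };
    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
        describe_fixed_memcpy(sizes[i]);
    return 0;
}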
Say, in XG2 (where R4 is the destination pointer and R5 the source):
__memcpy_512:
    MOV.Q (R5, 480), X20
    MOV.Q (R5, 488), X21
    MOV.Q (R5, 496), X22
    MOV.Q (R5, 504), X23
    MOV.Q X20, (R4, 480)
    MOV.Q X21, (R4, 488)
    MOV.Q X22, (R4, 496)
    MOV.Q X23, (R4, 504)
__memcpy_480:
...
Then, say:
__memcpy_488:
    MOV.Q (R5, 480), X20
    MOV.Q X20, (R4, 480)
    BRA __memcpy_480
__memcpy_496:
    MOV.Q (R5, 480), X20
    MOV.Q (R5, 488), X21
    MOV.Q X20, (R4, 480)
    MOV.Q X21, (R4, 488)
    BRA __memcpy_480
...
Then:
__memcpy_492:
    MOV.L (R5, 488), X20
    MOV.L X20, (R4, 488)
    BRA __memcpy_488
...
For these latter cases, the compiler keeps track of them via bitmaps (with a bit for each size), so that it knows which helper sizes need to be generated.
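Roughly like this, as an illustrative C sketch (the names here are made up, not the actual BGBCC internals):

#include <stdint.h>
#include <stdio.h>

#define MAX_COPY 512

/* One bit per copy size up to 512 bytes; a bit gets set whenever a
   per-size helper (e.g. __memcpy_488) is referenced, and the bitmap
   is scanned afterwards so only the needed helpers get emitted. */
static uint32_t helper_needed[MAX_COPY / 32 + 1];

static void mark_helper_needed(int n)
{
    helper_needed[n >> 5] |= 1u << (n & 31);
}

static void emit_needed_helpers(void)
{
    int n;
    for (n = 1; n <= MAX_COPY; n++) {
        if (helper_needed[n >> 5] & (1u << (n & 31)))
            printf("generate __memcpy_%d\n", n);  /* stand-in for codegen */
    }
}

int main(void)
{
    mark_helper_needed(488);
    mark_helper_needed(492);
    emit_needed_helpers();
    return 0;
}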
In this case, these special functions and slides were the fastest option that also doesn't waste an excessive amount of space (apart from the cost of the slide itself, which is why it only handles copies up to 512 bytes).
I ended up not bothering with special aligned cases, as the cost of detecting whether the copy was aligned was generally more than what was saved by having a separate aligned version.
If bigger than 512 bytes, it calls the generic memcpy function, which generally copies larger chunks of memory (say, 512 bytes at a time) and then uses some smaller loops to clean up whatever is left.
Say, chunk sizes (bytes):
512, 128, 32, 16, 8, 4, 1
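Roughly along these lines, as an illustrative C sketch (the real routine would use unrolled 64-bit and smaller loads/stores per chunk rather than calling memcpy() again):

#include <stddef.h>
#include <string.h>

/* Illustrative chunked copy: work down through the fixed chunk sizes,
   copying as many whole chunks as still fit at each level; the inner
   memcpy() stands in for an unrolled block of loads/stores. */
static void *chunked_memcpy(void *dst, const void *src, size_t n)
{
    static const size_t chunks[] = { 512, 128, 32, 16, 8, 4, 1 };
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    for (i = 0; i < sizeof(chunks) / sizeof(chunks[0]); i++) {
        while (n >= chunks[i]) {
            memcpy(d, s, chunks[i]);
            d += chunks[i];
            s += chunks[i];
            n -= chunks[i];
        }
    }
    return dst;
}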
whereas clang 9.0.0 and following produce
[fedora-starfive:/tmp:111383] clang -O -S x.c
[fedora-starfive:/tmp:111384] cat x.s
.text
.attribute 4, 16
.attribute 5, "rv64i2p0_m2p0_a2p0_f2p0_d2p0_c2p0"
.file "x.c"
.globl uload # -- Begin function uload
.p2align 1
.type uload,@function
uload: # @uload
.cfi_startproc
# %bb.0:
ld a0, 0(a0)
ret
.Lfunc_end0:
.size uload, .Lfunc_end0-uload
.cfi_endproc
# -- End function
.ident "clang version 11.0.0 (Fedora 11.0.0-2.0.riscv64.fc33)"
.section ".note.GNU-stack","",@progbits
.addrsig
If that is frequently used for unaligned p, this will be slow on the
U74 and P550. Maybe SiFive should get around to implementing
unaligned accesses more efficiently.
Granted, yeah, both options kinda suck, depending.
For BGBCC, it will turn an 8-byte memcpy into a 64-bit load and store, but I am assuming my own core, where the generic case for these is 1 to 3 clock cycles, depending mostly on register dependencies.
There was a special case to reduce a natively aligned 64-bit load to a 2-cycle operation, but it isn't free.
There is also a special case to reduce 'ADD' to 1 cycle, which mostly helps with RV64G (and its over-reliance on 1-cycle ADD/ADDI), by having a secondary shadow ALU whose main goal is to try to have the ADD result ready in the EX1 stage (another option exists to tune for 1- or 2-cycle shift operations).
A 2-cycle ADD or shift (or a 3-cycle 'LD') has relatively little effect on BJX2, but seemingly a bigger effect on RV64G (which far more often tries to use the results immediately).
For my own extensions (when compiling via BGBCC), some of the "main offender" cases were addressed, again reducing the impact of having a 2- or 3-cycle latency here.
Though "memcpy()" is usually a "simple to fix up" scenario.
General memcpy where both operands may be unaligned in different ways
is not particularly simple. This also shows up in the fact that Intel
and AMD have failed to make REP MOVSB faster than software approaches
for many cases when I last looked. Supposedly Intel has had another
go at it, I should measure it again.
Possibly; some of the fixup strategies I had seen assumed both operands had the same alignment, but it is presumably harder when the alignments are not equal.
Often, bare byte-copy loops were the worst-case fallback.
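For reference, the same-alignment fixup amounts to something like this (illustrative C only, not any particular library's code; a real implementation would do the word accesses in assembly or with compiler-specific types to sidestep strict-aliasing issues):

#include <stddef.h>
#include <stdint.h>

/* If source and destination are misaligned by the same amount:
   byte-copy a short head until the destination is 8-byte aligned,
   copy the body as 64-bit words, then byte-copy the tail.
   Mismatched alignments fall through to the worst-case byte loop. */
static void *fixup_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((((uintptr_t)d ^ (uintptr_t)s) & 7) == 0) {
        while (n > 0 && ((uintptr_t)d & 7) != 0) {
            *d++ = *s++;
            n--;
        }
        while (n >= 8) {
            *(uint64_t *)(void *)d = *(const uint64_t *)(const void *)s;
            d += 8;
            s += 8;
            n -= 8;
        }
    }
    while (n > 0) {     /* tail, or mismatched alignments */
        *d++ = *s++;
        n--;
    }
    return dst;
}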
- anton