Re: Cost of handling misaligned access

Subject: Re: Cost of handling misaligned access
From: cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups: comp.arch
Date: 03 Feb 2025, 11:13:44
Organization: A noiseless patient Spider
Message-ID: <vnq4st$176b4$1@dont-email.me>
User-Agent: Mozilla Thunderbird

On 2/3/2025 2:34 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 2/3/2025 12:55 AM, Anton Ertl wrote:
Rather, have something like an explicit "__unaligned" keyword or
similar, and then use the runtime call for these pointers.

There are people who think that it is ok to compile *p to anything if
p is not aligned, even on architectures that support unaligned
accesses.  At least one of those people recommended the use of
memcpy(..., ..., sizeof(...)).  Let's see what gcc produces on
rv64gc (where unaligned accesses are guaranteed to work):

[fedora-starfive:/tmp:111378] cat x.c
#include <string.h>

long uload(long *p)
{
   long x;
   memcpy(&x,p,sizeof(long));
   return x;
}
[fedora-starfive:/tmp:111379] gcc -O -S x.c
[fedora-starfive:/tmp:111380] cat x.s
         .file   "x.c"
         .option nopic
         .text
         .align  1
         .globl  uload
         .type   uload, @function
uload:
         addi    sp,sp,-16
         lbu     t1,0(a0)
         lbu     a7,1(a0)
         lbu     a6,2(a0)
         lbu     a1,3(a0)
         lbu     a2,4(a0)
         lbu     a3,5(a0)
         lbu     a4,6(a0)
         lbu     a5,7(a0)
         sb      t1,8(sp)
         sb      a7,9(sp)
         sb      a6,10(sp)
         sb      a1,11(sp)
         sb      a2,12(sp)
         sb      a3,13(sp)
         sb      a4,14(sp)
         sb      a5,15(sp)
         ld      a0,8(sp)
         addi    sp,sp,16
         jr      ra
         .size   uload, .-uload
         .ident  "GCC: (GNU) 10.3.1 20210422 (Red Hat 10.3.1-1)"
         .section        .note.GNU-stack,"",@progbits

Oh boy.  Godbolt tells me that gcc-14.2.0 still does it the same way,

This isn't really the way I would do it but, granted, it is the way GCC does it...
I guess one can at least be happy that it isn't a call into a copy-slide, say:
   __memcpy_8:
     lb x13, 7(x11)
     sb x13, 7(x10)
   __memcpy_7:
     lb x13, 6(x11)
     sb x13, 6(x10)
   __memcpy_6:
     lb x13, 5(x11)
     sb x13, 5(x10)
   __memcpy_5:
     lb x13, 4(x11)
     sb x13, 4(x10)
   __memcpy_4:
     lb x13, 3(x11)
     sb x13, 3(x10)
   __memcpy_3:
     lb x13, 2(x11)
     sb x13, 2(x10)
   __memcpy_2:
     lb x13, 1(x11)
     sb x13, 1(x10)
   __memcpy_1:
     lb x13, 0(x11)
     sb x13, 0(x10)
   __memcpy_0:
     jr ra
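
For anyone who reads C more easily than RISC-V assembly: the slide above behaves roughly like a fall-through switch, where entering at label N copies bytes N-1 down to 0. This is only a sketch; the name copy_slide8 is made up for illustration and is not something GCC or BGBCC emits.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C analogue of the byte copy-slide above.
 * Entering at case N copies bytes N-1 .. 0, then returns.
 * Only sizes 0..8 are handled, like the 8-entry slide. */
static void copy_slide8(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n <= 8);
    switch (n) {
    case 8: dst[7] = src[7]; /* fall through */
    case 7: dst[6] = src[6]; /* fall through */
    case 6: dst[5] = src[5]; /* fall through */
    case 5: dst[4] = src[4]; /* fall through */
    case 4: dst[3] = src[3]; /* fall through */
    case 3: dst[2] = src[2]; /* fall through */
    case 2: dst[1] = src[1]; /* fall through */
    case 1: dst[0] = src[0]; /* fall through */
    default: break;          /* n == 0: nothing left to copy */
    }
}
```

A caller with a compile-time-constant size would, in the assembly version, simply call the matching __memcpy_N label directly.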
But... then again... In BGBCC, for fixed-size "memcpy()":
   memcpy 0..64: will often be generated inline.
     direct loads/stores, up to 64 bits at a time;
     will use smaller accesses for any tail bytes.
   memcpy 96..512: will call into an auto-generated slide
     (for multiples of 32 bytes).
For sizes that are not a multiple of 32 bytes, it will auto-generate a tail copy for that specific size, which then branches into the slide.
So: __memcpy_512 and __memcpy_480 will go directly to the slide.
Whereas, say, __memcpy_488 will get a more specialized function that copies 8 bytes and then branches into the slide.
The reason for using multiples of 32 bytes is that this is the minimum copy size that does not suffer interlock penalties.
Say, in XG2:
   __memcpy_512:
     MOV.Q  (R5, 480), X20
     MOV.Q  (R5, 488), X21
     MOV.Q  (R5, 496), X22
     MOV.Q  (R5, 504), X23
     MOV.Q  X20, (R4, 480)
     MOV.Q  X21, (R4, 488)
     MOV.Q  X22, (R4, 496)
     MOV.Q  X23, (R4, 504)
   __memcpy_480:
     ...
Then, say:
   __memcpy_488:
     MOV.Q  (R5, 480), X20
     MOV.Q  X20, (R4, 480)
     BRA     __memcpy_480
   __memcpy_496:
     MOV.Q  (R5, 480), X20
     MOV.Q  (R5, 488), X21
     MOV.Q  X20, (R4, 480)
     MOV.Q  X21, (R4, 488)
     BRA     __memcpy_480
   ...
Then:
   __memcpy_492:
     MOV.L  (R5, 488), X20
     MOV.L  X20, (R4, 488)
     BRA     __memcpy_488
   ...
It keeps track of these latter cases via bitmaps (one bit per size), so that it knows which sizes need to be generated.
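
A rough sketch of what such a bitmap could look like in C. Everything here (MAX_FIXED, needed, mark_needed, is_needed, request_size) is a hypothetical name, not BGBCC's actual code. It also assumes that a size which is not a multiple of 8 chains into the next lower multiple of 8, which matches the __memcpy_492 example but is a guess for other sizes.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_FIXED 512                         /* largest size served by the slide */
static uint64_t needed[(MAX_FIXED / 64) + 1]; /* one bit per size 0..512 */

static void mark_needed(unsigned size)
{
    assert(size <= MAX_FIXED);
    needed[size >> 6] |= (uint64_t)1 << (size & 63);
}

static int is_needed(unsigned size)
{
    assert(size <= MAX_FIXED);
    return (int)((needed[size >> 6] >> (size & 63)) & 1);
}

/* Record that a call to __memcpy_<size> was emitted. */
static void request_size(unsigned size)
{
    assert(size <= MAX_FIXED);
    if ((size % 32) == 0)
        return;                   /* enters the slide directly, nothing to generate */
    mark_needed(size);
    if (size % 8)
        request_size(size & ~7u); /* assumed: e.g. 492 branches into __memcpy_488 */
}
```

At the end of compilation one would walk the bitmap and emit a tail function for every set bit.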
In this case, these special functions and slides were the fastest option that did not also waste excessive amounts of space (apart from the cost of the slide itself, which is why it only handles copies of up to 512 bytes).
I ended up not bothering with special aligned cases, as the cost of detecting whether the copy was aligned was generally more than what was saved by having a separate aligned version.
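
To make the tail-then-slide scheme concrete, here is a small C model of it. This is a sketch under assumptions, not the generated code: slide_copy and fixed_copy are made-up names, the fixed-size memcpy calls stand in for the MOV.Q/MOV.L pairs, and the handling of 1- and 2-byte tails is a guess (the listings above only show 4- and 8-byte pieces).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for the slide: copy n bytes (n a multiple of 32) in 32-byte
 * blocks, highest block first, as entering at __memcpy_<n> would. */
static void slide_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n % 32 == 0 && n <= 512);
    while (n) {
        n -= 32;
        memcpy(dst + n, src + n, 32); /* four 64-bit load/store pairs in the slide */
    }
}

/* Copy a fixed size: peel the tail above the last multiple of 32, then slide. */
static void fixed_copy(unsigned char *dst, const unsigned char *src, size_t n)
{
    size_t body = n & ~(size_t)31; /* slide entry point, e.g. 480 for n = 492 */
    size_t off = n;
    assert(n <= 512);
    /* Sub-8-byte pieces first, bringing off down to a multiple of 8
     * (for 492 this is the 4-byte copy at offset 488). */
    if (off & 1) { off -= 1; memcpy(dst + off, src + off, 1); }
    if (off & 2) { off -= 2; memcpy(dst + off, src + off, 2); }
    if (off & 4) { off -= 4; memcpy(dst + off, src + off, 4); }
    /* Then 8-byte pieces down to the slide entry point. */
    while (off > body) { off -= 8; memcpy(dst + off, src + off, 8); }
    slide_copy(dst, src, body);
}
```

For n = 492 this performs the same steps as __memcpy_492 followed by __memcpy_488 and then the slide from __memcpy_480.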
If the copy is bigger than 512 bytes, it calls the generic memcpy function, which generally copies memory in large chunks (say, 512 bytes) and then uses some smaller loops to clean up whatever is left.
Say, chunk sizes (bytes):
   512, 128, 32, 16, 8, 4, 1
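
A minimal C sketch of that descending-chunk structure, assuming one loop per chunk size. The name chunked_memcpy is hypothetical, and the inner fixed-size memcpy calls only stand in for whatever unrolled copy the real routine uses at each size.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical generic copy: largest chunks first, smaller loops for the rest. */
static void *chunked_memcpy(void *dst, const void *src, size_t n)
{
    static const size_t chunk[] = { 512, 128, 32, 16, 8, 4, 1 };
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < sizeof chunk / sizeof chunk[0]; i++) {
        while (n >= chunk[i]) {
            memcpy(d, s, chunk[i]); /* stands in for a fixed-size unrolled copy */
            d += chunk[i];
            s += chunk[i];
            n -= chunk[i];
        }
    }
    return dst;
}
```

Since the last chunk size is 1, whatever remains after the larger loops is always finished byte by byte.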

whereas clang 9.0.0 and following produce

[fedora-starfive:/tmp:111383] clang -O -S x.c
[fedora-starfive:/tmp:111384] cat x.s
         .text
         .attribute      4, 16
         .attribute      5, "rv64i2p0_m2p0_a2p0_f2p0_d2p0_c2p0"
         .file   "x.c"
         .globl  uload                           # -- Begin function uload
         .p2align        1
         .type   uload,@function
uload:                                  # @uload
         .cfi_startproc
# %bb.0:
         ld      a0, 0(a0)
         ret
.Lfunc_end0:
         .size   uload, .Lfunc_end0-uload
         .cfi_endproc
                                         # -- End function
         .ident  "clang version 11.0.0 (Fedora 11.0.0-2.0.riscv64.fc33)"
         .section        ".note.GNU-stack","",@progbits
         .addrsig

If that is frequently used for unaligned p, this will be slow on the
U74 and P550.  Maybe SiFive should get around to implementing
unaligned accesses more efficiently.

Granted, yeah, both options kinda suck, depending.
For BGBCC, it will turn an 8-byte memcpy into a 64-bit load and store, but I am assuming my own core, where the generic case for these is 1-3 clock cycles, mostly depending on register dependencies.
There was a special case to reduce a natively aligned 64-bit load to a 2-cycle operation, but it isn't free.
There is also a special case to reduce 'ADD' to 1 cycle, which mostly helps with RV64G (and its over-reliance on 1-cycle ADD/ADDI). It works by having a secondary shadow ALU whose main goal is to have the ADD result ready in the EX1 stage (another option exists to tune for 1- or 2-cycle shift operations).
2-cycle ADD and shift operators (or a 3-cycle 'LD') have relatively little effect on BJX2, but seemingly a bigger effect on RV64G (which far more often tries to use the results immediately).
For my own extensions (when compiling via BGBCC), some of the "main offender" cases were addressed, again reducing the impact of having 2- or 3-cycle latency here.

Though "memcpy()" is usually a "simple to fix up" scenario.

General memcpy where both operands may be unaligned in different ways
is not particularly simple.  This also shows up in the fact that Intel
and AMD have failed to make REP MOVSB faster than software approaches
for many cases when I last looked.  Supposedly Intel has had another
go at it, I should measure it again.

Possibly; some of the fixup strategies I had seen assumed that both operands had the same alignment, but it is probably harder when the alignments are not equal.
Often, bare byte-copy loops were the worst-case fallback.
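
As a sketch of the kind of strategy meant here (not any particular library's implementation; fixup_memcpy is a made-up name): when source and destination share the same alignment modulo 8, copy bytes until both are aligned, then whole 64-bit words, then the tail. When the alignments differ, this version just falls back to the byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static void *fixup_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    assert(d != NULL && s != NULL);

    if ((((uintptr_t)d ^ (uintptr_t)s) & 7) == 0) {
        /* Same alignment: byte-copy until both pointers are 8-byte aligned. */
        while (n && ((uintptr_t)d & 7)) { *d++ = *s++; n--; }
        /* Aligned middle, one 64-bit word at a time. The fixed-size memcpy
         * avoids aliasing problems; whether it becomes a single load/store
         * depends on the compiler, as the gcc output quoted earlier shows. */
        while (n >= 8) {
            uint64_t w;
            memcpy(&w, s, 8);
            memcpy(d, &w, 8);
            d += 8; s += 8; n -= 8;
        }
    }
    /* Tail bytes, or the whole copy when the alignments differ. */
    while (n) { *d++ = *s++; n--; }
    return dst;
}
```

Handling the unequal-alignment case with word accesses would need shifting and merging of adjacent words (or hardware misaligned access), which is where it stops being simple.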

- anton
