On 2/3/2025 2:34 AM, Anton Ertl wrote:
BGB <cr88192@gmail.com> writes:
On 2/3/2025 12:55 AM, Anton Ertl wrote:
Rather, have something like an explicit "__unaligned" keyword or
similar, and then use the runtime call for these pointers.
There are people who think that it is ok to compile *p to anything if
p is not aligned, even on architectures that support unaligned
accesses. At least one of those people recommended the use of
memcpy(..., ..., sizeof(...)). Let's see what gcc produces on
rv64gc (where unaligned accesses are guaranteed to work):
[fedora-starfive:/tmp:111378] cat x.c
#include <string.h>
long uload(long *p)
{
  long x;
  memcpy(&x,p,sizeof(long));
  return x;
}
[fedora-starfive:/tmp:111379] gcc -O -S x.c
[fedora-starfive:/tmp:111380] cat x.s
.file "x.c"
.option nopic
.text
.align 1
.globl uload
.type uload, @function
uload:
addi sp,sp,-16
lbu t1,0(a0)
lbu a7,1(a0)
lbu a6,2(a0)
lbu a1,3(a0)
lbu a2,4(a0)
lbu a3,5(a0)
lbu a4,6(a0)
lbu a5,7(a0)
sb t1,8(sp)
sb a7,9(sp)
sb a6,10(sp)
sb a1,11(sp)
sb a2,12(sp)
sb a3,13(sp)
sb a4,14(sp)
sb a5,15(sp)
ld a0,8(sp)
addi sp,sp,16
jr ra
.size uload, .-uload
.ident "GCC: (GNU) 10.3.1 20210422 (Red Hat 10.3.1-1)"
.section .note.GNU-stack,"",@progbits
Oh boy. Godbolt tells me that gcc-14.2.0 still does it the same way,
This isn't really the way I would do it, but, granted, it is the way GCC does it...
I guess one can at least be happy it isn't a call into a copy-slide (here x10 is the destination and x11 the source), say:
__memcpy_8:
    lb x13, 7(x11)
    sb x13, 7(x10)
__memcpy_7:
    lb x13, 6(x11)
    sb x13, 6(x10)
__memcpy_6:
    lb x13, 5(x11)
    sb x13, 5(x10)
__memcpy_5:
    lb x13, 4(x11)
    sb x13, 4(x10)
__memcpy_4:
    lb x13, 3(x11)
    sb x13, 3(x10)
__memcpy_3:
    lb x13, 2(x11)
    sb x13, 2(x10)
__memcpy_2:
    lb x13, 1(x11)
    sb x13, 1(x10)
__memcpy_1:
    lb x13, 0(x11)
    sb x13, 0(x10)
__memcpy_0:
    jr ra
But... then again... In BGBCC for fixed-size "memcpy()":
  memcpy 0..64: will often be generated inline,
    using direct loads/stores, up to 64 bits at a time,
    with smaller accesses for any tail bytes.
  memcpy 96..512: will call into an auto-generated slide
    (for multiples of 32 bytes).
    For non-multiples of 32 bytes, a tail copy is auto-generated for that specific size, which then branches into the slide.
So: __memcpy_512 and __memcpy_480 will go directly into the slide, whereas, say, __memcpy_488 will get a more specialized function that copies 8 bytes and then branches into the slide.
The reason for using multiples of 32 bytes is that this is the minimum copy size that does not suffer interlock penalties.
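Roughly, the size-based selection amounts to something like this (purely illustrative C, not the actual BGBCC code; the printouts stand in for the code generator, and the unspecified 64..96 range is simply lumped in with the slide path here):

#include <stdio.h>
#include <stddef.h>

/* Illustrative only: print which form an n-byte fixed-size memcpy()
   would be lowered to, per the strategy described above. */
static void describe_fixed_memcpy(size_t n)
{
    if (n <= 64) {
        printf("%4zu: inline, 64-bit loads/stores, smaller ops for tail bytes\n", n);
    } else if (n <= 512) {
        if ((n & 31) == 0)
            printf("%4zu: call __memcpy_%zu (enters the slide directly)\n", n, n);
        else
            printf("%4zu: call __memcpy_%zu (per-size tail copy, then branch into the slide)\n", n, n);
    } else {
        printf("%4zu: call the generic memcpy()\n", n);
    }
}

int main(void)
{
    size_t sizes[] = { 16, 64, 480, 488, 492, 512, 1000 };
    for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
        describe_fixed_memcpy(sizes[i]);
    return 0;
}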
Say, in XG2 (where R4 is the destination pointer and R5 the source):
__memcpy_512:
    MOV.Q (R5, 480), X20
    MOV.Q (R5, 488), X21
    MOV.Q (R5, 496), X22
    MOV.Q (R5, 504), X23
    MOV.Q X20, (R4, 480)
    MOV.Q X21, (R4, 488)
    MOV.Q X22, (R4, 496)
    MOV.Q X23, (R4, 504)
__memcpy_480:
...
Then, say:
__memcpy_488:
    MOV.Q (R5, 480), X20
    MOV.Q X20, (R4, 480)
    BRA __memcpy_480
__memcpy_496:
    MOV.Q (R5, 480), X20
    MOV.Q (R5, 488), X21
    MOV.Q X20, (R4, 480)
    MOV.Q X21, (R4, 488)
    BRA __memcpy_480
...
Then:
__memcpy_492:
    MOV.L (R5, 488), X20
    MOV.L X20, (R4, 488)
    BRA __memcpy_488
...
For these latter cases, the compiler keeps track of them via bitmaps (with a bit for each size), so that it knows which helper sizes need to be generated.
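Roughly like this, as an illustrative C sketch (the names here are made up, not the actual BGBCC internals):

#include <stdint.h>
#include <stdio.h>

#define MAX_COPY 512

/* One bit per copy size up to 512 bytes; a bit gets set whenever a
   per-size helper (e.g. __memcpy_488) is referenced, and the bitmap
   is scanned afterwards so only the needed helpers get emitted. */
static uint32_t helper_needed[MAX_COPY / 32 + 1];

static void mark_helper_needed(int n)
{
    helper_needed[n >> 5] |= 1u << (n & 31);
}

static void emit_needed_helpers(void)
{
    int n;
    for (n = 1; n <= MAX_COPY; n++) {
        if (helper_needed[n >> 5] & (1u << (n & 31)))
            printf("generate __memcpy_%d\n", n);  /* stand-in for codegen */
    }
}

int main(void)
{
    mark_helper_needed(488);
    mark_helper_needed(492);
    emit_needed_helpers();
    return 0;
}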
In this case, these special functions and slides were the fastest option that also doesn't waste an excessive amount of space (apart from the cost of the slide itself, which is why it only handles copies up to 512 bytes).
I ended up not bothering with special aligned cases, as the cost of detecting whether the copy was aligned was generally more than what was saved by having a separate aligned version.
If bigger than 512 bytes, it calls the generic memcpy function, which generally copies larger chunks of memory (say, 512 bytes at a time) and then uses some smaller loops to clean up whatever is left.
Say, chunk sizes (bytes):
512, 128, 32, 16, 8, 4, 1
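Roughly along these lines, as an illustrative C sketch (the real routine would use unrolled 64-bit and smaller loads/stores per chunk rather than calling memcpy() again):

#include <stddef.h>
#include <string.h>

/* Illustrative chunked copy: work down through the fixed chunk sizes,
   copying as many whole chunks as still fit at each level; the inner
   memcpy() stands in for an unrolled block of loads/stores. */
static void *chunked_memcpy(void *dst, const void *src, size_t n)
{
    static const size_t chunks[] = { 512, 128, 32, 16, 8, 4, 1 };
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t i;

    for (i = 0; i < sizeof(chunks) / sizeof(chunks[0]); i++) {
        while (n >= chunks[i]) {
            memcpy(d, s, chunks[i]);
            d += chunks[i];
            s += chunks[i];
            n -= chunks[i];
        }
    }
    return dst;
}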
whereas clang 9.0.0 and following produce
[fedora-starfive:/tmp:111383] clang -O -S x.c
[fedora-starfive:/tmp:111384] cat x.s
.text
.attribute 4, 16
.attribute 5, "rv64i2p0_m2p0_a2p0_f2p0_d2p0_c2p0"
.file "x.c"
.globl uload # -- Begin function uload
.p2align 1
.type uload,@function
uload: # @uload
.cfi_startproc
# %bb.0:
ld a0, 0(a0)
ret
.Lfunc_end0:
.size uload, .Lfunc_end0-uload
.cfi_endproc
# -- End function
.ident "clang version 11.0.0 (Fedora 11.0.0-2.0.riscv64.fc33)"
.section ".note.GNU-stack","",@progbits
.addrsig
If that is frequently used for unaligned p, this will be slow on the
U74 and P550. Maybe SiFive should get around to implementing
unaligned accesses more efficiently.
Granted, yeah, both options kinda suck, depending.
For BGBCC, it will turn an 8-byte memcpy into a 64-bit load and store, but I am assuming my own core, where the generic case for these is 1 to 3 clock cycles, depending mostly on register dependencies.
There was a special case to reduce a natively aligned 64-bit load to a 2-cycle operation, but it isn't free.
There is also a special case to reduce 'ADD' to 1 cycle, which mostly helps with RV64G (and its over-reliance on 1-cycle ADD/ADDI), by having a secondary shadow ALU whose main goal is to try to have the ADD result ready in the EX1 stage (another option exists to tune for 1- or 2-cycle shift operations).
A 2-cycle ADD or shift (or a 3-cycle 'LD') has relatively little effect on BJX2, but seemingly a bigger effect on RV64G (which far more often tries to use the results immediately).
For my own extensions (when compiling via BGBCC), some of the "main offender" cases were addressed, again reducing the impact of having a 2- or 3-cycle latency here.
Though "memcpy()" is usually a "simple to fix up" scenario.
General memcpy where both operands may be unaligned in different ways
is not particularly simple. This also shows up in the fact that Intel
and AMD have failed to make REP MOVSB faster than software approaches
for many cases when I last looked. Supposedly Intel has had another
go at it, I should measure it again.
Possibly; some of the fixup strategies I had seen assumed both operands had the same alignment, but it is presumably harder when the alignments are not equal.
Often, bare byte-copy loops were the worst-case fallback.
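For reference, the same-alignment fixup amounts to something like this (illustrative C only, not any particular library's code; a real implementation would do the word accesses in assembly or with compiler-specific types to sidestep strict-aliasing issues):

#include <stddef.h>
#include <stdint.h>

/* If source and destination are misaligned by the same amount:
   byte-copy a short head until the destination is 8-byte aligned,
   copy the body as 64-bit words, then byte-copy the tail.
   Mismatched alignments fall through to the worst-case byte loop. */
static void *fixup_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if ((((uintptr_t)d ^ (uintptr_t)s) & 7) == 0) {
        while (n > 0 && ((uintptr_t)d & 7) != 0) {
            *d++ = *s++;
            n--;
        }
        while (n >= 8) {
            *(uint64_t *)(void *)d = *(const uint64_t *)(const void *)s;
            d += 8;
            s += 8;
            n -= 8;
        }
    }
    while (n > 0) {     /* tail, or mismatched alignments */
        *d++ = *s++;
        n--;
    }
    return dst;
}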
- anton