Subject: Re: Computer architects leaving Intel...
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: 09 Sep 2024, 13:28:13
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Sep9.142813@mips.complang.tuwien.ac.at>
References: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
User-Agent: xrn 10.11
Michael S <already5chosen@yahoo.com> writes:
>On Mon, 09 Sep 2024 10:30:34 GMT
>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>One would hope so, but here's what happens with gcc-12:
>>#include <string.h>
>>void foo1(char *p, char* q)
>>{
>>  memcpy(p,q,32);
>>}
>>void foo2(char *p, char* q)
>>{
>>  memmove(p,q,32);
>>}
>>gcc -O3 -mavx2 -c -Wall xxx-memmove.c ; objdump -d xxx-memmove.o:
>>0000000000000000 <foo1>:
>>0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
>>4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
>>8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
>>d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
>>12: c3 ret
>>13: 66 66 2e 0f 1f 84 00 data16 cs nopw 0x0(%rax,%rax,1)
>>1a: 00 00 00 00
>>1e: 66 90 xchg %ax,%ax
>>0000000000000020 <foo2>:
>>20: ba 20 00 00 00 mov $0x20,%edx
>>25: e9 00 00 00 00 jmp 2a <foo2+0xa>
>>The jmp at offset 25 is probably a tail-call to memmove().
>>My guess is that xmm registers and unrolling are used here rather than
>>ymm registers because waking up the upper 128 bits takes time. But
>>even with that, the code uses two different registers, and, scheduled
>>with both loads before both stores, it would handle overlapping
>>buffers and could therefore implement foo2() as well:
>>0: c5 fa 6f 06 vmovdqu (%rsi),%xmm0
>>8: c5 fa 6f 4e 10 vmovdqu 0x10(%rsi),%xmm1
>>4: c5 fa 7f 07 vmovdqu %xmm0,(%rdi)
>>d: c5 fa 7f 4f 10 vmovdqu %xmm1,0x10(%rdi)
>>12: c3 ret
>>- anton
>
>Try -march instead of -mavx2. E.g. -march=haswell
>Sometimes gcc is beyond logic.
For gcc -O3 -march=haswell I got the same result (with gcc-12). I
also tried -march=x86-64-v3 with the same result.
But gcc -O3 -march=x86-64-v4 produced:
0000000000000000 <foo1>:
0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
8: c5 f8 77 vzeroupper
b: c3 ret
c: 0f 1f 40 00 nopl 0x0(%rax)
0000000000000010 <foo2>:
10: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
14: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
18: c5 f8 77 vzeroupper
1b: c3 ret
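To make explicit why the same code can serve both functions here: a copy
that reads the whole block into a register before writing anything is
correct even when the buffers overlap, so it satisfies the memmove()
contract. A minimal C sketch of that pattern, assuming AVX and
<immintrin.h> (the function name is mine, not anything gcc or glibc
provides):
#include <immintrin.h>
/* Hypothetical example: fixed 32-byte copy that loads everything
   before storing anything, so it is safe for overlapping buffers
   and usable as both memcpy(p,q,32) and memmove(p,q,32). */
void copy32_overlap_safe(char *p, const char *q)
{
  __m256i v = _mm256_loadu_si256((const __m256i *)q); /* read all 32 bytes */
  _mm256_storeu_si256((__m256i *)p, v);               /* then write them */
}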
And when changing the length to 64:
0000000000000000 <foo1>:
0: 62 f1 fe 48 6f 06 vmovdqu64 (%rsi),%zmm0
6: 62 f1 fe 48 7f 07 vmovdqu64 %zmm0,(%rdi)
c: c5 f8 77 vzeroupper
f: c3 ret
0000000000000010 <foo2>:
10: 62 f1 fe 48 6f 06 vmovdqu64 (%rsi),%zmm0
16: 62 f1 fe 48 7f 07 vmovdqu64 %zmm0,(%rdi)
1c: c5 f8 77 vzeroupper
1f: c3 ret
But when changing the length to 63:
0000000000000000 <foo1>:
0: c5 fe 6f 06 vmovdqu (%rsi),%ymm0
4: c5 fe 7f 07 vmovdqu %ymm0,(%rdi)
8: c5 fe 6f 4e 1f vmovdqu 0x1f(%rsi),%ymm1
d: c5 fe 7f 4f 1f vmovdqu %ymm1,0x1f(%rdi)
12: c5 f8 77 vzeroupper
15: c3 ret
16: 66 2e 0f 1f 84 00 00 cs nopw 0x0(%rax,%rax,1)
1d: 00 00 00
0000000000000020 <foo2>:
20: ba 3f 00 00 00 mov $0x3f,%edx
25: e9 00 00 00 00 jmp 2a <foo2+0xa>
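For 63 bytes, foo1 uses two overlapping 32-byte accesses (the second at
offset 0x1f = 63-32), but again in load/store/load/store order, which is
not safe for overlapping buffers, so foo2 remains a tail-call to
memmove(). A sketch of the same trick with both loads hoisted above the
stores, which would be overlap-safe (assuming AVX and <immintrin.h>; the
function name and the generalization to any length 32 < n <= 64 are
mine):
#include <immintrin.h>
#include <stddef.h>
/* Hypothetical example: copy n bytes, 32 < n <= 64, as two possibly
   overlapping 32-byte chunks; all loads happen before all stores,
   so overlapping source and destination are handled, too. */
void copy33to64_overlap_safe(char *p, const char *q, size_t n)
{
  __m256i lo = _mm256_loadu_si256((const __m256i *)q);            /* bytes [0,32) */
  __m256i hi = _mm256_loadu_si256((const __m256i *)(q + n - 32)); /* bytes [n-32,n) */
  _mm256_storeu_si256((__m256i *)p, lo);
  _mm256_storeu_si256((__m256i *)(p + n - 32), hi);
}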
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>