Newsportal USENET - Re: An execution time puzzle

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

I have the sequence
>
add $0x8,%rbx
sub $0x8,%r13
mov %rbx,0x0(%r13)
mov %rdx,%rbx
mov (%rbx),%rax
jmp *%rax
mov %r8,%r15
add $0x10,%rbx
mov 0x0(%r13),%rbx
mov -0x10(%r15),%rax
mov %r15,%rdx
add $0x8,%r13
sub $0x8,%rbx
jmp *%rax
>
The contents of the registers and memory are such that the first jmp
continues at the next instruction in the sequence and the second jmp
continues at the top of the sequence. I measure this sequence with
perf stat on a Zen4, terminating it with Ctrl-C, and get output like:
>
21969657501 cycles
27996663866 instructions # 1.27 insn per cycle
>
I.e., about 11 cycles for the whole sequence of 14 instructions. In
trying to understand where these 11 cycles come from, I asked
llvm-mca with
>
cat xxx.s|llvm-mca-16 -mcpu=znver4 --iterations=1000
>
and it tells me that it thinks that 1000 iterations take 2342 cycles:
>
Iterations: 1000
Instructions: 14000
Total Cycles: 2342
Total uOps: 14000
>
Dispatch Width: 6
uOps Per Cycle: 5.98
IPC: 5.98
Block RThroughput: 2.3
>
So llvm-mca does not predict the actual performance correctly in this
case and I still have no explanation for the 11 cycles.
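
The per-iteration figures can be rechecked from the numbers quoted
above. This is a minimal sketch, assuming that essentially all counted
instructions belong to the 14-instruction sequence; the helper name
cycles_per_iteration is made up for illustration.

```python
# Recompute cycles per iteration from the perf stat and llvm-mca
# numbers quoted above. Assumption: essentially every counted
# instruction belongs to the 14-instruction sequence.
INSTRUCTIONS_PER_ITERATION = 14


def cycles_per_iteration(cycles, instructions):
    """Cycles spent per pass through the 14-instruction sequence."""
    iterations = instructions / INSTRUCTIONS_PER_ITERATION
    return cycles / iterations


# perf stat on the Zen4
measured = cycles_per_iteration(21969657501, 27996663866)

# llvm-mca: 2342 total cycles for 1000 iterations
predicted = 2342 / 1000

print(f"measured: {measured:.2f} cycles/iteration")
print(f"llvm-mca: {predicted:.2f} cycles/iteration")
print(f"ratio:    {measured / predicted:.1f}x")
```

The measured figure comes out just under 11 cycles per iteration,
several times the roughly 2.3 cycles that llvm-mca predicts.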

Even more puzzling: In order to experiment with removing instructions
I recreated this in assembly language:

.text
.globl main
main:
mov $threaded, %rdx
mov $0, %rbx
mov $(returnstack+8),%r13
mov %rdx, %r8
docol:
add $0x8,%rbx
sub $0x8,%r13
mov %rbx,0x0(%r13)
mov %rdx,%rbx
mov (%rbx),%rax
jmp *%rax
outout:
mov %r8,%r15
add $0x10,%rbx
mov 0x0(%r13),%rbx
mov -0x10(%r15),%rax
mov %r15,%rdx
add $0x8,%r13
sub $0x8,%rbx
jmp *%rax

.data
.quad docol
.quad 0
threaded:
.quad outout
returnstack:
.zero 16,0
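
As a sanity check on this layout, the following minimal Python sketch
steps through the two blocks and confirms that the first jmp lands on
outout and the second on docol, with the registers returning to their
initial values. The base address 0x1000 and the function name run are
assumptions for illustration; code addresses are modelled as symbolic
labels rather than real addresses.

```python
# Tiny model of the register and memory traffic in the listing above.
# Assumptions: .data starts at 0x1000; code addresses are just labels.
DOCOL, OUTOUT = "docol", "outout"


def run(iterations, base=0x1000):
    threaded = base + 16      # after ".quad docol" and ".quad 0"
    returnstack = base + 24   # after ".quad outout"
    mem = {base: DOCOL, base + 8: 0, threaded: OUTOUT,
           returnstack: 0, returnstack + 8: 0}
    # Register setup performed in main.
    r = {"rdx": threaded, "rbx": 0, "r13": returnstack + 8,
         "r8": threaded, "rax": 0, "r15": 0}
    trace = []
    pc = DOCOL
    for _ in range(2 * iterations):
        trace.append(pc)
        if pc == DOCOL:
            r["rbx"] += 8                     # add $0x8,%rbx
            r["r13"] -= 8                     # sub $0x8,%r13
            mem[r["r13"]] = r["rbx"]          # mov %rbx,0x0(%r13)
            r["rbx"] = r["rdx"]               # mov %rdx,%rbx
            r["rax"] = mem[r["rbx"]]          # mov (%rbx),%rax
        else:
            r["r15"] = r["r8"]                # mov %r8,%r15
            r["rbx"] += 0x10                  # add $0x10,%rbx
            r["rbx"] = mem[r["r13"]]          # mov 0x0(%r13),%rbx
            r["rax"] = mem[r["r15"] - 0x10]   # mov -0x10(%r15),%rax
            r["rdx"] = r["r15"]               # mov %r15,%rdx
            r["r13"] += 8                     # add $0x8,%r13
            r["rbx"] -= 8                     # sub $0x8,%rbx
        pc = r["rax"]                         # jmp *%rax
    return trace, r


trace, regs = run(3)
print(trace)
print({name: hex(value) for name, value in regs.items()
       if isinstance(value, int)})
```

In this model the value that docol stores to the return stack is the
one that outout loads back a few instructions later, and %rbx, %r13
and %rdx hold the same values at the start of every iteration.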

I assembled and linked this with:

gcc xxx.s -Wl,-no-pie

I ran the result with

perf stat -e cycles -e instructions a.out

terminated it with Ctrl-C and the result is:

10764822288 cycles
64556841216 instructions # 6.00 insn per cycle

I.e., as predicted by llvm-mca. The main difference AFAICS is that in
the slow version docol and outout are not adjacent, but far from each
other, and returnstack is also not close to threaded (and the two
64-bit words before it that also belong to threaded).
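
Putting the two measurements side by side, as a minimal sketch under
the same assumption that nearly all counted instructions belong to the
14-instruction loop (the variable names are made up for illustration):

```python
# Compare the standalone run with the original (slow) measurement.
# Assumption: the four setup instructions in main are negligible.
PER_ITERATION = 14

slow_cycles, slow_instructions = 21969657501, 27996663866
fast_cycles, fast_instructions = 10764822288, 64556841216

slow = slow_cycles / (slow_instructions / PER_ITERATION)
fast = fast_cycles / (fast_instructions / PER_ITERATION)

print(f"original:   {slow:.2f} cycles/iteration")
print(f"standalone: {fast:.2f} cycles/iteration")
print(f"slowdown:   {slow / fast:.1f}x")
```

The standalone version needs roughly 2.3 cycles per iteration, in line
with the llvm-mca prediction, so the same 14 instructions run several
times slower in the original setting.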

It looks like I have found a microarchitectural pitfall, but it's not
clear what it is.
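
One way to probe the two differences named above would be to build
variants of the standalone program with a gap between docol and
outout, or between threaded and returnstack. The generator below is a
hypothetical sketch: make_variant, code_gap and data_gap are names
invented here, and nothing in the measurements above says that either
gap reproduces the slowdown.

```python
# Hypothetical generator for variants of xxx.s with optional padding.
# ".skip N" is the GNU as directive that emits N zero bytes; the padding
# in .text is never executed because docol ends in an indirect jump.
TEMPLATE = """\
.text
.globl main
main:
mov $threaded, %rdx
mov $0, %rbx
mov $(returnstack+8),%r13
mov %rdx, %r8
docol:
add $0x8,%rbx
sub $0x8,%r13
mov %rbx,0x0(%r13)
mov %rdx,%rbx
mov (%rbx),%rax
jmp *%rax
{code_pad}outout:
mov %r8,%r15
add $0x10,%rbx
mov 0x0(%r13),%rbx
mov -0x10(%r15),%rax
mov %r15,%rdx
add $0x8,%r13
sub $0x8,%rbx
jmp *%rax

.data
.quad docol
.quad 0
threaded:
.quad outout
{data_pad}returnstack:
.zero 16,0
"""


def make_variant(code_gap=0, data_gap=0):
    """Return assembly source with the requested gaps, in bytes."""
    if code_gap < 0 or data_gap < 0:
        raise ValueError("gaps must be non-negative")
    if data_gap % 8:
        # Keep returnstack at the same 8-byte alignment as the original.
        raise ValueError("data_gap should be a multiple of 8")
    code_pad = f".skip {code_gap}\n" if code_gap else ""
    data_pad = f".skip {data_gap}\n" if data_gap else ""
    return TEMPLATE.format(code_pad=code_pad, data_pad=data_pad)


if __name__ == "__main__":
    # Example: put outout 4096 bytes after the end of docol.
    print(make_variant(code_gap=4096))
```

The output can be written to a file and built and measured with the
same gcc and perf stat commands as above; with both gaps at 0 it is the
original program.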

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

