Re: An execution time puzzle

Subject: Re: An execution time puzzle
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: 10 Mar 2025, 18:14:27
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Mar10.181427@mips.complang.tuwien.ac.at>
References: 1 2
User-Agent: xrn 10.11
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
I have the sequence
>
    add    $0x8,%rbx
    sub    $0x8,%r13
    mov    %rbx,0x0(%r13)
    mov    %rdx,%rbx
    mov    (%rbx),%rax
    jmp    *%rax
    mov    %r8,%r15
    add    $0x10,%rbx
    mov    0x0(%r13),%rbx
    mov    -0x10(%r15),%rax
    mov    %r15,%rdx
    add    $0x8,%r13
    sub    $0x8,%rbx
    jmp    *%rax
>
The contents of the registers and memory are such that the first jmp
continues at the next instruction in the sequence and the second jmp
continues at the top of the sequence.  I measure this sequence with
perf stat on a Zen4, terminating it with Ctrl-C, and get output like:
>
      21969657501      cycles
      27996663866      instructions  #    1.27  insn per cycle
>
I.e., about 11 cycles for the whole sequence of 14 instructions.  In
trying to understand where these 11 cycles come from, I asked
llvm-mca with
>
cat xxx.s|llvm-mca-16 -mcpu=znver4 --iterations=1000
>
and it tells me that it thinks that 1000 iterations take 2342 cycles:
>
Iterations:        1000
Instructions:      14000
Total Cycles:      2342
Total uOps:        14000
>
Dispatch Width:    6
uOps Per Cycle:    5.98
IPC:               5.98
Block RThroughput: 2.3
>
So llvm-mca does not predict the actual performance correctly in this
case and I still have no explanation for the 11 cycles.
>
Even more puzzling: In order to experiment with removing instructions
I recreated this in assembly language:
>
       .text
       .globl main
main:
       mov $threaded, %rdx
       mov $0, %rbx
       mov $(returnstack+8),%r13
       mov %rdx, %r8
docol:  
       add    $0x8,%rbx
       sub    $0x8,%r13
       mov    %rbx,0x0(%r13)
       mov    %rdx,%rbx
       mov    (%rbx),%rax
       jmp    *%rax
outout:
       mov    %r8,%r15
       add    $0x10,%rbx
       mov    0x0(%r13),%rbx
       mov    -0x10(%r15),%rax
       mov    %r15,%rdx
       add    $0x8,%r13
       sub    $0x8,%rbx
       jmp    *%rax
>
       .data
       .quad docol
       .quad 0
threaded:
       .quad outout
returnstack:
       .zero 16,0
>
I assembled and linked this with:
>
gcc xxx.s -Wl,-no-pie
>
I ran the result with
>
perf stat -e cycles -e instructions a.out
>
terminated it with Ctrl-C and the result is:
>
10764822288      cycles
64556841216      instructions #    6.00  insn per cycle
>
I.e., as predicted by llvm-mca.  The main difference AFAICS is that in
the slow version docol and outout are not adjacent, but far from each
other, and returnstack is also not close to threaded (and the two
64-bit words before it that also belong to threaded).
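
The per-iteration figures quoted above follow directly from the perf
stat counters: each pass through the sequence retires 14 instructions,
so cycles per iteration is 14 * cycles / instructions.  A minimal
Python sketch of that arithmetic, using only the counter values
reported above.  The function name cycles_per_iteration is made up
for illustration, and the four setup instructions in main are assumed
to be negligible:

```python
# Cycles per iteration from perf stat counters, assuming every iteration
# retires exactly INSNS_PER_ITERATION instructions (14 for this sequence).
INSNS_PER_ITERATION = 14


def cycles_per_iteration(cycles, instructions,
                         insns_per_iteration=INSNS_PER_ITERATION):
    """Return the average number of cycles spent per iteration."""
    iterations = instructions / insns_per_iteration
    return cycles / iterations


# Counter values copied from the two perf stat outputs quoted above.
slow = cycles_per_iteration(21969657501, 27996663866)  # first measurement
fast = cycles_per_iteration(10764822288, 64556841216)  # assembly recreation
mca = 2342 / 1000  # llvm-mca: total cycles / iterations

print(f"slow version: {slow:.2f} cycles/iteration")
print(f"fast version: {fast:.2f} cycles/iteration")
print(f"llvm-mca:     {mca:.2f} cycles/iteration")
```

The assembly recreation lands within rounding of the llvm-mca figure,
while the first measurement is more than four times slower.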

Inserting 4096 bytes before outout and before returnstack did not
change the performance on Zen4.  Another difference is that in the
slow version outout is in rwx memory while docol is in rx memory.  I
am too weak in assembly language to produce such an rwx section (and
too lazy to do it by actually dynamically allocating the rwx memory).

It looks like I have found a microarchitectural pitfall, but it's not
clear what it is.

Yes, looks like a microarchitectural pitfall:

On Zen4, with two different builds of gforth-fast:

gcc-12                      gcc-10
11 cycles/iteration         8 cycles/iteration
mov    %r8,%r15             mov    %r8,%r15
add    $0x10,%rbx           add    $0x10,%rbx
mov    0x0(%r13),%rbx       mov    (%r14),%rbx
mov    -0x10(%r15),%rax     mov    -0x10(%r15),%rax
mov    %r15,%rdx            mov    %r15,%rdx
add    $0x8,%r13            add    $0x8,%r14
sub    $0x8,%rbx            sub    $0x8,%rbx
jmp    *%rax                jmp    *%rax
add    $0x8,%rbx            add    $0x8,%rbx
sub    $0x8,%r13            sub    $0x8,%r14
mov    %rbx,0x0(%r13)       mov    %rbx,(%r14)
mov    %rdx,%rbx            mov    %rdx,%rbx
mov    (%rbx),%rax          mov    (%rbx),%rax
jmp    *%rax                jmp    *%rax

Of course, there is also a difference in where the code and data
pieces are placed.

And here are measurements with the gcc-10 build on various other
microarchitectures (IPC=14/(c/it)); lower c/it numbers are better.

cyc/it
gf   as   microarchitecture
 8   2.3  Zen4
 8   3    Zen3
 4   3    Zen2
 9   9    Zen
 2.4 2.4  Golden Cove
 3        Rocket Lake
 6   3    Gracemont
10.6      Tremont

It's interesting that several microarchitectures show a difference
between the version of the code produced by gforth-fast (gf) and my
assembly-language variant (as) that executes the same instruction
sequences.
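
To make the gf/as gap easier to compare across machines, here is a
small Python sketch that converts the table into IPC (14/(c/it)) and a
gf/as slowdown ratio.  It assumes, based on column alignment, that the
rows with a single value (Rocket Lake, Tremont) are gf measurements
with no as measurement.  The names measurements and ipc are made up
for illustration:

```python
# Cycles per iteration (gf, as) copied from the table above.
# None marks a missing measurement (assumption: single-value rows are gf).
INSNS_PER_ITERATION = 14

measurements = {
    "Zen4": (8, 2.3),
    "Zen3": (8, 3),
    "Zen2": (4, 3),
    "Zen": (9, 9),
    "Golden Cove": (2.4, 2.4),
    "Rocket Lake": (3, None),
    "Gracemont": (6, 3),
    "Tremont": (10.6, None),
}


def ipc(cycles_per_iteration):
    """Instructions per cycle for the 14-instruction sequence."""
    return INSNS_PER_ITERATION / cycles_per_iteration


for name, (gf, as_) in measurements.items():
    gf_ipc = f"{ipc(gf):.2f}"
    as_ipc = f"{ipc(as_):.2f}" if as_ is not None else "n/a"
    ratio = f"{gf / as_:.2f}x" if as_ is not None else "n/a"
    print(f"{name:12} gf IPC {gf_ipc:>5}  as IPC {as_ipc:>5}  gf/as {ratio}")
```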

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
