Re: An execution time puzzle

Subject: Re: An execution time puzzle
From: robfi680 (at) *nospam* gmail.com (Robert Finch)
Newsgroups: comp.arch
Date: 10 Mar 2025, 18:36:03
Organization: A noiseless patient Spider
Message-ID: <vqn7u5$1f738$1@dont-email.me>
References: 1
User-Agent: Mozilla Thunderbird

On 2025-03-10 3:33 a.m., Anton Ertl wrote:
I have the sequence
     1 add    $0x8,%rbx
     2 sub    $0x8,%r13
     3 mov    %rbx,0x0(%r13)
     4 mov    %rdx,%rbx
     5 mov    (%rbx),%rax
     6 jmp    *%rax
     7 mov    %r8,%r15
     8 add    $0x10,%rbx
     9 mov    0x0(%r13),%rbx
    10 mov    -0x10(%r15),%rax
    11 mov    %r15,%rdx
    12 add    $0x8,%r13
    13 sub    $0x8,%rbx
    14 jmp    *%rax

The contents of the registers and memory are such that the first jmp
continues at the next instruction in the sequence and the second jmp
continues at the top of the sequence.  I measure this sequence with
perf stat on a Zen4, terminating it with Ctrl-C, and get output like:

    21969657501      cycles
    27996663866      instructions  #    1.27  insn per cycle

I.e., about 11 cycles for the whole sequence of 14 instructions.  In
trying to understand where these 11 cycles come from, I asked
llvm-mca with

cat xxx.s|llvm-mca-16 -mcpu=znver4 --iterations=1000

and it tells me that it thinks that 1000 iterations take 2342 cycles:

Iterations:        1000
Instructions:      14000
Total Cycles:      2342
Total uOps:        14000

Dispatch Width:    6
uOps Per Cycle:    5.98
IPC:               5.98
Block RThroughput: 2.3

So llvm-mca does not predict the actual performance correctly in this
case and I still have no explanation for the 11 cycles.

Does anybody have an explanation?
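
As a rough cross-check of the figures quoted above, the following Python sketch recomputes the cycles per iteration from the perf stat counts and from the llvm-mca summary. The variable and function names are illustrative only, and the sketch assumes that essentially all counted instructions belong to the 14-instruction sequence.

```python
# Counts copied from the perf stat and llvm-mca output quoted above.
PERF_CYCLES = 21_969_657_501
PERF_INSTRUCTIONS = 27_996_663_866
INSNS_PER_ITERATION = 14  # length of the measured sequence

MCA_ITERATIONS = 1000
MCA_INSTRUCTIONS = 14_000
MCA_TOTAL_CYCLES = 2342


def cycles_per_iteration(cycles, instructions, insns_per_iter):
    """Cycles spent per pass through the sequence.

    Assumes every counted instruction is part of the sequence.
    """
    iterations = instructions / insns_per_iter
    return cycles / iterations


measured = cycles_per_iteration(PERF_CYCLES, PERF_INSTRUCTIONS, INSNS_PER_ITERATION)
predicted = MCA_TOTAL_CYCLES / MCA_ITERATIONS   # cycles per iteration
mca_ipc = MCA_INSTRUCTIONS / MCA_TOTAL_CYCLES   # instructions per cycle

print(f"measured:  {measured:.2f} cycles/iteration")
print(f"predicted: {predicted:.2f} cycles/iteration (IPC {mca_ipc:.2f})")
print(f"measured/predicted: {measured / predicted:.1f}x")
```

Under these assumptions the measurement works out to just under 11 cycles per iteration, while the llvm-mca totals correspond to roughly 2.3 cycles per iteration at an IPC of about 6.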

The indirect jumps predict very well (0.03% mispredictions), so that's
not the reason.  So the jumps and all instructions that only produce
(intermediate) results consumed by the jumps should not contribute to
the latency: instructions 5, 6, 10, 14.

Instruction 8 produces a dead result (overwritten by instruction 9)
and therefore does not contribute to the latency.  Instruction 4 and
(in the previous iteration) 11 produce results that are only used in
latency-irrelevant instructions.  This leaves us with:

     1 add    $0x8,%rbx
     2 sub    $0x8,%r13
     3 mov    %rbx,0x0(%r13)
     7 mov    %r8,%r15
     9 mov    0x0(%r13),%rbx
    12 add    $0x8,%r13
    13 sub    $0x8,%rbx

One idea is that in this case the hardware alias analysis and 0-cycle
store-to-load forwarding fails for storing and reloading a value
to/from 0(%r13) (instructions 3 and 9), but I would expect a latency
of 6 cycles (1 cycle from instruction 1, 0 from 3, 4 from 9, 1 from
13) from that, not 11.
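
The 6-cycle estimate above can be written out as a sum over the loop-carried %rbx chain. This is only a sketch: the per-instruction latencies are the figures given in the post, taken as assumptions rather than measured values, and the names `CHAIN` and `chain_latency` are made up for illustration.

```python
# Loop-carried dependency chain through %rbx, as described in the post.
# Each entry: (instruction number, instruction text, assumed latency in cycles).
CHAIN = [
    (1, "add $0x8,%rbx", 1),
    (3, "mov %rbx,0x0(%r13)", 0),   # store: counted as 0 cycles in the post
    (9, "mov 0x0(%r13),%rbx", 4),   # reload without 0-cycle forwarding
    (13, "sub $0x8,%rbx", 1),
]

MEASURED_CYCLES = 11  # approximate per-iteration figure from perf stat


def chain_latency(chain):
    """Sum the assumed latencies along the dependency chain."""
    return sum(latency for _, _, latency in chain)


expected = chain_latency(CHAIN)
unexplained = MEASURED_CYCLES - expected

for number, text, latency in CHAIN:
    print(f"{number:>2} {text:<22} {latency} cycle(s)")
print(f"expected chain latency: {expected} cycles")
print(f"not accounted for:      {unexplained} cycles")
```

With these assumed latencies the chain accounts for 6 of the roughly 11 measured cycles, which is the gap the post is asking about.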

- anton

It looks like LLVM is calculating 6 cycles (14000/2342), the same as what you would expect. Could something else be interfering with the perf stat measurement (interrupts, for example)? Does it matter which core it is running on, performance or economy?

