Re: An execution time puzzle

Subject: Re: An execution time puzzle
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: 11 Mar 2025, 09:18:17
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Mar11.091817@mips.complang.tuwien.ac.at>
References: 1 2 3
User-Agent: xrn 10.11

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
> And here are measurements with the gcc-10 build on various other
> microarchitectures (IPC=14/(c/it)); lower c/it numbers are better.
>
> cyc/it
>  gf    as
>  8     2.3   Zen4
>  8     3     Zen3
>  4     3     Zen2
>  9     9     Zen
>  2.4   2.4   Golden Cove
>  3           Rocket Lake
>  6     3     Gracemont
> 10.6         Tremont
>
> It's interesting that several microarchitectures show a difference
> between the version of the code produced by gforth-fast (gf) and my
> assembly-language variant (as) that executes the same instruction
> sequences.
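
The conversion quoted above (IPC = 14/(c/it)) can be spelled out as a
small Python sketch.  The numbers are copied from the quoted table;
only rows with both a gf and an as value are included, and the names
used here are chosen for illustration only.

```python
# Instructions per iteration, as stated in the quoted text (IPC=14/(c/it)).
INSTRUCTIONS_PER_ITERATION = 14


def ipc(cycles_per_iteration: float) -> float:
    """Convert cycles/iteration into instructions per cycle."""
    return INSTRUCTIONS_PER_ITERATION / cycles_per_iteration


# (gf, as) cycles/iteration, copied from the quoted table.
measurements = {
    "Zen4": (8, 2.3),
    "Zen3": (8, 3),
    "Zen2": (4, 3),
    "Zen": (9, 9),
    "Golden Cove": (2.4, 2.4),
    "Gracemont": (6, 3),
}

for name, (gf, asm) in measurements.items():
    # Lower c/it means higher IPC.
    print(f"{name:12s} gf IPC={ipc(gf):5.2f}  as IPC={ipc(asm):5.2f}")
```

For example, 8 cycles/iteration corresponds to an IPC of 1.75.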

Given that I have trouble reproducing the slowness in gforth-fast with
assembly language, I took another approach: The Forth source code is:

: foo dup execute-exit ;

So I added a primitive for the combination of DUP and EXECUTE-;S.
This allows exploring the difference between dynamically-generated and
static native code in Gforth.  Here are the different code sequences:

In all versions, the same static docol sequence is used:

add    $0x8,%rbx
sub    $0x8,%r14
mov    %rbx,(%r14)
mov    %rdx,%rbx
mov    (%rbx),%rax
jmp    *%rax

For FOO, there are the following different sequences:

1) dynamic code for "dup execute-exit" (sequence)
2) dynamic code for "dup-execute-exit" (primitive)
3) static code for  "dup-execute-exit" (primitive)

dynamic sequence       dynamic primitive     static primitive
mov %r8,%r15         
add $0x10,%rbx         add $0x8,%rbx     
mov (%r14),%rbx        mov (%r14),%rbx       mov (%r14),%rbx   
mov -0x10(%r15),%rax   mov -0x10(%r8),%rax   mov -0x10(%r8),%rax
mov %r15,%rdx          mov %r8,%rdx          mov %r8,%rdx      
add $0x8,%r14          add $0x8,%r14         add $0x8,%r14     
sub $0x8,%rbx          sub $0x8,%rbx         sub $0x8,%rbx     
jmp *%rax              jmp *%rax             jmp *%rax         

To eliminate the difference between the dynamic and static primitive
variants, I also measured a variant where I manually arranged the
dynamic code to not execute the "add" at the start:

4) static-like dynamic code for "dup-execute-exit" (primitive)

I measured this on a Zen3, which has a similar difference between the
Gforth code and the assembly-language code as the Zen4.  The results are:

c/it
8    1) dynamic sequence
8    2) dynamic primitive
2    3) static primitive
8    4) static-like dynamic primitive
3    5) 4) with dynamic docol (see below)
2    6) 5) with aligned dynamic docol (see below)

So apparently the difference between static code and dynamic code
causes the slowdown on Zen3 (and probably on Zen4).

5) One reason could be that the dynamic code is far away in the address
space from the static code of the docol.  E.g., in one execution of 4)
the code for docol starts at 0x00005558a3b5eac3 and the code for the
dup-execute-exit starts at 0x00007f937beae764.  In order to test this
theory, I copied the docol code right behind the dup-execute-exit code
and made the pointer to docol point to it.  And indeed, the speed
increased to 3 cycles/iteration.

So the distance plays a role on Zen3 and probably on others; I guess
they do not store all bits of the target address in the L1 BTB, so
such a far branch is never promoted to the L1 BTB; the branch
therefore uses the L2 BTB and takes several cycles.
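
To give a feeling for how far apart the two pieces of code are, here
is a small Python sketch using the two addresses given above.  The
number of low bits (32) is purely a hypothetical value chosen for
illustration; it is not a documented property of Zen3 or of any other
core, and the helper names are made up.

```python
DOCOL_ADDR = 0x00005558A3B5EAC3  # static docol code (address from the post)
PRIM_ADDR = 0x00007F937BEAE764   # dynamic dup-execute-exit code (from the post)

# Hypothetical: how many low target bits a small BTB entry might hold.
# This is an assumption for illustration, not a measured or documented value.
HYPOTHETICAL_LOW_BITS = 32


def distance(a: int, b: int) -> int:
    """Absolute distance in bytes between two addresses."""
    return abs(a - b)


def same_upper_bits(a: int, b: int, low_bits: int) -> bool:
    """True if a and b agree in all bits above the low `low_bits` bits."""
    return (a >> low_bits) == (b >> low_bits)


d = distance(DOCOL_ADDR, PRIM_ADDR)
print(hex(d), d.bit_length())  # distance and number of bits needed for it

# Far docol: the upper bits of branch and target differ.
print(same_upper_bits(PRIM_ADDR, DOCOL_ADDR, HYPOTHETICAL_LOW_BITS))
# Docol copied right behind the primitive (the offset 0x20 is illustrative).
print(same_upper_bits(PRIM_ADDR, PRIM_ADDR + 0x20, HYPOTHETICAL_LOW_BITS))
```

The distance is about 42TiB and needs 46 bits, so any scheme that
keeps only the low part of the target would have to treat this branch
specially, whereas the copied docol shares its upper bits with the
branch.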

6) There is still a difference of one cycle/iteration between 3) and 5),
but I guess this can be explained by the usual sources of variation,
such as code alignment.  I tested this theory by aligning the copied
docol code to a 32-byte boundary.  And that indeed produced 2
cycles/iteration.
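
The alignment step amounts to rounding an address up to the next
multiple of 32.  A minimal Python sketch of that arithmetic follows;
the function name is made up, and the example address is the start of
the dup-execute-exit code from 5), used only as an illustrative input,
not as the actual destination of the copy.

```python
def align_up(addr: int, boundary: int) -> int:
    """Round addr up to the next multiple of boundary (a power of two)."""
    if boundary <= 0 or boundary & (boundary - 1):
        raise ValueError("boundary must be a positive power of two")
    return (addr + boundary - 1) & ~(boundary - 1)


# Illustrative input only: an address taken from the text above.
example = 0x00007F937BEAE764
print(hex(align_up(example, 32)))
```

An address that is already a multiple of 32 is returned unchanged.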

Another open issue is that the gcc-12 build of gforth-fast (using r13
instead of r14) is 3 cycles slower than the gcc-10 build.  I don't see
an extension of my BTB theory that would explain this.  So either my
BTB theory is wrong or there is another effect at work.

Here's how you can reproduce this:

To add the primitive, I added

dup-execute-;s ( xt R:w -- xt ) gforth-internal dup_execute_semis
SET_IP((Xt *)w);
SUPER_END;
VM_JUMP(EXEC1(xt));

to the file prim in Gforth (commit
d96c5dba9343e2b331e183b0594b6ee1622808f7) and rebuilt it (with
gcc-10.2.1).

The measurements were then done on a Ryzen 5800X with:

1) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup execute-;s ; ' foo foo"

2) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo foo"

3) perf stat -e cycles -e instructions ./gforth-fast --no-dynamic -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo foo"

4) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo foo"

5) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo -2 cells + @ ' foo cell+ @ tuck 20 move ' foo -2 cells + ! ' foo foo"

6) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo -2 cells + @ ' foo cell+ @ 32 naligned tuck 20 move ' foo -2 cells + ! ' foo foo"

This code always ends in an endless loop, so I pressed Ctrl-C after a
second or so, and then computed

(cycles/instructions)*(instructions/iteration)

where instructions/iteration is 14 for 1), 13 for 2) and 12 for the others.
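
The computation can be written as a small Python sketch.  The
instructions/iteration values are the ones stated above; the counter
values in the example are hypothetical round numbers, not real perf
output, and the function name is made up.

```python
# Instructions per iteration for variants 1) to 6), as stated in the post.
INSTRUCTIONS_PER_ITERATION = {1: 14, 2: 13, 3: 12, 4: 12, 5: 12, 6: 12}


def cycles_per_iteration(cycles: int, instructions: int, variant: int) -> float:
    """(cycles/instructions) * (instructions/iteration) for one variant."""
    return cycles * INSTRUCTIONS_PER_ITERATION[variant] / instructions


# Hypothetical counter values, standing in for the "cycles" and
# "instructions" lines that perf stat prints after Ctrl-C.
print(cycles_per_iteration(4_000_000_000, 7_000_000_000, variant=1))
```

Because only the ratio of the two counters enters the result, it does
not matter exactly when Ctrl-C is pressed, as long as the loop has run
long enough to dwarf the startup cost.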

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
