anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
And here are measurements with the gcc-10 build on various other
microarchitectures (IPC=14/(c/it)); lower c/it numbers are better.
>
cyc/it
gf as
8 2.3 Zen4
8 3 Zen3
4 3 Zen2
9 9 Zen
2.4 2.4 Golden Cove
3 Rocket Lake
6 3 Gracemont
10.6 Tremont
>
It's interesting that several microarchitectures show a difference
between the version of the code produced by gforth-fast (gf) and my
assembly-language variant (as) that executes the same instruction
sequences.
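
As a small worked example of the IPC=14/(c/it) conversion mentioned above, the following Python sketch turns the Zen4 row of the table into instructions per cycle. The function name is made up for illustration; the only inputs are the 14 instructions per iteration and the two c/it values from the table.

```python
# Instructions executed per iteration of the loop, as stated above.
INSTRUCTIONS_PER_ITERATION = 14

def ipc(cycles_per_iteration: float) -> float:
    """Convert cycles/iteration (c/it) into instructions per cycle."""
    return INSTRUCTIONS_PER_ITERATION / cycles_per_iteration

# Zen4 row of the table: gf = 8 c/it, as = 2.3 c/it
print(f"Zen4 gf: {ipc(8):.2f} IPC")
print(f"Zen4 as: {ipc(2.3):.2f} IPC")
```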
Given that I have trouble reproducing the gforth-fast slowness in
assembly language, I took another approach. The Forth source code is:
: foo dup execute-exit ;
So I added a primitive for the combination of DUP and EXECUTE-;S.
This allows exploring the difference between dynamically-generated and
static native code in Gforth. Here are the different code sequences:
In all versions, the same static docol sequence is used:
add $0x8,%rbx
sub $0x8,%r14
mov %rbx,(%r14)
mov %rdx,%rbx
mov (%rbx),%rax
jmp *%rax
For FOO, there are the following different sequences:
1) dynamic code for "dup execute-exit" (sequence)
2) dynamic code for "dup-execute-exit" (primitive)
3) static code for "dup-execute-exit" (primitive)
dynamic sequence       dynamic primitive      static primitive
mov %r8,%r15
add $0x10,%rbx         add $0x8,%rbx
mov (%r14),%rbx        mov (%r14),%rbx        mov (%r14),%rbx
mov -0x10(%r15),%rax   mov -0x10(%r8),%rax    mov -0x10(%r8),%rax
mov %r15,%rdx          mov %r8,%rdx           mov %r8,%rdx
add $0x8,%r14          add $0x8,%r14          add $0x8,%r14
sub $0x8,%rbx          sub $0x8,%rbx          sub $0x8,%rbx
jmp *%rax              jmp *%rax              jmp *%rax
To eliminate the difference between the dynamic and static primitive
variants, I also measured a variant where I manually arranged the
dynamic code to not execute the "add" at the start:
4) static-like dynamic code for "dup-execute-exit" (primitive)
I measured this on a Zen3, which has a similar difference between the
Gforth code and the assembly-language code as the Zen4. The results are:
c/it
8    1) dynamic sequence
8    2) dynamic primitive
2    3) static primitive
8    4) static-like dynamic primitive
3    5) like 4), with dynamic docol (see below)
2    6) like 5), with aligned dynamic docol (see below)
So apparently the difference between static code and dynamic code
causes the slowdown on Zen3 (and probably on Zen4).
5) One reason could be that the dynamic code is far away in the address
space from the static code of the docol. E.g., in one execution of 4)
the code for docol starts at 0x00005558a3b5eac3 and the code for the
dup-execute-exit starts at 0x00007f937beae764. In order to test this
theory, I copied the docol code right behind the dup-execute-exit code
and made the pointer to docol point to it. And indeed, the speed
increased to 3 cycles/iteration.
So the distance plays a role on Zen3 and probably on other
microarchitectures; I guess they do not store the full target address
in the L1 BTB, so such a far branch is never promoted to the L1 BTB;
the branch is therefore predicted from the L2 BTB, which takes several
cycles.
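
To put a number on "far away", here is a minimal Python sketch that computes the distance between the two addresses quoted above and checks whether it would fit in a signed 32-bit offset. The 32-bit width is purely an assumption for illustration; nothing above says how many target bits the L1 BTB actually stores, and the helper names are made up.

```python
# Addresses quoted above for one execution of variant 4).
DOCOL_ADDR = 0x00005558A3B5EAC3             # static docol code
DUP_EXECUTE_EXIT_ADDR = 0x00007F937BEAE764  # dynamically generated code

def branch_distance(source: int, target: int) -> int:
    """Absolute distance in bytes between branch source and target."""
    return abs(target - source)

def fits_in_signed(distance: int, bits: int) -> bool:
    """True if a non-negative distance is below 2**(bits-1)."""
    return distance < (1 << (bits - 1))

distance = branch_distance(DUP_EXECUTE_EXIT_ADDR, DOCOL_ADDR)
print(f"distance: {distance:#x} bytes, needs {distance.bit_length()} bits")
# 32 bits is an assumed width, chosen only to illustrate the idea.
print("fits in a signed 32-bit offset:", fits_in_signed(distance, 32))
```

Once the docol code is copied right behind the dup-execute-exit code, the distance shrinks to a few dozen bytes, which would fit in any plausible target field.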
6) There is still one cycle/iteration of difference between 3) and 5),
but I guess this can be explained with the usual sources of
variations, such as code alignment variations. I tested this theory by
aligning the copied docol code to a 32-byte boundary. And that indeed
produced 2 cycles/iteration.
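
The alignment step can be sketched in Python as follows. This is my reading of what rounding an address up to a 32-byte boundary amounts to (what "32 naligned" in command 6) presumably does); the function name is hypothetical, and the input address is just the start address quoted earlier, used as an example value.

```python
def align_up(addr: int, boundary: int) -> int:
    """Round addr up to the next multiple of boundary (a power of two)."""
    if boundary <= 0 or boundary & (boundary - 1) != 0:
        raise ValueError("boundary must be a power of two")
    # Adding boundary-1 and clearing the low bits rounds up.
    return (addr + boundary - 1) & -boundary

# Example input only: the start address of the dynamic code quoted above.
example = 0x00007F937BEAE764
aligned = align_up(example, 32)
print(f"{example:#x} -> {aligned:#x} (offset within 32-byte block: {aligned % 32})")
```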
Another open issue is that the gcc-12 build of gforth-fast (using r13
instead of r14) is 3 cycles slower than the gcc-10 build. I don't see
an extension of my BTB theory that would explain this. So either my
BTB theory is wrong or there is another effect at work.
Here's how you can reproduce this:
For adding the primitive, I added
dup-execute-;s ( xt R:w -- xt ) gforth-internal dup_execute_semis
SET_IP((Xt *)w);
SUPER_END;
VM_JUMP(EXEC1(xt));
to the file prim in Gforth (commit
d96c5dba9343e2b331e183b0594b6ee1622808f7) and rebuilt it (with
gcc-10.2.1).
The measurements were then done on a Ryzen 5800X with:
1) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup execute-;s ; ' foo foo"
2) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo foo"
3) perf stat -e cycles -e instructions ./gforth-fast --no-dynamic -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo foo"
4) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo foo"
5) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo -2 cells + @ ' foo cell+ @ tuck 20 move ' foo -2 cells + ! ' foo foo"
6) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo -2 cells + @ ' foo cell+ @ 32 naligned tuck 20 move ' foo -2 cells + ! ' foo foo"
This code always ends in an endless loop, so I pressed Ctrl-C after a
second or so, and then computed
(cycles/instructions)*(instructions/iteration)
where instructions/iteration is 14 for 1), 13 for 2) and 12 for the others.
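
The computation can be sketched in Python like this. The instruction counts come from the code sequences shown earlier (6 instructions for docol plus 8, 7 or 6 for FOO); the counter values in the example are invented placeholders, not measurements, and the function names are made up for illustration.

```python
# Instruction counts taken from the code sequences shown earlier.
DOCOL_INSTRUCTIONS = 6
FOO_INSTRUCTIONS = {1: 8, 2: 7, 3: 6, 4: 6, 5: 6, 6: 6}  # per variant

def instructions_per_iteration(variant: int) -> int:
    """Instructions per loop iteration: docol plus the FOO code."""
    return DOCOL_INSTRUCTIONS + FOO_INSTRUCTIONS[variant]

def cycles_per_iteration(cycles: int, instructions: int, variant: int) -> float:
    """(cycles/instructions) * (instructions/iteration), as described above."""
    return (cycles / instructions) * instructions_per_iteration(variant)

# Hypothetical perf stat counter values, for illustration only.
cycles = 8_000_000_000
instructions = 14_000_000_000
print(f"variant 1: {cycles_per_iteration(cycles, instructions, 1):.1f} c/it")
```

Like the manual computation, this ignores the instructions executed during startup, which should be negligible after about a second of looping.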
- anton
-- 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.' Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>