Subject: Re: Efficient implementation of Finite State Machines (FSM)
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Date: 11 Mar 2025, 10:53:42
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Mar11.105342@mips.complang.tuwien.ac.at>
User-Agent: xrn 10.11
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Here are the results on a Zen4:
>
>loop          plain         optimized     implementation variant
>1_278_763_454 1_175_241_846 1_175_505_964 cycles
>3_651_376_357 2_441_376_030 2_045_832_844 instructions
...
>For now I don't see why the whole thing takes so many cycles.  I'll
>take a closer look later.
It seems to be an interaction between microarchitectural
corner-cutting in Zen4 and the way that Gforth is implemented, see
<2025Mar11.091817@mips.complang.tuwien.ac.at> and
<2025Mar10.181427@mips.complang.tuwien.ac.at> in comp.arch.
So let's measure on Golden Cove (Intel 1315U P-Core) which apparently
does not have that problem, at least not for the "dup execute-exit"
microbenchmark I used in those measurements. For the fsm-ae.4th
bench1 benchmark I see the following, however:
loop          plain         optimized     implementation variant
1_193_518_309 1_284_084_677 1_282_838_710 cycles
3_755_406_062 2_445_475_893 2_049_460_804 instructions
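
(Counts like these are typically collected with hardware performance
counters; the exact command is not shown here, but a plausible
invocation, with the file and word names taken from this thread,
would be:

perf stat -e cycles:u -e instructions:u gforth fsm-ae.4th -e "bench1 bye"

i.e., user-level cycles and instructions for a Gforth run of the
bench1 benchmark.)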
I.e., the plain and optimized variants are even slower than on Zen4,
and the loop version is not just faster than on Zen4, but even faster
than plain and optimized.  To see why that is, here's the out->out
code, followed by the docol code:
count 1->2
   mov    rax,r8
   add    r8,$01
   movzx  r15d,byte PTR [eax]
cells 2->2
   shl    r15,$03
lit+ 2->2
outside-num
0-6    add    r15,$18[rbx]
@ 2->2
6-11   mov    r15,[r15]
execute-;s 2->1
   add    rbx,$30
   mov    rbx,[r14]
   mov    rax,-$10[r15]
11-11  mov    rdx,r15
   add    r14,$08
   sub    rbx,$08
   jmp    eax

docol:
   add    rbx,$08
   sub    r14,$08
   mov    [r14],rbx
11-11  mov    rbx,rdx
   mov    rax,[rbx]
   jmp    eax
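
At the Forth level, the out->out code above presumably comes from a
state word of roughly this shape (my reconstruction from the
primitives in the listing; the name OUTSIDE is my invention, while
outside-num is the literal shown above):

: outside ( c-addr -- )
  count          \ fetch the input byte, advance the address
  cells          \ scale the byte to a cell offset
  outside-num +  \ compiled as lit+: add the transition-table address
  @              \ fetch the xt of the next state's word
  execute ;      \ fused with the implicit EXIT into execute-;s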
In the benchmark, the out->out code jumps to docol and docol jumps
back to out->out in 99 out of 100 cases.
I see a dependence loop here that works through the instructions where
I have added annotations like "0-6" in front.  "0-6" means that with
rbx available in cycle 0, the result is available in r15 in cycle 6 (5
cycles from the load, 1 from the addition).  The next instruction in
the chain depends on that result in r15 and produces another result in
r15.  Then we have mov instructions that move the result into rdx and
finally into rbx, where it enters the sequence again.
Register-register mov instructions are usually eliminated during
register renaming on the Golden Cove and consume no cycles, so they
are counted with 0 cycles here.
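
Summing up the annotated chain with these latencies (5 cycles per
load, 1 per ALU operation, 0 for the eliminated movs):

add r15,$18[rbx]   load (5) + add (1)   r15 ready in cycle 6
mov r15,[r15]      load (5)             r15 ready in cycle 11
mov rdx,r15        eliminated (0)       still cycle 11
mov rbx,rdx        eliminated (0)       cycle 11, closing the loop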
My computation explains 11 cycles per iteration, but not the measured
12; I may be unaware of another source of delay, or some other
dependence chain may be the reason; I am pretty sure that I have the
right one, though, especially because the results with "dup
execute-exit" eliminate this particular dependence chain and run at
2.4 cycles per iteration on the Golden Cove.
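
(That microbenchmark is presumably of roughly this shape; this is my
guess from the name, not the actual code from those postings:

: self ( xt -- ) dup execute ;  \ execute and EXIT fuse into execute-;s
' self self                     \ bounces between execute-;s and docol

Here the xt reaching execute-;s comes from a register through DUP
rather than through two dependent loads, so the load-load chain above
is absent.)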
I also let llvm-mca (microarchitectural analysis) have a go at it; it
assumes 1 cycle of latency for the register-register moves and
reports:
[~:156499] cat tmp/yyy.s |llvm-mca-16 -mcpu=alderlake --iterations=1000 -timeline
Iterations: 1000
Instructions: 19000
Total Cycles: 13019
Total uOps: 12000
So it claims that this sequence should take 13 cycles per iteration
(13_019 cycles / 1000 iterations); the two extra cycles relative to my
count come from the 1-cycle latency it charges for the two
register-register moves.  It also gives various other information, but
I find that hard to interpret.
- anton
--
M. Anton Ertl                 http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs:         http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard:                 https://forth-standard.org/
EuroForth 2023 proceedings:   http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings:   http://www.euroforth.org/ef24/papers/