Subject: Re: Efficient implementation of Finite State Machines (FSM)
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Date: 11 Mar 2025, 10:53:42
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Mar11.105342@mips.complang.tuwien.ac.at>
User-Agent: xrn 10.11
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Here are the results on a Zen4:
>
>loop          plain         optimized     implementation variant
>1_278_763_454 1_175_241_846 1_175_505_964 cycles
>3_651_376_357 2_441_376_030 2_045_832_844 instructions
...
>For now I don't see why the whole thing takes so many cycles.  I'll
>take a closer look later.
It seems to be an interaction between microarchitectural
corner-cutting in Zen4 and the way that Gforth is implemented, see
<2025Mar11.091817@mips.complang.tuwien.ac.at> and
<2025Mar10.181427@mips.complang.tuwien.ac.at> in comp.arch.
So let's measure on Golden Cove (Intel 1315U P-Core) which apparently
does not have that problem, at least not for the "dup execute-exit"
microbenchmark I used in those measurements. For the fsm-ae.4th
bench1 benchmark I see the following, however:
loop          plain         optimized     implementation variant
1_193_518_309 1_284_084_677 1_282_838_710 cycles
3_755_406_062 2_445_475_893 2_049_460_804 instructions
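
(Counts like these are typically collected with hardware performance
counters; the exact command is not shown here, but a plausible
invocation, with the file and word names taken from this thread,
would be:

perf stat -e cycles:u -e instructions:u gforth fsm-ae.4th -e "bench1 bye"

i.e., user-level cycles and instructions for a Gforth run of the
bench1 benchmark.)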
I.e., the plain and optimized variants are even slower than on Zen4,
and the loop version is not just faster than on Zen4, but even faster
than plain and optimized.  To see why that is, here's the out->out
code, followed by the docol code:
count 1->2
   mov    rax,r8
   add    r8,$01
   movzx  r15d,byte PTR [eax]
cells 2->2
   shl    r15,$03
lit+ 2->2
outside-num
0-6    add    r15,$18[rbx]
@ 2->2
6-11   mov    r15,[r15]
execute-;s 2->1
   add    rbx,$30
   mov    rbx,[r14]
   mov    rax,-$10[r15]
11-11  mov    rdx,r15
   add    r14,$08
   sub    rbx,$08
   jmp    eax

docol:
   add    rbx,$08
   sub    r14,$08
   mov    [r14],rbx
11-11  mov    rbx,rdx
   mov    rax,[rbx]
   jmp    eax
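
At the Forth level, the out->out code above presumably comes from a
state word of roughly this shape (my reconstruction from the
primitives in the listing; the name OUTSIDE is my invention, while
outside-num is the literal shown above):

: outside ( c-addr -- )
  count          \ fetch the input byte, advance the address
  cells          \ scale the byte to a cell offset
  outside-num +  \ compiled as lit+: add the transition-table address
  @              \ fetch the xt of the next state's word
  execute ;      \ fused with the implicit EXIT into execute-;s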
In the benchmark, the out->out code jumps to docol and docol jumps
back to out->out in 99 out of 100 cases.
I see a dependence loop here that works through the instructions where
I have added annotations like "0-6" in front.  "0-6" means that with
rbx available in cycle 0, the result is available in r15 in cycle 6 (5
cycles from the load, 1 from the addition).  The next instruction in
the chain depends on that result in r15 and produces another result in
r15.  Then we have mov instructions that move the result into rdx and
finally into rbx, where it enters the sequence again.
Register-register mov instructions are usually eliminated during
register renaming on the Golden Cove and consume no cycles, so they
are counted with 0 cycles here.
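
Summing up the annotated chain with these latencies (5 cycles per
load, 1 per ALU operation, 0 for the eliminated movs):

add r15,$18[rbx]   load (5) + add (1)   r15 ready in cycle 6
mov r15,[r15]      load (5)             r15 ready in cycle 11
mov rdx,r15        eliminated (0)       still cycle 11
mov rbx,rdx        eliminated (0)       cycle 11, closing the loop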
My computation explains 11 cycles per iteration, but not the measured
12; I may be unaware of another source of delay, or some other
dependence chain may be the reason; I am pretty sure that I have the
right one, though, especially because the results with "dup
execute-exit" eliminate this particular dependence chain and run at
2.4 cycles per iteration on the Golden Cove.
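
(That microbenchmark is presumably of roughly this shape; this is my
guess from the name, not the actual code from those postings:

: self ( xt -- ) dup execute ;  \ execute and EXIT fuse into execute-;s
' self self                     \ bounces between execute-;s and docol

Here the xt reaching execute-;s comes from a register through DUP
rather than through two dependent loads, so the load-load chain above
is absent.)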
I also let llvm-mca (microarchitectural analysis) have a go at it; it
assumes 1 cycle of latency for the register-register moves and
reports:
[~:156499] cat tmp/yyy.s |llvm-mca-16 -mcpu=alderlake --iterations=1000 -timeline
Iterations: 1000
Instructions: 19000
Total Cycles: 13019
Total uOps: 12000
So it claims that this sequence should take 13 cycles per iteration
(13_019 cycles / 1000 iterations); the two extra cycles relative to my
count come from the 1-cycle latency it charges for the two
register-register moves.  It also gives various other information, but
I find that hard to interpret.
- anton
--
M. Anton Ertl                 http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs:         http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard:                 https://forth-standard.org/
EuroForth 2023 proceedings:   http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings:   http://www.euroforth.org/ef24/papers/