Re: Efficient implementation of Finite State Machines (FSM)

Subject: Re: Efficient implementation of Finite State Machines (FSM)
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Date: 11 Mar 2025, 10:53:42
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Mar11.105342@mips.complang.tuwien.ac.at>
References: 1 2 3 4 5 6 7 8 9 10 11
User-Agent: xrn 10.11
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Here are the results on a Zen4:
>
>  loop             plain         optimized   implementation variant
>1_278_763_454  1_175_241_846  1_175_505_964  cycles
>3_651_376_357  2_441_376_030  2_045_832_844  instructions
...
>For now I don't see why the whole thing takes so many cycles.  I'll
>take a closer look later.

It seems to be an interaction between microarchitectural
corner-cutting in Zen4 and the way that Gforth is implemented, see
<2025Mar11.091817@mips.complang.tuwien.ac.at> and
<2025Mar10.181427@mips.complang.tuwien.ac.at> in comp.arch.

So let's measure on Golden Cove (Intel 1315U P-Core) which apparently
does not have that problem, at least not for the "dup execute-exit"
microbenchmark I used in those measurements.  For the fsm-ae.4th
bench1 benchmark I see the following, however:

  loop             plain         optimized   implementation variant
1_193_518_309  1_284_084_677  1_282_838_710  cycles
3_755_406_062  2_445_475_893  2_049_460_804  instructions
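As a side note, the per-iteration costs implied by these totals can be
derived with a little arithmetic.  The iteration count below is a
hypothetical round number for illustration; the actual count of the bench1
benchmark is not stated here:

```python
# Derive cycles/iteration and IPC from the Golden Cove totals above.
# ITERATIONS is an assumed count, not taken from the benchmark itself.
ITERATIONS = 100_000_000  # hypothetical

golden_cove = {  # (cycles, instructions) from the table above
    "loop":      (1_193_518_309, 3_755_406_062),
    "plain":     (1_284_084_677, 2_445_475_893),
    "optimized": (1_282_838_710, 2_049_460_804),
}

for variant, (cycles, insts) in golden_cove.items():
    print(f"{variant:10s} {cycles / ITERATIONS:5.2f} cycles/iter  "
          f"{insts / cycles:4.2f} IPC")
```

Under that assumption the plain variant comes out near 12.8 cycles per
iteration, in the same ballpark as the 12 cycles discussed below.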

I.e., the plain and optimized variants are even slower than on Zen4,
and the loop version is not just faster than on Zen4, but even faster
than plain and optimized.  To see why that is, here's the out->out code followed by the docol code:

      count    1->2
        mov     rax,r8
        add     r8,$01
        movzx   r15d,byte PTR [eax]
      cells    2->2
        shl     r15,$03
      lit+    2->2
      outside-num
0-6     add     r15,$18[rbx]
      @    2->2
6-11    mov     r15,[r15]
      execute-;s    2->1
        add     rbx,$30
        mov     rbx,[r14]
        mov     rax,-$10[r15]
11-11   mov     rdx,r15
        add     r14,$08
        sub     rbx,$08
        jmp     eax
     
        add     rbx,$08
        sub     r14,$08
        mov     [r14],rbx
11-11   mov     rbx,rdx
        mov     rax,[rbx]
        jmp     eax

In the benchmark the out->out code jumps to docol and docol jumps to
out->out in 99/100 cases.

I see a dependence loop here that works through the instructions where I
have added some annotations like "0-6" in front.  0-6 means that with
rbx available in cycle 0, the result is available in r15 in cycle 6 (5
cycles from the load, 1 from the addition).  The next instruction in
the chain depends on that result in r15 and produces another result in
r15.  Then we have mov instructions that move the result into rdx and
finally into rbx where it enters the sequence again.
Register-register mov instructions usually consume no cycles on the
Golden Cove, so they are counted with 0 cycles here.
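The accounting above can be sketched as a simple sum over the chain,
assuming the latencies stated in the text: 5 cycles for a load, 1 for an
addition, and 0 for a register-register mov (eliminated at rename on
Golden Cove):

```python
# Dependence chain from the out->out/docol code above, with the assumed
# per-link latencies in cycles.
chain = [
    ("add r15,$18[rbx]", 5 + 1),  # load + addition: annotated 0-6
    ("mov r15,[r15]",    5),      # dependent load: annotated 6-11
    ("mov rdx,r15",      0),      # reg-reg mov, zero latency: 11-11
    ("mov rbx,rdx",      0),      # reg-reg mov, zero latency: 11-11
]
total = sum(latency for _, latency in chain)
print(f"dependence-chain latency per iteration: {total} cycles")
```

This reproduces the 11 cycles per iteration mentioned below.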

My computation explains 11 cycles per iteration, but not the measured
12; I may be unaware of another source of delay, or some other
dependence chain may be the reason; I am pretty sure that I have the
right one, though, especially because the results with "dup
execute-exit" eliminate this particular dependence chain and run at
2.4 cycles per iteration on the Golden Cove.

I also let llvm-mca (microarchitectural analysis) have a go at it, and
it assumes 1 cycle latency for the register-register moves, and
reports:

[~:156499] cat tmp/yyy.s |llvm-mca-16 -mcpu=alderlake --iterations=1000 -timeline
Iterations:        1000
Instructions:      19000
Total Cycles:      13019
Total uOps:        12000

So it claims that this sequence should take 13 cycles per iteration.

It also gives various other information but I find that hard to
interpret.

- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/
