Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: Efficient implementation of Finite State Machines (FSM)
Date: Tue, 11 Mar 2025 09:53:42 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 97
Message-ID: <2025Mar11.105342@mips.complang.tuwien.ac.at>
References: <vq4rmt$1dv1v$1@dont-email.me> <nnd$05d8817a$443d26e2@cbc0f0f0dc4c07a4> <de0b1a90f9b6e2f4a5c05d1f9135e64e@www.novabbs.com> <vq9epj$2cvkj$1@dont-email.me> <2025Mar5.175013@mips.complang.tuwien.ac.at> <d122f0610edfd0bdc2895aa1b726cf1e@www.novabbs.com> <2025Mar5.200303@mips.complang.tuwien.ac.at> <001c29937cf8440ee5cc4a26135820af@www.novabbs.com> <2025Mar6.190930@mips.complang.tuwien.ac.at> <vqeogh$3ejl4$1@dont-email.me> <2025Mar9.172923@mips.complang.tuwien.ac.at>
Injection-Date: Tue, 11 Mar 2025 11:45:10 +0100 (CET)
Injection-Info: dont-email.me; posting-host="c8a9ea49f90bdbf4e8fcda9a2dd6bc1f";
	logging-data="2037622"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/TzrAw8awyxDU70LLyTSHd"
Cancel-Lock: sha1:4jXBp2nd01IDJBT9ZfRdsHQXBUk=
X-newsreader: xrn 10.11
Bytes: 5180

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>Here are the results on a Zen4:
>
>  loop             plain         optimized   implementation variant
>1_278_763_454  1_175_241_846  1_175_505_964  cycles
>3_651_376_357  2_441_376_030  2_045_832_844  instructions
....
>For now I don't see why the whole thing takes so many cycles.  I'll
>take a closer look later.

It seems to be an interaction between microarchitectural
corner-cutting in Zen4 and the way that Gforth is implemented, see
<2025Mar11.091817@mips.complang.tuwien.ac.at> and
<2025Mar10.181427@mips.complang.tuwien.ac.at> in comp.arch.

So let's measure on Golden Cove (Intel 1315U P-Core) which apparently
does not have that problem, at least not for the "dup execute-exit"
microbenchmark I used in those measurements.  For the fsm-ae.4th
bench1 benchmark I see the following, however:

  loop             plain         optimized   implementation variant
1_193_518_309  1_284_084_677  1_282_838_710  cycles
3_755_406_062  2_445_475_893  2_049_460_804  instructions

I.e., the plain and optimized variants are even slower than on Zen4,
and the loop version is not just faster than on Zen4, but even faster
than plain and optimized.  To see why that is, here's the out->out code followed by the docol code:

      count    1->2 
        mov     rax,r8
        add     r8,$01
        movzx   r15d,byte PTR [eax]
      cells    2->2 
        shl     r15,$03
      lit+    2->2 
      outside-num
0-6     add     r15,$18[rbx]
      @    2->2 
6-11    mov     r15,[r15]
      execute-;s    2->1 
        add     rbx,$30
        mov     rbx,[r14]
        mov     rax,-$10[r15]
11-11   mov     rdx,r15
        add     r14,$08
        sub     rbx,$08
        jmp     eax
      
        add     rbx,$08
        sub     r14,$08
        mov     [r14],rbx
11-11   mov     rbx,rdx
        mov     rax,[rbx]
        jmp     eax

In the benchmark the out->out code jumps to docol and docol jumps to
out->out in 99/100 cases.

I see a depence loop here that works through the instructins where I
have added some annotations like "0-6" in front.  0-6 means that with
rbx available in cycle 0, the result is available in r15 in cycle 6 (5
cycles from the load, 1 from the addition).  The next instruction in
the chain depends on that result in r15 and produces another result in
r15.  Then we have mov instructions that move the result into rdx and
finally into rbx where it enters the sequence again.
Register-register mov instructions usually consume no cycles on the
Golden Cove, so they are counted with 0 cycles here.

My computation explains 11 cycles per iteration, but not the measured
12; I may be unaware of another source of delay, or some other
dependence chain may be the reason; I am pretty sure that I have the
right one, though, especially because the results with "dup
execute-exit" eliminate this particular dependence chain and run at
2.4 cycles per iteration on the Golden Cove.

I also let llvm-mca (microarchitectural analysis) have a go at it, and
it assumes 1 cycle latency for the register-register moves, and
reports:

[~:156499] cat tmp/yyy.s |llvm-mca-16 -mcpu=alderlake --iterations=1000 -timeline
Iterations:        1000
Instructions:      19000
Total Cycles:      13019
Total uOps:        12000

So it claims that this sequence should take 13 cycles per iteration.

It also gives various other information but I find that hard to
interpret.

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/
EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/