Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail From: anton@mips.complang.tuwien.ac.at (Anton Ertl) Newsgroups: comp.lang.forth Subject: Re: Efficient implementation of Finite State Machines (FSM) Date: Tue, 11 Mar 2025 09:53:42 GMT Organization: Institut fuer Computersprachen, Technische Universitaet Wien Lines: 97 Message-ID: <2025Mar11.105342@mips.complang.tuwien.ac.at> References: <2025Mar5.175013@mips.complang.tuwien.ac.at> <2025Mar5.200303@mips.complang.tuwien.ac.at> <001c29937cf8440ee5cc4a26135820af@www.novabbs.com> <2025Mar6.190930@mips.complang.tuwien.ac.at> <2025Mar9.172923@mips.complang.tuwien.ac.at> Injection-Date: Tue, 11 Mar 2025 11:45:10 +0100 (CET) Injection-Info: dont-email.me; posting-host="c8a9ea49f90bdbf4e8fcda9a2dd6bc1f"; logging-data="2037622"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/TzrAw8awyxDU70LLyTSHd" Cancel-Lock: sha1:4jXBp2nd01IDJBT9ZfRdsHQXBUk= X-newsreader: xrn 10.11 Bytes: 5180 anton@mips.complang.tuwien.ac.at (Anton Ertl) writes: >Here are the results on a Zen4: > > loop plain optimized implementation variant >1_278_763_454 1_175_241_846 1_175_505_964 cycles >3_651_376_357 2_441_376_030 2_045_832_844 instructions .... >For now I don't see why the whole thing takes so many cycles. I'll >take a closer look later. It seems to be an interaction between microarchitectural corner-cutting in Zen4 and the way that Gforth is implemented, see <2025Mar11.091817@mips.complang.tuwien.ac.at> and <2025Mar10.181427@mips.complang.tuwien.ac.at> in comp.arch. So let's measure on Golden Cove (Intel 1315U P-Core) which apparently does not have that problem, at least not for the "dup execute-exit" microbenchmark I used in those measurements. For the fsm-ae.4th bench1 benchmark I see the following, however: loop plain optimized implementation variant 1_193_518_309 1_284_084_677 1_282_838_710 cycles 3_755_406_062 2_445_475_893 2_049_460_804 instructions I.e., the plain and optimized variants are even slower than on Zen4, and the loop version is not just faster than on Zen4, but even faster than plain and optimized. To see why that is, here's the out->out code followed by the docol code: count 1->2 mov rax,r8 add r8,$01 movzx r15d,byte PTR [eax] cells 2->2 shl r15,$03 lit+ 2->2 outside-num 0-6 add r15,$18[rbx] @ 2->2 6-11 mov r15,[r15] execute-;s 2->1 add rbx,$30 mov rbx,[r14] mov rax,-$10[r15] 11-11 mov rdx,r15 add r14,$08 sub rbx,$08 jmp eax add rbx,$08 sub r14,$08 mov [r14],rbx 11-11 mov rbx,rdx mov rax,[rbx] jmp eax In the benchmark the out->out code jumps to docol and docol jumps to out->out in 99/100 cases. I see a depence loop here that works through the instructins where I have added some annotations like "0-6" in front. 0-6 means that with rbx available in cycle 0, the result is available in r15 in cycle 6 (5 cycles from the load, 1 from the addition). The next instruction in the chain depends on that result in r15 and produces another result in r15. Then we have mov instructions that move the result into rdx and finally into rbx where it enters the sequence again. Register-register mov instructions usually consume no cycles on the Golden Cove, so they are counted with 0 cycles here. My computation explains 11 cycles per iteration, but not the measured 12; I may be unaware of another source of delay, or some other dependence chain may be the reason; I am pretty sure that I have the right one, though, especially because the results with "dup execute-exit" eliminate this particular dependence chain and run at 2.4 cycles per iteration on the Golden Cove. I also let llvm-mca (microarchitectural analysis) have a go at it, and it assumes 1 cycle latency for the register-register moves, and reports: [~:156499] cat tmp/yyy.s |llvm-mca-16 -mcpu=alderlake --iterations=1000 -timeline Iterations: 1000 Instructions: 19000 Total Cycles: 13019 Total uOps: 12000 So it claims that this sequence should take 13 cycles per iteration. It also gives various other information but I find that hard to interpret. - anton -- M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html New standard: https://forth-standard.org/ EuroForth 2023 proceedings: http://www.euroforth.org/ef23/papers/ EuroForth 2024 proceedings: http://www.euroforth.org/ef24/papers/