Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: Implementing DOES>: How not to do it (and why not) and how to do it
Date: Sat, 13 Jul 2024 15:31:38 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 160
Message-ID: <2024Jul13.173138@mips.complang.tuwien.ac.at>
References: <2024Jul11.160602@mips.complang.tuwien.ac.at>
Injection-Date: Sat, 13 Jul 2024 18:12:58 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="28ef6df2cfd1883d49c452e1cc6b3e39"; logging-data="3853199"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/KcE5P7Mi/q9AbisNb0Ug4"
Cancel-Lock: sha1:kzzRUiteXJenHvV6ziDeNRx7XBA=
X-newsreader: xrn 10.11
Bytes: 6865

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>At least one Forth system implements DOES> inefficiently, but I
>suspect that it's not alone in that.

And indeed, a second system has the same problem; it shows up more
rarely, because normally this system inlines does>-defined words, but
when it does not, it performs badly.

Here's a microbenchmark where the second system does not inline the
does-defined word:

50000000 constant iterations

: faccum
  create 0e f,
does> ( r1 -- r2 )
  dup f@ f+ fdup f! ;

: faccum-part2 ( r1 addr -- r2 )
  dup f@ f+ fdup f! ;

faccum x4 \ 2e x4 fdrop
faccum y4 \ -4e y4 fdrop

: b4 0e iterations 0 do
    x4 y4
  loop ;

: b5 0e iterations 0 do
    [ ' x4 >body ] literal faccum-part2
    [ ' y4 >body ] literal faccum-part2
  loop ;

First, let's see what the Forth systems do by themselves (the B4
microbenchmark); numbers from a Skylake; I have replaced the names of
the Forth systems with inefficient DOES> implementations with A and B.

[~/forth:150659] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses ~/gforth/gforth-fast -e "include does-microbench.fs b4 f. cr bye"
0.

 Performance counter stats for '/home/anton/gforth/gforth-fast -e include does-microbench.fs b4 f. cr bye':

       948_628_907      cycles:u
     3_695_796_028      instructions:u     #    3.90  insn per cycle
         1_154_670      L1-dcache-load-misses
           198_627      L1-icache-load-misses
           306_507      branch-misses

       0.245984689 seconds time elapsed

       0.244894000 seconds user
       0.000000000 seconds sys

[~/forth:150660] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses A "include does-microbench.fs b4 f. cr bye"
0.00000000

 Performance counter stats for 'A include does-microbench.fs b4 f. cr bye':

    38_769_505_700      cycles:u
     1_704_476_397      instructions:u     #    0.04  insn per cycle
       178_288_238      L1-dcache-load-misses
       250_454_606      L1-icache-load-misses
       100_090_310      branch-misses

       9.719803719 seconds time elapsed

       9.715343000 seconds user
       0.000000000 seconds sys

[~/forth:150661] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses B "include does-microbench.fs b4 f. cr bye"
Including does-microbench.fs0.

 Performance counter stats for 'B include does-microbench.fs b4 f. cr bye':

    39_200_313_445      cycles:u
     1_413_936_888      instructions:u     #    0.04  insn per cycle
       150_445_572      L1-dcache-load-misses
       209_127_540      L1-icache-load-misses
       100_128_427      branch-misses

       9.822342252 seconds time elapsed

       9.817016000 seconds user
       0.000000000 seconds sys

So both A and B fall into the cache-ping-pong and the return stack
misprediction pitfalls in this case, resulting in a factor 40 slowdown
compared to Gforth.

Let's see how it works if we use the code I suggest (simulated in B5):

[~/forth:150662] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses ~/gforth/gforth-fast -e "include does-microbench.fs b5 f. cr bye"
0.

 Performance counter stats for '/home/anton/gforth/gforth-fast -e include does-microbench.fs b5 f. cr bye':

       943_277_009      cycles:u
     3_295_795_332      instructions:u     #    3.49  insn per cycle
         1_147_107      L1-dcache-load-misses
           198_364      L1-icache-load-misses
           295_186      branch-misses

       0.247765182 seconds time elapsed

       0.242645000 seconds user
       0.004044000 seconds sys

[~/forth:150663] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses A "include does-microbench.fs b5 f. cr bye"
0.00000000

 Performance counter stats for 'A include does-microbench.fs b5 f. cr bye':

    23_587_381_659      cycles:u
     1_604_475_561      instructions:u     #    0.07  insn per cycle
       100_111_296      L1-dcache-load-misses
       100_502_420      L1-icache-load-misses
            77_126      branch-misses

       6.055177414 seconds time elapsed

       6.055288000 seconds user
       0.000000000 seconds sys

[~/forth:150664] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses B "include does-microbench.fs b5 f. cr bye"
Including does-microbench.fs0.

 Performance counter stats for 'B include does-microbench.fs b5 f. cr bye':

       949_044_323      cycles:u
     1_313_933_897      instructions:u     #    1.38  insn per cycle
           246_252      L1-dcache-load-misses
           105_517      L1-icache-load-misses
            61_449      branch-misses

       0.239750023 seconds time elapsed

       0.239811000 seconds user
       0.000000000 seconds sys

This solves both problems for B, but A still suffers from cache
ping-pong; I suspect that this is because there is not enough distance
between the modified data and FACCUM-PART2 (or, less likely, not
enough distance between the modified data and the loop in B5).
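
The B4/B5 pair can also be tried with integer cells instead of floats.
The following is only a sketch of that variant: the names ACCUM,
ACCUM-PART2, XI, YI, IB4 and IB5 are made up for this illustration and
are not part of does-microbench.fs, and none of the measurements in
this posting were taken with it.

```forth
\ Hypothetical integer variant of the benchmark; all names below
\ (ACCUM, ACCUM-PART2, XI, YI, IB4, IB5) are illustrative only.
50000000 constant iterations

\ Defining word: each child adds its stored cell to n1, stores the
\ sum back into its own body, and leaves the sum on the stack.
: accum ( "name" -- )
  create 0 ,
does> ( n1 -- n2 )
  tuck @ + dup rot ! ;

\ The same body code as an ordinary colon definition that takes the
\ body address explicitly, as FACCUM-PART2 does for floats.
: accum-part2 ( n1 addr -- n2 )
  tuck @ + dup rot ! ;

accum xi
accum yi

\ Calls the DOES>-defined words directly.
: ib4 ( -- n ) 0 iterations 0 do
    xi yi
  loop ;

\ Simulates the split: body address as a literal, then the shared code.
: ib5 ( -- n ) 0 iterations 0 do
    [ ' xi >body ] literal accum-part2
    [ ' yi >body ] literal accum-part2
  loop ;
```

Running IB4 and IB5 under perf stat in the same way as B4 and B5 would
then give a second data point: if IB5 is much faster than IB4, the
DOES> call path is the likely culprit, and if both are slow, the store
is probably hitting a cache line that also holds code. Whether a given
system behaves the same for cells as for floats is an open question
here.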

In any case, if you are a system implementor, you may want to check
your DOES> implementation with a microbenchmark that stores into the
does-defined word in a case where that word is not inlined.

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
   EuroForth 2024: https://euro.theforth.net