Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: Implementing DOES>: How not to do it (and why not) and how to do it
Date: Sat, 13 Jul 2024 15:31:38 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 160
Message-ID: <2024Jul13.173138@mips.complang.tuwien.ac.at>
References: <2024Jul11.160602@mips.complang.tuwien.ac.at>
Injection-Date: Sat, 13 Jul 2024 18:12:58 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="28ef6df2cfd1883d49c452e1cc6b3e39";
	logging-data="3853199"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/KcE5P7Mi/q9AbisNb0Ug4"
Cancel-Lock: sha1:kzzRUiteXJenHvV6ziDeNRx7XBA=
X-newsreader: xrn 10.11
Bytes: 6865

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>At least one Forth system implements DOES> inefficiently, but I
>suspect that it's not alone in that.

And indeed, a second system has the same problem; it shows up more
rarely, because normally this system inlines does>-defined words, but
when it does not, it performs badly.

Here's a microbenchmark where the second system does not inline the
does>-defined word:

50000000 constant iterations
: faccum
    create 0e f,
  does> ( r1 -- r2 )
    dup f@ f+ fdup f! ;

: faccum-part2 ( r1 addr -- r2 )
    dup f@ f+ fdup f! ;
    
faccum x4 \ 2e x4 fdrop
faccum y4 \ -4e y4 fdrop

: b4 0e iterations 0 do x4 y4 loop ;
: b5 0e iterations 0 do
        [ ' x4 >body ] literal faccum-part2
        [ ' y4 >body ] literal faccum-part2
     loop ;


First, let's see what the Forth systems do by themselves (the B4
microbenchmark). The numbers are from a Skylake; I have replaced the
names of the two Forth systems that have inefficient DOES>
implementations by A and B.

[~/forth:150659] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses ~/gforth/gforth-fast -e "include does-microbench.fs b4 f. cr bye"
0.

 Performance counter stats for '/home/anton/gforth/gforth-fast -e include does-microbench.fs b4 f. cr bye':

       948_628_907      cycles:u
     3_695_796_028      instructions:u            #    3.90  insn per cycle
         1_154_670      L1-dcache-load-misses
           198_627      L1-icache-load-misses
           306_507      branch-misses

       0.245984689 seconds time elapsed

       0.244894000 seconds user
       0.000000000 seconds sys


[~/forth:150660] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses A "include does-microbench.fs b4 f. cr bye"
0.00000000


 Performance counter stats for 'A include does-microbench.fs b4 f. cr bye':

    38_769_505_700      cycles:u
     1_704_476_397      instructions:u            #    0.04  insn per cycle
       178_288_238      L1-dcache-load-misses
       250_454_606      L1-icache-load-misses
       100_090_310      branch-misses

       9.719803719 seconds time elapsed

       9.715343000 seconds user
       0.000000000 seconds sys


[~/forth:150661] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses B "include does-microbench.fs b4 f. cr bye"

Including does-microbench.fs0.


 Performance counter stats for 'B include does-microbench.fs b4 f. cr bye':

    39_200_313_445      cycles:u
     1_413_936_888      instructions:u            #    0.04  insn per cycle
       150_445_572      L1-dcache-load-misses
       209_127_540      L1-icache-load-misses
       100_128_427      branch-misses

       9.822342252 seconds time elapsed

       9.817016000 seconds user
       0.000000000 seconds sys

So in this case both A and B fall into both pitfalls, cache ping-pong
and return-stack misprediction, which results in a slowdown by a
factor of about 40 compared to Gforth.
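
As a rough cross-check of that factor, the cycle counts from the perf
output above can simply be divided. The following is a minimal sketch
in standard Forth, assuming a system that provides the floating-point
word set; the word name SLOWDOWN is made up for this illustration and
is not part of the benchmark. The cycle counts include system startup
and loading the file, so the ratios are only approximate.

```forth
\ Cycle counts copied from the perf output for B4 above.
\ SLOWDOWN is a hypothetical helper name chosen for this sketch.
: slowdown ( F: r-cycles r-baseline -- r-ratio ) f/ ;

38769505700e 948628907e slowdown f. cr  \ A relative to gforth-fast
39200313445e 948628907e slowdown f. cr  \ B relative to gforth-fast
```

Both ratios come out at roughly 41.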

Let's see how it works if we use the code I suggest (simulated in B5):

[~/forth:150662] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses ~/gforth/gforth-fast -e "include does-microbench.fs b5 f. cr bye"
0. 

 Performance counter stats for '/home/anton/gforth/gforth-fast -e include does-microbench.fs b5 f. cr bye':

       943_277_009      cycles:u
     3_295_795_332      instructions:u            #    3.49  insn per cycle
         1_147_107      L1-dcache-load-misses
           198_364      L1-icache-load-misses
           295_186      branch-misses

       0.247765182 seconds time elapsed

       0.242645000 seconds user
       0.004044000 seconds sys


[~/forth:150663] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses A "include does-microbench.fs b5 f. cr bye"
0.00000000


 Performance counter stats for 'A include does-microbench.fs b5 f. cr bye':

    23_587_381_659      cycles:u
     1_604_475_561      instructions:u            #    0.07  insn per cycle
       100_111_296      L1-dcache-load-misses
       100_502_420      L1-icache-load-misses
            77_126      branch-misses

       6.055177414 seconds time elapsed

       6.055288000 seconds user
       0.000000000 seconds sys


[~/forth:150664] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses B "include does-microbench.fs b5 f. cr bye"

Including does-microbench.fs0.

 Performance counter stats for 'B include does-microbench.fs b5 f. cr bye':

       949_044_323      cycles:u
     1_313_933_897      instructions:u            #    1.38  insn per cycle
           246_252      L1-dcache-load-misses
           105_517      L1-icache-load-misses
            61_449      branch-misses

       0.239750023 seconds time elapsed

       0.239811000 seconds user
       0.000000000 seconds sys

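This solves both problems for B, which now takes about as many cycles
as Gforth. A gets rid of the branch mispredictions, but still suffers
from cache ping-pong (about 100M L1 data-cache and 100M L1
instruction-cache load misses); I suspect that this is because there
is not enough distance between the modified data and FACCUM-PART2 (or,
less likely, not enough distance between the modified data and the
loop in B5).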
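
One way to probe the distance hypothesis is to put padding around the
modified data and rerun B5. The following variant of the benchmark
file is only a sketch for experimentation: the constant GAP and its
value 4096 are assumptions, not a known requirement of any system, and
the padding only has an effect on systems that place code and data in
the same dictionary space.

```forth
\ Variant of the B5 setup with padding inserted.
\ GAP is an assumed value for this experiment; try other sizes as well.
4096 constant gap
50000000 constant iterations

: faccum
    create 0e f,
  does> ( r1 -- r2 )
    dup f@ f+ fdup f! ;

: faccum-part2 ( r1 addr -- r2 )
    dup f@ f+ fdup f! ;

gap allot   \ padding between FACCUM-PART2 and the data that gets modified
faccum x4
faccum y4
gap allot   \ padding between the modified data and the loop in B5

: b5 0e iterations 0 do
        [ ' x4 >body ] literal faccum-part2
        [ ' y4 >body ] literal faccum-part2
     loop ;
```

If the L1 miss counts for B5 drop with the padding in place, that
would support the distance explanation; if they do not, it does not
rule it out, because the required distance may be larger than GAP.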

In any case, if you are a system implementor, you may want to check
your DOES> implementation with a microbenchmark that stores into the
body of a does>-defined word, in a case where that word is not inlined.
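
For a system without floating-point support, a similar check could
look like the following sketch. It uses the same pattern (a
does>-defined word that stores into its own body), but with integers;
all names (ROUNDS, COUNTER, C1, C2, BENCH) are chosen for this
illustration. Whether C1 and C2 are inlined into BENCH depends on the
system and has to be checked separately, e.g., by looking at the
output of SEE BENCH.

```forth
\ Integer variant of the store-into-own-body pattern.
\ All names here are hypothetical, chosen for this sketch.
10000000 constant rounds

: counter ( "name" -- )
    create 0 ,
  does> ( -- n )       \ run-time: the body address is on the stack
    1 over +!          \ store into the word's own body
    @ ;                \ return the updated count

counter c1
counter c2

: bench ( -- n )
    0 rounds 0 do drop c1 c2 + loop ;

bench . cr
```

Running this under perf stat with the same events as above, and
comparing against a variant that passes the body address to an
ordinary colon definition (as B5 does), should show whether the
system is affected.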

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
   EuroForth 2024: https://euro.theforth.net