From: Krishna Myneni
Newsgroups: comp.lang.forth
Subject: Re: Implementing DOES>: How not to do it (and why not) and how to do it
Date: Sun, 14 Jul 2024 14:28:33 -0500
Organization: A noiseless patient Spider
References: <2024Jul11.160602@mips.complang.tuwien.ac.at> <2024Jul13.173138@mips.complang.tuwien.ac.at>

On 7/14/24 13:32, Krishna Myneni wrote:
> On 7/14/24 07:20, Krishna Myneni wrote:
>> On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
>>> In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
>>> Anton Ertl wrote:
>>>>
>>>> In any case, if you are a system implementor, you may want to check
>>>> your DOES> implementation with a microbenchmark that stores into the
>>>> does-defined word in a case where that word is not inlined.
>>>
>>> Is that equally valid for indirect threaded code?
>>> In indirect threaded code the instruction and data caches
>>> are more separated; e.g., in a simple Forth all of the low-level
>>> code could fit in the I-cache, if I'm not mistaken.
>>
>> Let's check. In kForth-64, an indirect-threaded code system:
>>
>> .s
>>  ok
>> f.s
>> fs:
>>  ok
>> ms@ b4 ms@ swap - .
>> 4274  ok
>> ms@ b5 ms@ swap - .
>> 3648  ok
>>
>> So b5 appears to be more efficient than b4 (the version with DOES>).
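Where the extra work in the DOES> path comes from can be sketched with a toy model of classic indirect threading. This is NOT kForth's actual implementation; `docol`/`dodoes` are the usual textbook names, and addresses and floats are simplified to Python lists:

```python
# Toy indirect-threading model: why the DOES>-defined word (b4 style) does
# more dispatch work per call than b5's "literal + plain colon word" form.
# Hypothetical sketch -- not kForth's code.

stack = []

def docol(word):
    # Inner interpreter for a colon definition: dispatch each xt in turn.
    for xt in word["pfa"]:
        xt["cfa"](xt)

def dodoes(word):
    # Runtime of a DOES>-defined word: push the body address, THEN run the
    # DOES> code -- one more indirection than calling a colon word directly.
    stack.append(word["pfa"])        # "body address": the word's data field
    docol(word["does"])

def lit(word):
    stack.append(word["pfa"])

def accum(word):
    # Stands in for:  dup f@ f+ fdup f!  (accumulate into the body)
    addr = stack.pop()
    addr[0] += stack.pop()
    stack.append(addr[0])

accum_xt  = {"cfa": accum, "pfa": None}
does_code = {"pfa": [accum_xt]}          # the DOES> part of faccum

# b4 style: x4 is DOES>-defined, so every call goes through dodoes.
x4 = {"cfa": dodoes, "pfa": [0.0], "does": does_code}

# b5 style: compile the body address as a literal, then call the ordinary
# colon word faccum-part2 -- no dodoes step at run time.
x4_body      = {"cfa": lit, "pfa": x4["pfa"]}
faccum_part2 = {"cfa": docol, "pfa": [accum_xt]}

stack.append(2.0)
x4["cfa"](x4)                          # b4 path
r_b4 = stack.pop()                     # 2.0

stack.append(2.0)
x4_body["cfa"](x4_body)                # b5 path, step 1: push body address
faccum_part2["cfa"](faccum_part2)      # b5 path, step 2: direct colon call
r_b5 = stack.pop()                     # 4.0 (same accumulator)
```

Both paths end up in the same accumulate code; the difference is only in how the body address gets onto the stack and how many dispatches that takes.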
>>
>> --
>> Krishna
>>
>> === begin code ===
>> 50000000 constant iterations
>>
>> : faccum  create 1 floats allot? 0.0e f!
>>      does> dup f@ f+ fdup f! ;
>>
>> : faccum-part2 ( F: r1 -- r2 ) ( a -- )
>>      dup f@ f+ fdup f! ;
>>
>> faccum x4  2.0e x4 fdrop
>> faccum y4 -4.0e y4 fdrop
>>
>> : b4 0.0e iterations 0 do x4 y4 loop ;
>> : b5 0.0e iterations 0 do
>>      [ ' x4 >body ] literal faccum-part2
>>      [ ' y4 >body ] literal faccum-part2
>>    loop ;
>> === end code ===
>
> Using perf to obtain the microbenchmarks for B4 and B5:
>
> B4
>
> $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e
> L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
> -e "include does-microbench.4th b4 f. cr bye"
> -inf
> Goodbye.
>
>  Performance counter stats for 'kforth64 -e include does-microbench.4th
> b4 f. cr bye':
>
>       14_381_951_937      cycles:u
>       26_206_810_946      instructions:u     #    1.82  insn per cycle
>               58_563      L1-dcache-load-misses:u
>               14_742      L1-icache-load-misses:u
>          100_122_231      branch-misses:u
>
>        4.501011307 seconds time elapsed
>
>        4.477172000 seconds user
>        0.003967000 seconds sys
>
> B5
>
> $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e
> L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
> -e "include does-microbench.4th b5 f. cr bye"
> -inf
> Goodbye.
>
>  Performance counter stats for 'kforth64 -e include does-microbench.4th
> b5 f.
> cr bye':
>
>       11_529_644_734      cycles:u
>       18_906_809_683      instructions:u     #    1.64  insn per cycle
>               59_605      L1-dcache-load-misses:u
>               21_531      L1-icache-load-misses:u
>          100_109_360      branch-misses:u
>
>        3.616353010 seconds time elapsed
>
>        3.600206000 seconds user
>        0.004639000 seconds sys
>
> It appears that the cache misses are fairly small for both b4 and b5,
> but the branch misses are very high on my system.

The prior microbenchmarks were run on an old AMD A10-9600P @ 2.95 GHz.
On a newer system with an Intel Core i5-8400 @ 2.8 GHz, the branch
misses were far fewer.

B4

$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e
L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
-e "include faccum.4th b4 f. cr bye"
0
Goodbye.

 Performance counter stats for 'kforth64 -e include faccum.4th b4 f. cr bye':

      7_847_499_582      cycles:u
     26_206_205_780      instructions:u     #    3.34  insn per cycle
             67_785      L1-dcache-load-misses:u
             65_391      L1-icache-load-misses:u
             38_308      branch-misses:u

       2.014078890 seconds time elapsed

       2.010013000 seconds user
       0.000999000 seconds sys

B5

$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e
L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
-e "include faccum.4th b5 f. cr bye"
0
Goodbye.

 Performance counter stats for 'kforth64 -e include faccum.4th b5 f. cr bye':

      5_314_718_609      cycles:u
     18_906_206_178      instructions:u     #    3.56  insn per cycle
             64_150      L1-dcache-load-misses:u
             44_818      L1-icache-load-misses:u
             29_941      branch-misses:u

       1.372367863 seconds time elapsed

       1.367289000 seconds user
       0.002989000 seconds sys

The efficiency difference is due entirely to the number of instructions
executed for B4 and B5.

--
KM

========== REMAINDER OF ARTICLE TRUNCATED ==========
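That conclusion can be cross-checked with a little arithmetic on the figures above: 50,000,000 loop iterations, two accumulator words executed per iteration (treating the counts as loop-dominated and ignoring the modest include/startup overhead, which is the same for both runs):

```python
# Per-iteration instruction counts from the i5-8400 perf runs above.
iters = 50_000_000

b4_insn = 26_206_205_780   # instructions:u for b4 (DOES>-defined words)
b5_insn = 18_906_206_178   # instructions:u for b5 (literal + colon word)

per_iter_b4 = b4_insn / iters          # ~524 instructions per loop iteration
per_iter_b5 = b5_insn / iters          # ~378 instructions per loop iteration
saved       = (b4_insn - b5_insn) / iters   # ~146 fewer per iteration
per_call    = saved / 2                # ~73 per call of a DOES>-defined word
```

The instruction counts on the two machines agree to within the startup noise, as expected for the same binary, so the ~73-instruction-per-call difference is a property of the code paths, not of the hardware.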