From: Krishna Myneni
Newsgroups: comp.lang.forth
Subject: Re: Implementing DOES>: How not to do it (and why not) and how to do it
Date: Sun, 14 Jul 2024 14:28:33 -0500
Organization: A noiseless patient Spider
References: <2024Jul11.160602@mips.complang.tuwien.ac.at> <2024Jul13.173138@mips.complang.tuwien.ac.at>

On 7/14/24 13:32, Krishna Myneni wrote:
> On 7/14/24 07:20, Krishna Myneni wrote:
>> On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
>>> In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
>>> Anton Ertl wrote:
>>>>
>>>> In any case, if you are a system implementor, you may want to check
>>>> your DOES> implementation with a microbenchmark that stores into the
>>>> does-defined word in a case where that word is not inlined.
>>>
>>> Is that equally valid for indirect threaded code?
>>> In indirect threaded code the instruction and data caches
>>> are more separated; e.g., in a simple Forth all of the low-level
>>> code could fit in the I-cache, if I'm not mistaken.
>>
>> Let's check. In kForth-64, an indirect-threaded code system:
>>
>> .s
>>  ok
>> f.s
>> fs:
>>  ok
>> ms@ b4 ms@ swap - .
>> 4274  ok
>> ms@ b5 ms@ swap - .
>> 3648  ok
>>
>> So b5 appears to be more efficient than b4 (the version with DOES>).
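Where the extra work in the DOES> path comes from can be sketched with a toy model of classic indirect threading. This is NOT kForth's actual implementation; `docol`/`dodoes` are the usual textbook names, and addresses and floats are simplified to Python lists:

```python
# Toy indirect-threading model: why the DOES>-defined word (b4 style) does
# more dispatch work per call than b5's "literal + plain colon word" form.
# Hypothetical sketch -- not kForth's code.

stack = []

def docol(word):
    # Inner interpreter for a colon definition: dispatch each xt in turn.
    for xt in word["pfa"]:
        xt["cfa"](xt)

def dodoes(word):
    # Runtime of a DOES>-defined word: push the body address, THEN run the
    # DOES> code -- one more indirection than calling a colon word directly.
    stack.append(word["pfa"])        # "body address": the word's data field
    docol(word["does"])

def lit(word):
    stack.append(word["pfa"])

def accum(word):
    # Stands in for:  dup f@ f+ fdup f!  (accumulate into the body)
    addr = stack.pop()
    addr[0] += stack.pop()
    stack.append(addr[0])

accum_xt  = {"cfa": accum, "pfa": None}
does_code = {"pfa": [accum_xt]}          # the DOES> part of faccum

# b4 style: x4 is DOES>-defined, so every call goes through dodoes.
x4 = {"cfa": dodoes, "pfa": [0.0], "does": does_code}

# b5 style: compile the body address as a literal, then call the ordinary
# colon word faccum-part2 -- no dodoes step at run time.
x4_body      = {"cfa": lit, "pfa": x4["pfa"]}
faccum_part2 = {"cfa": docol, "pfa": [accum_xt]}

stack.append(2.0)
x4["cfa"](x4)                          # b4 path
r_b4 = stack.pop()                     # 2.0

stack.append(2.0)
x4_body["cfa"](x4_body)                # b5 path, step 1: push body address
faccum_part2["cfa"](faccum_part2)      # b5 path, step 2: direct colon call
r_b5 = stack.pop()                     # 4.0 (same accumulator)
```

Both paths end up in the same accumulate code; the difference is only in how the body address gets onto the stack and how many dispatches that takes.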
>>
>> --
>> Krishna
>>
>> === begin code ===
>> 50000000 constant iterations
>>
>> : faccum  create 1 floats allot? 0.0e f!
>>      does> dup f@ f+ fdup f! ;
>>
>> : faccum-part2 ( F: r1 -- r2 ) ( a -- )
>>      dup f@ f+ fdup f! ;
>>
>> faccum x4  2.0e x4 fdrop
>> faccum y4 -4.0e y4 fdrop
>>
>> : b4 0.0e iterations 0 do x4 y4 loop ;
>> : b5 0.0e iterations 0 do
>>      [ ' x4 >body ] literal faccum-part2
>>      [ ' y4 >body ] literal faccum-part2
>>    loop ;
>> === end code ===
>
> Using perf to obtain the microbenchmarks for B4 and B5:
>
> B4
>
> $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e
> L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
> -e "include does-microbench.4th b4 f. cr bye"
> -inf
> Goodbye.
>
>  Performance counter stats for 'kforth64 -e include does-microbench.4th
> b4 f. cr bye':
>
>       14_381_951_937      cycles:u
>       26_206_810_946      instructions:u     #    1.82  insn per cycle
>               58_563      L1-dcache-load-misses:u
>               14_742      L1-icache-load-misses:u
>          100_122_231      branch-misses:u
>
>        4.501011307 seconds time elapsed
>
>        4.477172000 seconds user
>        0.003967000 seconds sys
>
> B5
>
> $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e
> L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
> -e "include does-microbench.4th b5 f. cr bye"
> -inf
> Goodbye.
>
>  Performance counter stats for 'kforth64 -e include does-microbench.4th
> b5 f.
> cr bye':
>
>       11_529_644_734      cycles:u
>       18_906_809_683      instructions:u     #    1.64  insn per cycle
>               59_605      L1-dcache-load-misses:u
>               21_531      L1-icache-load-misses:u
>          100_109_360      branch-misses:u
>
>        3.616353010 seconds time elapsed
>
>        3.600206000 seconds user
>        0.004639000 seconds sys
>
> It appears that the cache misses are fairly small for both b4 and b5,
> but the branch misses are very high on my system.

The prior microbenchmarks were run on an old AMD A10-9600P @ 2.95 GHz.
On a newer system with an Intel Core i5-8400 @ 2.8 GHz, the branch
misses were far fewer.

B4

$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e
L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
-e "include faccum.4th b4 f. cr bye"
0
Goodbye.

 Performance counter stats for 'kforth64 -e include faccum.4th b4 f. cr bye':

      7_847_499_582      cycles:u
     26_206_205_780      instructions:u     #    3.34  insn per cycle
             67_785      L1-dcache-load-misses:u
             65_391      L1-icache-load-misses:u
             38_308      branch-misses:u

       2.014078890 seconds time elapsed

       2.010013000 seconds user
       0.000999000 seconds sys

B5

$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e
L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
-e "include faccum.4th b5 f. cr bye"
0
Goodbye.

 Performance counter stats for 'kforth64 -e include faccum.4th b5 f. cr bye':

      5_314_718_609      cycles:u
     18_906_206_178      instructions:u     #    3.56  insn per cycle
             64_150      L1-dcache-load-misses:u
             44_818      L1-icache-load-misses:u
             29_941      branch-misses:u

       1.372367863 seconds time elapsed

       1.367289000 seconds user
       0.002989000 seconds sys

The efficiency difference is due entirely to the number of instructions
executed for B4 and B5.

--
KM

========== REMAINDER OF ARTICLE TRUNCATED ==========
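That conclusion can be cross-checked with a little arithmetic on the figures above: 50,000,000 loop iterations, two accumulator words executed per iteration (treating the counts as loop-dominated and ignoring the modest include/startup overhead, which is the same for both runs):

```python
# Per-iteration instruction counts from the i5-8400 perf runs above.
iters = 50_000_000

b4_insn = 26_206_205_780   # instructions:u for b4 (DOES>-defined words)
b5_insn = 18_906_206_178   # instructions:u for b5 (literal + colon word)

per_iter_b4 = b4_insn / iters          # ~524 instructions per loop iteration
per_iter_b5 = b5_insn / iters          # ~378 instructions per loop iteration
saved       = (b4_insn - b5_insn) / iters   # ~146 fewer per iteration
per_call    = saved / 2                # ~73 per call of a DOES>-defined word
```

The instruction counts on the two machines agree to within the startup noise, as expected for the same binary, so the ~73-instruction-per-call difference is a property of the code paths, not of the hardware.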