Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Krishna Myneni <krishna.myneni@ccreweb.org>
Newsgroups: comp.lang.forth
Subject: Re: Implementing DOES>: How not to do it (and why not) and how to do
it
Date: Sun, 14 Jul 2024 14:28:33 -0500
Organization: A noiseless patient Spider
Lines: 177
Message-ID: <v718t1$9agb$1@dont-email.me>
References: <2024Jul11.160602@mips.complang.tuwien.ac.at>
<2024Jul13.173138@mips.complang.tuwien.ac.at>
<nnd$68dd354d$5a60d664@a6110a1e6f38ddc9> <v70fph$4mpn$1@dont-email.me>
<v715jj$8ni7$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 14 Jul 2024 21:28:34 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="5e796fb6ecb50fa7849c2c3f73039d52";
logging-data="305675"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18b2AnP7I9BHN398USROwyG"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:kAnIhy0jI2BdjoLup90/V9H0q9I=
Content-Language: en-US
In-Reply-To: <v715jj$8ni7$1@dont-email.me>
Bytes: 6308
On 7/14/24 13:32, Krishna Myneni wrote:
> On 7/14/24 07:20, Krishna Myneni wrote:
>> On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
>>> In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
>>> Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>>> <SNIP>
>>>>
>>>> In any case, if you are a system implementor, you may want to check
>>>> your DOES> implementation with a microbenchmark that stores into the
>>>> does-defined word in a case where that word is not inlined.
>>>
>>> Is that equally valid for indirect threaded code?
>>> In indirect threaded code the instruction and data cache
>>> are more separated, e.g. in a simple Forth all the low level
>>> code could fit in the I-cache, if I'm not mistaken.
>>>
>>
>>
>> Let's check. In kForth-64, an indirect threaded code system,
>>
>> .s
>> <empty>
>> ok
>> f.s
>> fs: <empty>
>> ok
>> ms@ b4 ms@ swap - .
>> 4274 ok
>> ms@ b5 ms@ swap - .
>> 3648 ok
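For anyone repeating the check, the MS@ ... MS@ SWAP - pattern above can
be wrapped in a small helper word. This is only a sketch, not part of the
session quoted above; ELAPSED-MS is a made-up name, and B4/B5 each leave
their float result on the FP stack for the caller to FDROP.

: elapsed-ms ( xt -- )  ms@ swap execute ms@ swap - . ."  ms" cr ;
\ usage:  ' b4 elapsed-ms fdrop    ' b5 elapsed-ms fdrop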
>>
>> So b5 appears to be more efficient than b4 ( the version with DOES> ).
>>
>> --
>> Krishna
>>
>> === begin code ===
>> 50000000 constant iterations
>>
>> : faccum create 1 floats allot? 0.0e f!
>> does> dup f@ f+ fdup f! ;
>>
>> : faccum-part2 ( F: r1 -- r2 ) ( a -- )
>> dup f@ f+ fdup f! ;
>>
>> faccum x4 2.0e x4 fdrop
>> faccum y4 -4.0e y4 fdrop
>>
>> : b4 0.0e iterations 0 do x4 y4 loop ;
>> : b5 0.0e iterations 0 do
>> [ ' x4 >body ] literal faccum-part2
>> [ ' y4 >body ] literal faccum-part2
>> loop ;
>> === end code ===
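To spell out what the two loops compare: B4 calls the does>-defined words
X4 and Y4, so each call must push the word's body address and transfer
control to the code after DOES> at run time. B5 does the same arithmetic
but resolves that dispatch at compile time, compiling each body address
as a literal and calling the factored-out FACCUM-PART2 directly. In
effect X4 (and likewise Y4) is replaced by something like the sketch
below (X4-DIRECT is a made-up name, for illustration only):

: x4-direct ( F: r1 -- r2 )  [ ' x4 >body ] literal faccum-part2 ;
\ same computation as X4, with the DOES> dispatch folded into a literal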
>>
>>
>>
>>
>
> Using perf to obtain the microbenchmarks for B4 and B5,
>
> B4
>
> $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e
> L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
> -e "include does-microbench.4th b4 f. cr bye"
> -inf
> Goodbye.
>
> Performance counter stats for 'kforth64 -e include does-microbench.4th
> b4 f. cr bye':
>
> 14_381_951_937 cycles:u
> 26_206_810_946 instructions:u # 1.82 insn per cycle
> 58_563 L1-dcache-load-misses:u
> 14_742 L1-icache-load-misses:u
> 100_122_231 branch-misses:u
>
> 4.501011307 seconds time elapsed
>
> 4.477172000 seconds user
> 0.003967000 seconds sys
>
>
> B5
>
> $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e
> L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
> -e "include does-microbench.4th b5 f. cr bye"
> -inf
> Goodbye.
>
> Performance counter stats for 'kforth64 -e include does-microbench.4th
> b5 f. cr bye':
>
> 11_529_644_734 cycles:u
> 18_906_809_683 instructions:u # 1.64 insn per cycle
> 59_605 L1-dcache-load-misses:u
> 21_531 L1-icache-load-misses:u
> 100_109_360 branch-misses:u
>
> 3.616353010 seconds time elapsed
>
> 3.600206000 seconds user
> 0.004639000 seconds sys
>
>
> It appears that the cache misses are fairly small for both b4 and b5,
> but the branch misses are very high in my system.
>
The prior micro-benchmarks were run on an old AMD A10-9600P @ 2.95 GHz.
On a newer system with an Intel Core i5-8400 @ 2.8 GHz, the branch
misses were far fewer.

B4

$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e
L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
-e "include faccum.4th b4 f. cr bye"
0
Goodbye.

Performance counter stats for 'kforth64 -e include faccum.4th b4 f. cr
bye':

7_847_499_582 cycles:u
26_206_205_780 instructions:u # 3.34 insn per cycle
67_785 L1-dcache-load-misses:u
65_391 L1-icache-load-misses:u
38_308 branch-misses:u

2.014078890 seconds time elapsed

2.010013000 seconds user
0.000999000 seconds sys


B5

$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e
L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64
-e "include faccum.4th b5 f. cr bye"
0
Goodbye.

Performance counter stats for 'kforth64 -e include faccum.4th b5 f. cr
bye':

5_314_718_609 cycles:u
18_906_206_178 instructions:u # 3.56 insn per cycle
64_150 L1-dcache-load-misses:u
44_818 L1-icache-load-misses:u
29_941 branch-misses:u

1.372367863 seconds time elapsed

1.367289000 seconds user
0.002989000 seconds sys
With cache and branch misses this low, the efficiency difference comes
almost entirely from the number of instructions executed for B4 and B5.
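A quick check with the rounded counts above (just arithmetic on the
numbers already shown, no new measurements):

26.206205780e9 18.906206178e9 f/ f.   \ instruction ratio B4/B5, ~1.39
 7.847499582e9  5.314718609e9 f/ f.   \ cycle ratio B4/B5, ~1.48

So B4's roughly 39% extra instructions account for most of its roughly
48% extra cycles, with the remainder coming from its slightly lower IPC
(3.34 vs 3.56) on this CPU.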
--
KM
========== REMAINDER OF ARTICLE TRUNCATED ==========