Subject : Re: Implementing DOES>: How not to do it (and why not) and how to do it
From : krishna.myneni (at) *nospam* ccreweb.org (Krishna Myneni)
Newsgroups : comp.lang.forth
Date : 14. Jul 2024, 19:32:19
Organisation : A noiseless patient Spider
Message-ID : <v715jj$8ni7$1@dont-email.me>
User-Agent : Mozilla Thunderbird
On 7/14/24 07:20, Krishna Myneni wrote:
> On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
>> In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
>> Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>> <SNIP>
>>>
>>> In any case, if you are a system implementor, you may want to check
>>> your DOES> implementation with a microbenchmark that stores into the
>>> does-defined word in a case where that word is not inlined.
>>
>> Is that equally valid for indirect threaded code?
>> In indirect threaded code the instruction and data cache
>> are more separated, e.g. in a simple Forth all the low level
>> code could fit in the I-cache, if I'm not mistaken.
>
> Let's check. In kForth-64, an indirect threaded code system,
> .s
> <empty>
> ok
> f.s
> fs: <empty>
> ok
> ms@ b4 ms@ swap - .
> 4274 ok
> ms@ b5 ms@ swap - .
> 3648 ok
> So b5 appears to be more efficient than b4 ( the version with DOES> ).
> -- Krishna
> === begin code ===
> 50000000 constant iterations
> : faccum create 1 floats allot? 0.0e f!
> does> dup f@ f+ fdup f! ;
> : faccum-part2 ( F: r1 -- r2 ) ( a -- )
> dup f@ f+ fdup f! ;
> faccum x4 2.0e x4 fdrop
> faccum y4 -4.0e y4 fdrop
> : b4 0.0e iterations 0 do x4 y4 loop ;
> : b5 0.0e iterations 0 do
> [ ' x4 >body ] literal faccum-part2
> [ ' y4 >body ] literal faccum-part2
> loop ;
> === end code ===
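The interactive timings above (ms@ b4 ms@ swap - .) can be wrapped in a small word. This is only a sketch: TIME-XT and SPIN are hypothetical names, not part of kForth or of the benchmark file, and MS@ is assumed to return a millisecond count on the data stack, as it does in the session shown.

```forth
\ Sketch only: TIME-XT and SPIN are hypothetical names, not from the post.
\ Assumes MS@ ( -- u ) returns a millisecond count, as in the session above.
: time-xt ( xt -- )   \ run xt and print the elapsed milliseconds
    ms@ >r            \ keep the start time on the return stack
    execute           \ the timed word may leave results on the FP stack
    ms@ r> - . ;      \ end time minus start time

: spin ( -- ) 1000000 0 do loop ;   \ trivial workload for a quick check
' spin time-xt

\ With does-microbench.4th loaded:
\ ' b4 time-xt fdrop
\ ' b5 time-xt fdrop
```

Keeping the start time on the return stack leaves the data stack untouched for the word being timed, so the same helper should work for words that take or return data-stack items.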
Using perf to obtain the microbenchmarks for B4 and B5,
B4
$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64 -e "include does-microbench.4th b4 f. cr bye"
-inf
Goodbye.
Performance counter stats for 'kforth64 -e include does-microbench.4th b4 f. cr bye':
    14_381_951_937      cycles:u
    26_206_810_946      instructions:u            #    1.82  insn per cycle
            58_563      L1-dcache-load-misses:u
            14_742      L1-icache-load-misses:u
       100_122_231      branch-misses:u

       4.501011307 seconds time elapsed
4.477172000 seconds user
0.003967000 seconds sys
B5
$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64 -e "include does-microbench.4th b5 f. cr bye"
-inf
Goodbye.
Performance counter stats for 'kforth64 -e include does-microbench.4th b5 f. cr bye':
    11_529_644_734      cycles:u
    18_906_809_683      instructions:u            #    1.64  insn per cycle
            59_605      L1-dcache-load-misses:u
            21_531      L1-icache-load-misses:u
       100_109_360      branch-misses:u

       3.616353010 seconds time elapsed
3.600206000 seconds user
0.004639000 seconds sys
It appears that cache misses are low for both b4 and b5, but branch misses are very high on my system: about 100 million in each run, or roughly two per loop iteration.
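Dividing the counters above by the 50000000 loop iterations puts them on a per-iteration scale. The following is a sketch, not part of the benchmark: PER-ITER and ITERS are hypothetical names, and floating point is used so that the totals do not have to fit in a cell.

```forth
\ Sketch only: PER-ITER and ITERS are hypothetical names, not from the post.
\ The totals are copied from the perf output above.
50000000e fconstant iters

: per-iter ( F: total -- r ) iters f/ ;

cr .( b4 cycles/iteration:        ) 14381951937e per-iter f.
cr .( b5 cycles/iteration:        ) 11529644734e per-iter f.
cr .( b4 branch misses/iteration: ) 100122231e per-iter f.
cr .( b5 branch misses/iteration: ) 100109360e per-iter f.
```

This works out to roughly 288 cycles per iteration for b4 and 231 for b5, with about 2.0 mispredicted branches per iteration in both cases. The exact formatting printed by F. depends on the system.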
-- Krishna