Newsportal USENET - Re: Implementing DOES>: How not to do it (and why not) and how to do it

On 7/14/24 13:32, Krishna Myneni wrote:

On 7/14/24 07:20, Krishna Myneni wrote:
On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
<SNIP>
>
In any case, if you are a system implementor, you may want to check
your DOES> implementation with a microbenchmark that stores into the
does-defined word in a case where that word is not inlined.
>
Is that equally valid for indirect threaded code?
In indirect threaded code the instruction and data cache
are more separated, e.g. in a simple Forth all the low level
code could fit in the I-cache, if I'm not mistaken.
>
>
>
Let's check. In kForth-64, an indirect threaded code system,
>
.s
<empty>
  ok
f.s
fs: <empty>
  ok
ms@ b4 ms@ swap - .
4274 ok
ms@ b5 ms@ swap - .
3648 ok
>
So b5 appears to be more efficient than b4 (the version with DOES>).
>
-- Krishna
>
=== begin code ===
50000000 constant iterations
>
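\ Defining word: each child reserves one float, initialized to 0.0e.
\ When executed, the child adds its stored value to the FP top of stack
\ and stores the sum back, leaving the sum on the FP stack.
: faccum create 1 floats allot? 0.0e f!
     does> dup f@ f+ fdup f! ;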
>
: faccum-part2 ( F: r1 -- r2 ) ( a -- )
     dup f@ f+ fdup f! ;
>
faccum x4 2.0e x4 fdrop
faccum y4 -4.0e y4 fdrop
>
: b4 0.0e iterations 0 do x4 y4 loop ;
: b5 0.0e iterations 0 do
     [ ' x4 >body ] literal faccum-part2
     [ ' y4 >body ] literal faccum-part2
   loop ;
=== end code ===
>
Using perf to obtain performance-counter statistics for the B4 and B5 microbenchmarks:
B4
$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64 -e "include does-microbench.4th b4 f. cr bye"
-inf
Goodbye.
Performance counter stats for 'kforth64 -e include does-microbench.4th b4 f. cr bye':
       14_381_951_937      cycles:u
        26_206_810_946      instructions:u     #    1.82 insn per cycle
              58_563        L1-dcache-load-misses:u
              14_742        L1-icache-load-misses:u
          100_122_231       branch-misses:u
       4.501011307 seconds time elapsed
       4.477172000 seconds user
        0.003967000 seconds sys
B5
$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64 -e "include does-microbench.4th b5 f. cr bye"
-inf
Goodbye.
Performance counter stats for 'kforth64 -e include does-microbench.4th b5 f. cr bye':
       11_529_644_734      cycles:u
        18_906_809_683      instructions:u      #    1.64 insn per cycle
              59_605        L1-dcache-load-misses:u
              21_531        L1-icache-load-misses:u
          100_109_360       branch-misses:u
       3.616353010 seconds time elapsed
       3.600206000 seconds user
        0.004639000 seconds sys
It appears that the cache-miss counts are fairly small for both b4 and b5, but the branch misses are very high on my system.
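
As a rough cross-check of the counts above, the totals can be divided by the loop count to get events per iteration. This is only a sketch: the counter values are copied from the perf output above, the helper name per_iteration is made up for illustration, and the totals also include kforth64 startup and file loading, so the per-iteration figures are approximate.

```python
# Counter totals copied from the perf output above (AMD A10-9600P run).
ITERATIONS = 50_000_000  # "iterations" constant from the benchmark source

counters = {
    "b4": {"branch_misses": 100_122_231, "l1d_misses": 58_563, "l1i_misses": 14_742},
    "b5": {"branch_misses": 100_109_360, "l1d_misses": 59_605, "l1i_misses": 21_531},
}

def per_iteration(count, iterations=ITERATIONS):
    """Average number of events per loop iteration (hypothetical helper)."""
    return count / iterations

for name, events in counters.items():
    summary = {event: round(per_iteration(n), 4) for event, n in events.items()}
    print(name, summary)
```

Both versions come out at about 2.0 branch misses per iteration, while the cache misses are well below 0.01 per iteration. Since each iteration executes two accumulator updates, that would be consistent with roughly one misprediction per update in b4 and b5 alike, although the counters alone do not identify which branch is mispredicted.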

The prior micro-benchmarks were run on an old AMD A10-9600P @ 2.95 GHz.
On a newer system with an Intel Core i5-8400 @ 2.8 GHz, the branch misses were quite few.
B4
$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64 -e "include faccum.4th b4 f. cr bye"
0
Goodbye.
Performance counter stats for 'kforth64 -e include faccum.4th b4 f. cr bye':
        7_847_499_582      cycles:u
       26_206_205_780      instructions:u     #    3.34 insn per cycle
               67_785      L1-dcache-load-misses:u
               65_391      L1-icache-load-misses:u
               38_308      branch-misses:u
       2.014078890 seconds time elapsed
2.010013000 seconds user
0.000999000 seconds sys
B5
$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64 -e "include faccum.4th b5 f. cr bye"
0
Goodbye.
Performance counter stats for 'kforth64 -e include faccum.4th b5 f. cr bye':
        5_314_718_609      cycles:u
       18_906_206_178      instructions:u     #    3.56 insn per cycle
               64_150      L1-dcache-load-misses:u
               44_818      L1-icache-load-misses:u
               29_941      branch-misses:u
       1.372367863 seconds time elapsed
1.367289000 seconds user
0.002989000 seconds sys
The efficiency difference is due entirely to the number of instructions being executed for B4 and B5.
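
The instruction and cycle counts from both machines can be compared with a few lines of arithmetic. This is a sketch only: the numbers are copied from the perf output above, the function name compare and the assumption of two DOES>-word calls per loop iteration (x4 and y4) are mine, and the totals include interpreter startup.

```python
ITERATIONS = 50_000_000
CALLS_PER_ITERATION = 2  # assumption: x4 and y4 each run once per loop pass

# (cycles, instructions) copied from the perf output above
runs = {
    "AMD A10-9600P": {"b4": (14_381_951_937, 26_206_810_946),
                      "b5": (11_529_644_734, 18_906_809_683)},
    "Intel i5-8400": {"b4": (7_847_499_582, 26_206_205_780),
                      "b5": (5_314_718_609, 18_906_206_178)},
}

def compare(b4, b5):
    """Relate the instruction-count gap to the cycle-count gap (hypothetical helper)."""
    cycles4, insns4 = b4
    cycles5, insns5 = b5
    return {
        "extra_insns_per_call": (insns4 - insns5) / (ITERATIONS * CALLS_PER_ITERATION),
        "insn_ratio": insns4 / insns5,
        "cycle_ratio": cycles4 / cycles5,
    }

for machine, r in runs.items():
    result = compare(r["b4"], r["b5"])
    print(machine, {key: round(value, 3) for key, value in result.items()})
```

On both machines b4 executes about 73 more machine instructions per accumulator call than b5, an instruction ratio of about 1.39. The cycle ratio is about 1.25 on the AMD system and about 1.48 on the Intel system, so the instruction count accounts for most of the gap, with the remainder reflecting the different instructions-per-cycle figures of the two runs.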
--
KM
