Re: Implementing DOES>: How not to do it (and why not) and how to do it

Subject: Re: Implementing DOES>: How not to do it (and why not) and how to do it
From: krishna.myneni (at) *nospam* ccreweb.org (Krishna Myneni)
Newsgroups: comp.lang.forth
Date: 14 Jul 2024, 20:28:33
Organization: A noiseless patient Spider
Message-ID: <v718t1$9agb$1@dont-email.me>
References: 1 2 3 4 5
User-Agent: Mozilla Thunderbird
On 7/14/24 13:32, Krishna Myneni wrote:
On 7/14/24 07:20, Krishna Myneni wrote:
On 7/14/24 04:02, albert@spenarnc.xs4all.nl wrote:
In article <2024Jul13.173138@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
<SNIP>

In any case, if you are a system implementor, you may want to check
your DOES> implementation with a microbenchmark that stores into the
does-defined word in a case where that word is not inlined.

Is that equally valid for indirect threaded code?
In indirect threaded code the instruction and data cache
are more separated, e.g. in a simple Forth all the low level
code could fit in the I-cache, if I'm not mistaken.

Let's check. In kForth-64, an indirect threaded code system,

.s
<empty>
  ok
f.s
fs: <empty>
  ok
ms@ b4 ms@ swap - .
4274  ok
ms@ b5 ms@ swap - .
3648  ok

So b5 appears to be more efficient than b4 (the version with DOES>).

-- Krishna

=== begin code ===
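50000000 constant iterations

\ Defining word: each child owns one float cell, initialized to 0.0e.
\ Executing a child adds its stored value to the FP top of stack,
\ stores the sum back, and leaves the sum on the FP stack.
: faccum  create 1 floats allot? 0.0e f!
     does> dup f@ f+ fdup f! ;

\ Same run-time action, written as a colon word that takes the body address.
: faccum-part2 ( F: r1 -- r2 ) ( a -- )
     dup f@ f+ fdup f! ;

faccum x4  2.0e x4 fdrop
faccum y4 -4.0e y4 fdrop

: b4 0.0e iterations 0 do x4 y4 loop ;
: b5 0.0e iterations 0 do
     [ ' x4 >body ] literal faccum-part2
     [ ' y4 >body ] literal faccum-part2
   loop ;
=== end code ===

Using perf to obtain microbenchmark results for B4 and B5: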
 B4
 $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64 -e "include does-microbench.4th b4 f. cr bye"
-inf
Goodbye.
   Performance counter stats for 'kforth64 -e include does-microbench.4th b4 f. cr bye':
         14_381_951_937      cycles:u
        26_206_810_946      instructions:u     #    1.82  insn per cycle
              58_563        L1-dcache-load-misses:u
              14_742        L1-icache-load-misses:u
          100_122_231       branch-misses:u
         4.501011307 seconds time elapsed
         4.477172000 seconds user
        0.003967000 seconds sys
  B5
 $ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64 -e "include does-microbench.4th b5 f. cr bye"
-inf
Goodbye.
   Performance counter stats for 'kforth64 -e include does-microbench.4th b5 f. cr bye':
         11_529_644_734      cycles:u
        18_906_809_683      instructions:u      #    1.64  insn per cycle
              59_605        L1-dcache-load-misses:u
              21_531        L1-icache-load-misses:u
          100_109_360       branch-misses:u
         3.616353010 seconds time elapsed
         3.600206000 seconds user
        0.004639000 seconds sys
It appears that cache misses are low for both b4 and b5, but branch misses are very high on my system (about 100 million in each run).
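To put those counters in proportion, here is a small Python sketch (not part of the benchmark itself) that divides the event counts from the perf output above by the loop count. The helper name per_iteration and the dictionary layout are only illustrative. The totals also include kForth start-up and the include of the source file, so the per-iteration figures are approximate. On these numbers, both versions come out at roughly two branch misses per loop iteration, while the cache misses per iteration are negligible.

```python
# Event counts copied from the perf output above (AMD A10-9600P runs).
ITERATIONS = 50_000_000  # value of the "iterations" constant in the benchmark

amd = {
    "b4": {
        "branch_misses": 100_122_231,
        "dcache_misses": 58_563,
        "icache_misses": 14_742,
    },
    "b5": {
        "branch_misses": 100_109_360,
        "dcache_misses": 59_605,
        "icache_misses": 21_531,
    },
}


def per_iteration(count, iterations=ITERATIONS):
    """Average number of events per loop iteration (illustrative helper)."""
    return count / iterations


for name, counters in amd.items():
    for event, count in counters.items():
        print(f"{name} {event}: {per_iteration(count):.4f} per iteration")
```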
 
The prior micro-benchmarks were run on an old AMD A10-9600P @ 2.95 GHz.
On a newer system with an Intel Core i5-8400 @ 2.8 GHz, the branch misses were quite few.
B4
$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64 -e "include faccum.4th b4 f. cr bye"
0
Goodbye.
  Performance counter stats for 'kforth64 -e include faccum.4th b4 f. cr bye':
         7_847_499_582      cycles:u
        26_206_205_780      instructions:u     #    3.34  insn per cycle
                67_785      L1-dcache-load-misses:u
                65_391      L1-icache-load-misses:u
                38_308      branch-misses:u
         2.014078890 seconds time elapsed
        2.010013000 seconds user
        0.000999000 seconds sys
B5
$ LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses kforth64 -e "include faccum.4th b5 f. cr bye"
0
Goodbye.
  Performance counter stats for 'kforth64 -e include faccum.4th b5 f. cr bye':
         5_314_718_609      cycles:u
        18_906_206_178      instructions:u     #    3.56  insn per cycle
                64_150      L1-dcache-load-misses:u
                44_818      L1-icache-load-misses:u
                29_941      branch-misses:u
         1.372367863 seconds time elapsed
        1.367289000 seconds user
        0.002989000 seconds sys
The efficiency difference is due entirely to the number of instructions being executed for B4 and B5.
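As a rough check on the instruction counts, the following Python sketch works out the gap between B4 and B5 per loop iteration from the perf figures above. The names compare and CALLS_PER_ITERATION are illustrative only, and the totals include interpreter start-up, so treat the results as approximate. On both machines the gap comes to about 146 instructions per iteration, or about 73 per x4 or y4 invocation compared with the literal plus faccum-part2 version, and the instruction counts are nearly identical across the two CPUs.

```python
ITERATIONS = 50_000_000      # value of the "iterations" constant
CALLS_PER_ITERATION = 2      # each loop pass runs x4 and y4 once

# instructions:u values copied from the perf output above
instructions = {
    "AMD A10-9600P": {"b4": 26_206_810_946, "b5": 18_906_809_683},
    "Intel Core i5-8400": {"b4": 26_206_205_780, "b5": 18_906_206_178},
}


def compare(b4_insns, b5_insns):
    """Instruction gap between B4 and B5 (illustrative helper)."""
    extra = (b4_insns - b5_insns) / ITERATIONS
    return {
        "extra_per_iteration": extra,
        "extra_per_call": extra / CALLS_PER_ITERATION,
        "ratio": b4_insns / b5_insns,
    }


results = {cpu: compare(c["b4"], c["b5"]) for cpu, c in instructions.items()}

for cpu, res in results.items():
    print(
        f"{cpu}: {res['extra_per_iteration']:.1f} extra insns/iteration, "
        f"{res['extra_per_call']:.1f} per call, ratio {res['ratio']:.3f}"
    )
```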
--
KM

