Re: Implementing DOES>: How not to do it (and why not) and how to do it

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:

> At least one Forth system implements DOES> inefficiently, but I
> suspect that it's not alone in that.

And indeed, a second system has the same problem; it shows up more
rarely, because normally this system inlines does>-defined words, but
when it does not, it performs badly.

Here's a microbenchmark where the second system does not inline the
does>-defined word:

50000000 constant iterations

: faccum
  create 0e f,
does> ( r1 -- r2 )
  dup f@ f+ fdup f! ;

: faccum-part2 ( r1 addr -- r2 )
  dup f@ f+ fdup f! ;

faccum x4 \ 2e x4 fdrop
faccum y4 \ -4e y4 fdrop

: b4 0e iterations 0 do x4 y4 loop ;
: b5 0e iterations 0 do
    [ ' x4 >body ] literal faccum-part2
    [ ' y4 >body ] literal faccum-part2
  loop ;
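
To make the behaviour of a FACCUM-defined word concrete, here is a
small sketch (the name ACC and the values 2e and 3e are chosen for
illustration and are not part of the benchmark). Each call adds the
stored value to its argument, stores the sum back, and returns it.
Since B4 and B5 start with 0e and both accumulators start at 0e, they
only ever add and store zeros, which is why every run below prints 0.

```forth
: faccum ( "name" -- )
  create 0e f,
does> ( r1 -- r2 )
  dup f@ f+ fdup f! ;

faccum acc        \ hypothetical accumulator, initial contents 0e

2e acc fdrop      \ stored value becomes 0e + 2e = 2e
3e acc            \ F: 5e      (2e + 3e, also stored back into ACC)
' acc >body f@    \ F: 5e 5e   (read the stored value directly)
f- f0= .          \ prints -1 (true): returned and stored sums agree
```

F- F0= is used for the comparison because F= is not a standard word.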

First, let's see what the Forth systems do by themselves (the B4
microbenchmark). The numbers are from a Skylake; I have replaced the
names of the two Forth systems with inefficient DOES> implementations
by A and B.

[~/forth:150659] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses ~/gforth/gforth-fast -e "include does-microbench.fs b4 f. cr bye"
0.

Performance counter stats for '/home/anton/gforth/gforth-fast -e include does-microbench.fs b4 f. cr bye':

   948_628_907 cycles:u
   3_695_796_028 instructions:u # 3.90 insn per cycle
   1_154_670 L1-dcache-load-misses
   198_627 L1-icache-load-misses
   306_507 branch-misses

   0.245984689 seconds time elapsed

   0.244894000 seconds user
   0.000000000 seconds sys

[~/forth:150660] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses A "include does-microbench.fs b4 f. cr bye"
0.00000000

Performance counter stats for 'A include does-microbench.fs b4 f. cr bye':

   38_769_505_700 cycles:u
   1_704_476_397 instructions:u # 0.04 insn per cycle
   178_288_238 L1-dcache-load-misses
   250_454_606 L1-icache-load-misses
   100_090_310 branch-misses

   9.719803719 seconds time elapsed

   9.715343000 seconds user
   0.000000000 seconds sys

[~/forth:150661] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses B "include does-microbench.fs b4 f. cr bye"

Including does-microbench.fs0.

Performance counter stats for 'B include does-microbench.fs b4 f. cr bye':

   39_200_313_445 cycles:u
   1_413_936_888 instructions:u # 0.04 insn per cycle
   150_445_572 L1-dcache-load-misses
   209_127_540 L1-icache-load-misses
   100_128_427 branch-misses

   9.822342252 seconds time elapsed

   9.817016000 seconds user
   0.000000000 seconds sys

So both A and B fall into the cache ping-pong and the return-stack
misprediction pitfalls in this case, resulting in a slowdown by a
factor of about 40 compared to Gforth.

Let's see how it works if we use the code I suggest (simulated in B5):

[~/forth:150662] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses ~/gforth/gforth-fast -e "include does-microbench.fs b5 f. cr bye"
0.

Performance counter stats for '/home/anton/gforth/gforth-fast -e include does-microbench.fs b5 f. cr bye':

   943_277_009 cycles:u
   3_295_795_332 instructions:u # 3.49 insn per cycle
   1_147_107 L1-dcache-load-misses
   198_364 L1-icache-load-misses
   295_186 branch-misses

   0.247765182 seconds time elapsed

   0.242645000 seconds user
   0.004044000 seconds sys

[~/forth:150663] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses A "include does-microbench.fs b5 f. cr bye"
0.00000000

Performance counter stats for 'A include does-microbench.fs b5 f. cr bye':

   23_587_381_659 cycles:u
   1_604_475_561 instructions:u # 0.07 insn per cycle
   100_111_296 L1-dcache-load-misses
   100_502_420 L1-icache-load-misses
   77_126 branch-misses

   6.055177414 seconds time elapsed

   6.055288000 seconds user
   0.000000000 seconds sys

[~/forth:150664] LC_NUMERIC=prog perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses B "include does-microbench.fs b5 f. cr bye"

Including does-microbench.fs0.

Performance counter stats for 'B include does-microbench.fs b5 f. cr bye':

   949_044_323 cycles:u
   1_313_933_897 instructions:u # 1.38 insn per cycle
   246_252 L1-dcache-load-misses
   105_517 L1-icache-load-misses
   61_449 branch-misses

   0.239750023 seconds time elapsed

   0.239811000 seconds user
   0.000000000 seconds sys

This solves both problems for B, but A still suffers from
cache ping-pong; I suspect that this is because there is not enough
distance between the modified data and FACCUM-PART2 (or, less likely,
not enough distance between the modified data and the loop in B5).
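
One way to test this suspicion would be to move the modified data far
away from any code. The following is an untested sketch of such an
experiment, not a measured result: FACCUM-FAR, X5, Y5 and B6 are
hypothetical names. The body of each defined word holds only a
pointer, and the float itself lives in ALLOCATEd memory, so the stores
never touch the dictionary. This assumes that the memory-allocation
wordset is present and that ALLOCATE returns float-aligned memory,
which is common on 64-bit systems but not guaranteed by the standard.

```forth
: faccum-far ( "name" -- )
  create
    1 floats allocate throw   \ address of a heap float (assumed float-aligned)
    dup ,                     \ body holds only the pointer; it is never rewritten
    0e f!                     \ initialise the heap float to 0e
does> ( r1 -- r2 )
  @                           \ fetch the pointer to the heap float
  dup f@ f+ fdup f! ;         \ same accumulation as FACCUM

faccum-far x5   \ hypothetical counterparts of x4 and y4
faccum-far y5

50000000 constant iterations
: b6 ( -- r ) 0e iterations 0 do x5 y5 loop ;
```

If B6 on A showed few L1 misses, that would point towards the distance
between data and code; the return-stack mispredictions of a
non-inlined DOES> call would be expected to remain either way.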

In any case, if you are a system implementor, you may want to check
your DOES> implementation with a microbenchmark that stores into the
does>-defined word in a case where that word is not inlined.
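
For a first check without perf, a rough timing inside the Forth system
may be enough. The sketch below assumes Gforth's UTIME ( -- d ), which
reports microseconds; it is not a standard word, and other systems
provide different timers. TIME-XT and SPIN are hypothetical names.

```forth
\ Assumption: UTIME ( -- d ) is Gforth-specific and counts microseconds.
: time-xt ( xt -- )
  utime 2>r          \ keep the start time on the return stack
  execute
  utime 2r> d-       \ elapsed = end - start
  d. ." us" cr ;

\ Tiny stand-in workload so that this block runs on its own:
: spin ( -- ) 1000000 0 do loop ;
' spin time-xt

\ With does-microbench.fs loaded, the two variants could be compared with:
\   ' b4 time-xt fdrop
\   ' b5 time-xt fdrop
```

A B4 time that is many times larger than the B5 time would suggest
that the DOES> call path, rather than the accumulation itself, is
expensive on that system.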

- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
   New standard: https://forth-standard.org/
   EuroForth 2024: https://euro.theforth.net
