Implementing DOES>: How not to do it (and why not) and how to do it

Subject: Implementing DOES>: How not to do it (and why not) and how to do it
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Date: 11 Jul 2024, 15:06:02
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Jul11.160602@mips.complang.tuwien.ac.at>
User-Agent: xrn 10.11
At least one Forth system implements DOES> inefficiently, but I
suspect that it's not alone in that.  I have reported the issue to the
system vendor, and hopefully they will fix it.  Here I discuss the
issue so that possibly other system implementors can avoid these
pitfalls.  I do not name the system here to protect the guilty, but
make no additional effort at anonymizing it.

Here's a microbenchmark that puts a spotlight on the issue.  I started
investigating it because the system performed badly on an application
benchmark that calls DOES>-defined words a lot, so this microbenchmark
is not just contrived.

50000000 constant iterations

: d1 create 0 , does> ;  \ children have a one-cell body; the DOES> part is empty
: d1-part2 ;             \ colon-definition stand-in for the (empty) code behind DOES>

d1 x1
d1 y1

: b1 iterations 0 do x1 dup ! y1 dup ! loop ;
: c1 iterations 0 do x1 drop  y1 drop loop ;

: b3
    iterations 0 do
        [ ' x1 >body ] literal d1-part2 dup !
        [ ' y1 >body ] literal d1-part2 dup !
    loop ;
: c3
    iterations 0 do
        [ ' x1 >body ] literal d1-part2 drop
        [ ' y1 >body ] literal d1-part2 drop
    loop ;

B1 and C1 use the DOES>-defined words in the way the system implements
them, while B3 and C3 use them in the way the system should implement
them (when COMPILE,ing the xt of X1 and Y1): Put the address of the
body on the data stack as a literal and then call the code behind
DOES> (or in this case a colon definition that contains the code).
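
To make that concrete, here is a minimal sketch of what COMPILE, for a
DOES>-defined word could do.  This is not taken from any particular
system; DOES-CODE@ is a hypothetical accessor that, given the xt of a
DOES>-defined word, delivers the xt of the code behind DOES>:

\ Sketch only; the details vary between systems.
: compile-does-word ( xt -- )
    dup >body postpone literal   \ compile the body address as a literal
    does-code@ compile, ;        \ compile a call to the code behind DOES>

For X1 this would produce the same code as the X1 part of B3/C3: the
body address as a literal, followed by a call to the DOES> code.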

Let's see the performance (all numbers from a Tiger Lake), first for
C3/C1:

perf stat -e cycles:u -e instructions:u -e L1-dcache-load-misses -e L1-icache-load-misses -e branch-misses sf64 "include does-microbench.fs c3 bye"


 Performance counter stats for 'sf64 include does-microbench.fs c3 bye':

  402963130      cycles:u
  804105655      instructions:u                   #    2.00  insn per cycle
      83766      L1-dcache-load-misses
    1750283      L1-icache-load-misses
      36403      branch-misses

0.114868603 seconds time elapsed

The code for the X1 part of the loop here is:

44EB26   -8 [RBP] RBP LEA               488D6DF8
44EB2A   RBX 0 [RBP] MOV                48895D00
44EB2E   44E970 # EBX MOV               BB70E94400
44EB33   44E94D ( d1-part2 ) CALL       E815FEFFFF
44EB38   0 [RBP] RBX MOV                488B5D00
44EB3C   8 [RBP] RBP LEA                488D6D08

and D1-PART2 is:

44E94D   RET                            C3

This loop takes 8 cycles per iteration (402963130 cycles / 50000000
iterations; about half of that for each simulated DOES>-defined
word).  Now for C1:

C1:

3502930384      cycles:u
  903847649      instructions:u                   #    0.26  insn per cycle
      93579      L1-dcache-load-misses
    1813286      L1-icache-load-misses
  100033846      branch-misses

0.846190766 seconds time elapsed

This loop takes 70 cycles per iteration (3502930384 cycles / 50000000
iterations), i.e., almost 9 times slower than C3, and incurs one
branch misprediction per call of a DOES>-defined word.  Let's see why:

The code in the loop for the X1 part is:

44EA42   44E96B ( x1 ) CALL             E824FFFFFF
44EA47   0 [RBP] RBX MOV                488B5D00
44EA4B   8 [RBP] RBP LEA                488D6D08

It calls X1:

44E96B   44E927 ( d1 +1C ) CALL         E8B7FFFFFF

which in turn calls this code:

44E927   -8 [RBP] RBP LEA               488D6DF8
44E92B   RBX 0 [RBP] MOV                48895D00
44E92F   RBX POP                        5B
44E930   RET                            C3

Here the return address of the call at 44E96B is popped and used as
the body address of X1.  However, the hardware return stack predicts
that the following RET returns to that return address, which is a
misprediction, because the RET actually returns to the return address
of the outer call (44EA47).  You can see here that mispredictions are
expensive.
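
Spelled out with the addresses from the listings above (the
instructions are unchanged, only annotations are added):

44EA42   44E96B ( x1 ) CALL        \ pushes 44EA47, also onto the predictor's return stack
44E96B   44E927 ( d1 +1C ) CALL    \ pushes 44E970, the body address of X1
44E927   -8 [RBP] RBP LEA
44E92B   RBX 0 [RBP] MOV
44E92F   RBX POP                   \ pops 44E970 as the body address; the predictor does not see this
44E930   RET                       \ predicted target: 44E970, actual target: 44EA47 -> misprediction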

Let's turn to B1.  The code for the X1 part of the loop is:

44E9CE   44E96B ( x1 ) CALL             E898FFFFFF
44E9D3   -8 [RBP] RBP LEA               488D6DF8
44E9D7   RBX 0 [RBP] MOV                48895D00
44E9DB   0 [RBP] RAX MOV                488B4500
44E9DF   RAX 0 [RBX] MOV                488903
44E9E2   8 [RBP] RBX MOV                488B5D08
44E9E6   10 [RBP] RBP LEA               488D6D10

The results are:

 44758764805      cycles:u
  1303768694      instructions:u                   #    0.03  insn per cycle
   150154275      L1-dcache-load-misses
   202142121      L1-icache-load-misses
   100051173      branch-misses

10.699859443 seconds time elapsed

10.696626000 seconds user
 0.004000000 seconds sys

So in addition to the branch mispredictions that we also saw in C1, we
see 3 D-cache misses and 4 I-cache misses per iteration, resulting in
895 cycles/iteration (44758764805 cycles / 50000000 iterations).  A
part of this cache ping-pong occurs because the call at 44E96B is in
the same cache line as the stored-to data at 44E970.  The store wants
the cache line in the D-cache, while executing the call at 44E96B
wants it in the I-cache.  This is an issue I first pointed out here in
1995, and Forth systems still have not fixed it completely 29 years
later.
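
For illustration, the condition can be written down directly.  This is
just a sketch: SAME-LINE? is not a standard or system word, and 64
bytes is an assumed cache-line size:

\ Do two addresses fall into the same 64-byte cache line?
: same-line? ( addr1 addr2 -- f )  6 rshift swap 6 rshift = ;

hex 44E96B 44E970 same-line? . decimal  \ prints -1: call and data share a line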

Let's see if the B3 variant escapes this problem:

20590473606      cycles:u
 1204106872      instructions:u                   #    0.06  insn per cycle
   50145367      L1-dcache-load-misses
  101982668      L1-icache-load-misses
      49127      branch-misses

4.932825183 seconds time elapsed

4.926391000 seconds user
0.003998000 seconds sys

It's better (about 412 cycles per iteration), but there are still one
D-cache miss and 2 I-cache misses per iteration.  It seems that the
distance between the D1-PART2 code and the X1 data is not large enough
to completely avoid the issue (but this is just a guess).

In any case, having the call right before the data, and executing it
on every call to a DOES>-defined word, is responsible for 2 I-cache
misses and 2 D-cache misses per iteration, so it should be avoided by
generating code like that in C3 and B3.

For dealing with the remaining cache-consistency problems, most Forth
systems have chosen to put padding between code and data, and to
increase the padding in response to slowness reports.  An alternative
is to put the code in a memory region separate from the data.
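
To illustrate the padding idea at the user level (the systems do this
internally, and 64 is just an assumed cache-line size), one can keep
the stored-to cell away from the cache line that holds the code right
before the body; note that this only addresses the cache-line sharing,
not the return-address misprediction discussed above:

\ Sketch only: padding between the start of the body and the payload cell.
: d1-padded ( "name" -- )
    create 64 allot 0 ,           \ one cache line of padding, then the payload
    does> ( -- addr ) 64 + ;      \ skip the padding, leave the payload address

d1-padded x2
x2 dup !   \ the store now hits a different cache line than the code before the body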

- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
   EuroForth 2024: https://euro.theforth.net
