Interpreters and indirect-branch prediction (was: Reverse ...)

Subject: Interpreters and indirect-branch prediction (was: Reverse ...)
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: 14 Nov 2024, 09:35:41
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Nov14.093541@mips.complang.tuwien.ac.at>
References: 1 2 3 4 5 6 7 8 9 10 11
User-Agent: xrn 10.11
Thomas Koenig <tkoenig@netcologne.de> writes:
>[Why indirect-branch predictors work in "bytecode" interpreters]
>To me, that looks like the bytecode is insufficiently expressive, if
>an often-repeating sequence occurs :-)

As already mentioned, what a hardware branch predictor works on is the
dynamic sequence of branches.

This might come from a straight-line sequence of VM instructions that
occurs only once, in one program and nowhere else; e.g., in a loop
(as mentioned earlier), or in a subroutine that's called from several
places.  If that one program is not among the programs that the VM
designer used for determining frequent code sequences that should be
turned into static VM superinstructions, on what basis should that VM
superinstruction be added?  Even if that program is among those
programs, should this VM superinstruction be added?  It's only useful
in one program.

But the hardware branch prediction is also based on dynamic sequences
that are not statically sequential.  E.g., there might be a VM branch
(possibly conditional or indirect), a VM call or a VM return in that
sequence.  The hardware branch predictor will use the repetition of
the sequence for prediction, but it's usually impractical to form
static VM superinstructions from such sequences.

And in any case, whatever VM superinstructions you use, at some point
every VM superinstruction ends with an indirect branch, and then you
need a good hardware indirect-branch predictor, or the performance
will suffer.

If you have a good indirect-branch predictor, there are no such
boundaries; it predicts the indirect branch of every VM instruction
implementation based on the preceding branches and their targets.

In the presence of a good predictor, it would still be an interesting
question whether a more compressed version would actually perform better.

There is a benefit to using VM superinstructions: the number of
instruction dispatches is reduced, and consequently less code has to
be executed; fewer branches are also performed, so the
instruction fetcher can be more effective.

You can see the effect in the difference between "--no-super" and
"default" in
<http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>.
Here's that data extracted for your convenience; result on Haswell:

|Run time in seconds:
|
|--no-super  default
|     2.276    1.468  benchgc
|     1.064    0.544  brainless
|     2.288    1.520  brew
|     2.232    1.232  cd16sim
|     1.484    0.864  fcp
|     9.280    7.840  lexex

|For mispredictions the picture was as follows:
|
|    --no-super     default
|    12,584,698   9,426,835  benchgc
|    51,375,412  18,391,920  brainless
|    47,928,843  21,390,209  brew
|    70,879,605  22,105,620  cd16sim
|    33,496,775  17,280,944  fcp
|     8,269,104   6,885,712  lexex

"default" here is dynamic superinstructions, i.e., every straight-line
sequence of code (even across conditional branches) is turned into a
VM superinstruction, so the only indirect branches are for taken
VM-level branches.  As you can see, the speedup from dynamic
superinstructions is quite nice, and only a little of it is due to
better branch predictions.  E.g., in the cd16sim case (with the largest
reduction in mispredictions), the mispredictions account for about

((71-22)M mispredictions) * (20 cycles/misprediction) / (4400M cycles/s) = 0.22s

out of the 1s runtime difference, so about 0.78s is due to the
reduction in dispatch code.

These dynamic superinstructions are formed by concatenating the code
snippets of the individual VM instructions, so eliminating dispatch
code is their only benefit.

In recent work [ertl&paysan24] we have also reduced VM-instruction
pointer updates in Gforth, and found speedups by a factor >2 on
loop-dominated benchmarks on recent high-performance
microarchitectures; our explanation is that the VM instruction pointer
updates are the critical dependence path in executing these benchmarks
unless we use our reduction techniques.

In the early 2000s we looked at using static VM superinstructions.  At
the time the main effect was the reduction in mispredicted indirect
branches, but they can also have the effect of reducing other
instructions.  E.g., in Gforth (engine gforth-fast) the VM instruction
F@ has the following code snippet (Intel-style syntax):

560BC7692D89:   add     r13,$08
560BC7692D8D:   movsd   [r12],xmm15
560BC7692D93:   movsd   xmm15,[r8]
560BC7692D98:   sub     r12,$08
560BC7692D9C:   mov     r8,$00[r13]

and the VM instruction F+ has the following code snippet:

560BC7692E3B:   mov     rax,r12
560BC7692E3E:   lea     r12,$08[r12]
560BC7692E43:   addsd   xmm15,$08[rax]

Register assignments:

r13: data-stack pointer
r8: top of data stack
r12: FP-stack pointer
xmm15: top of FP stack

(Don't ask me why gcc introduced the mov in 560BC7692E3B rather than
doing the lea afterwards or using the result of the lea in the addsd.)

Gforth has a static VM superinstruction for the sequence "F@ F+".
Its code is:

7FC65EA14E20:   add     r13,$08
7FC65EA14E24:   addsd   xmm15,[r8]
7FC65EA14E29:   mov     r8,$00[r13]

In this case, the static VM superinstruction avoided a push and pop
from the FP stack, and the associated code (FP-stack pointer update,
and storing and loading the top of the FP stack from memory).

We first implemented static superinstructions and tried up to 1600 of
them [ertl&gregg03], mainly for improved branch prediction on the CPUs
of the time (which used the BTB for indirect-branch prediction).  Once
dynamic superinstructions (with replication) solved the branch-prediction
problem, the remaining benefit of static superinstructions was the one
shown above, which helps only for some sequences, not others.  In
Gforth 0.7 we have 13 static superinstructions; in the
current development version we have 41.  Among the reasons to be
stingy with adding static superinstructions are:

* The more VM instruction implementations we add, the longer the
  compile time of the VM becomes.

* The effect of static superinstructions in addition to dynamic
  superinstructions is quite small (see "with static super" in Figure
  10 [ertl&gregg03]); and the first few provide most of the benefit
  (see below).

* At one point we had 400, but the engine crashed (probably not with
  all gcc versions, or we would not have been able to produce numbers
  for up to 1600).  So we changed course and kept static
  superinstructions only for the most frequent sequences where we also
  expect a benefit like the one shown above.

Looking at our 2003 paper [ertl&gregg03] again, I also see an answer
for your first question: Look at Figure 13.  There you see the
effect of static superinstructions and static replicas of VM
instructions.  Static replicas are just that: Instead of + you now
have +A +B +C, and use them round-robin whenever an instance of +
occurs in the VM program; the idea here was just to improve the branch
prediction on microarchitectures with BTB-based indirect-branch
prediction.  It turns out that, in the balance between static
superinstructions and static replicas, a relatively small number of
static superinstructions (e.g., about 200 out of a total of 1600
additional VM instruction implementations, or 35 out of a total of
400) provides a benefit, but increasing the number of
superinstructions further produces a slowdown.  My explanation is that
the additional superinstructions reduce the number of replicas, but
help in only a few places, whereas the additional replicas help
in a lot of places.

Conversely, many sequences that occur throughout the code are not
covered even by 1600 static superinstructions.
So how many static superinstructions would you want to add before you
consider the VM to be "sufficiently expressive"?

- anton

@InProceedings{ertl&paysan24,
  author = {Ertl, M. Anton and Paysan, Bernd},
  title = {{The Performance Effects of Virtual-Machine Instruction Pointer Updates}},
  booktitle = {38th European Conference on Object-Oriented Programming (ECOOP 2024)},
  pages = {14:1--14:26},
  series = {Leibniz International Proceedings in Informatics (LIPIcs)},
  ISBN = {978-3-95977-341-6},
  ISSN = {1868-8969},
  year = {2024},
  volume = {313},
  editor = {Aldrich, Jonathan and Salvaneschi, Guido},
  publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik},
  address = {Dagstuhl, Germany},
  URL = {https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ECOOP.2024.14},
  URN = {urn:nbn:de:0030-drops-208634},
  doi = {10.4230/LIPIcs.ECOOP.2024.14},
  annote = {Keywords: virtual machine, interpreter, out-of-order execution},
  abstract =     {How much performance do VM instruction-pointer (IP)
                  updates cost and how much benefit do we get from
                  optimizing them away?  Two decades ago it had little
                  effect on the hardware of the day, but on recent
                  hardware the dependence chain of IP updates can
                  become the critical path on processors with
                  out-of-order execution.  In particular, this happens
                  if the VM instructions are light-weight and the
                  application programs are loop-dominated.  The
                  present work presents several ways of reducing or
                  eliminating the dependence chains from IP updates,
                  either by breaking the dependence chains with the
                  \emph{l}oop optimization or by reducing the number
                  of IP updates (the \emph{c} and \emph{ci}
                  optimizations) or their latency (the \emph{b}
                  optimization).  Some benchmarks see speedups from
                  these optimizations by factors $>2$ on most recent
                  cores, while other benchmarks and older cores see
                  more modest results, often in the speedup ranges
                  1.1--1.3.}
}

@InProceedings{ertl&gregg03,
  author =       "M. Anton Ertl and David Gregg",
  title =        "Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters",
  crossref =     "sigplan03",
  OPTpages = "",
  url = "https://www.complang.tuwien.ac.at/papers/ertl%26gregg03.ps.gz",
  abstract =     "Interpreters designed for efficiency execute a huge
                  number of indirect branches and can spend more than
                  half of the execution time in indirect branch
                  mispredictions.  Branch target buffers are the best
                  widely available form of indirect branch prediction;
                  however, their prediction accuracy for existing
                  interpreters is only 2\%--50\%.  In this paper we
                  investigate two methods for improving the prediction
                  accuracy of BTBs for interpreters: replicating
                  virtual machine (VM) instructions and combining
                  sequences of VM instructions into superinstructions.
                  We investigate static (interpreter build-time) and
                  dynamic (interpreter run-time) variants of these
                  techniques and compare them and several combinations
                  of these techniques.  These techniques can eliminate
                  nearly all of the dispatch branch mispredictions,
                  and have other benefits, resulting in speedups by a
                  factor of up to 3.17 over efficient threaded-code
                  interpreters, and speedups by a factor of up to 1.3
                  over techniques relying on superinstructions alone."
}

@Proceedings{sigplan03,
  booktitle =    "SIGPLAN Conference on Programming Language
                  Design and Implementation (PLDI'03)",
  title =        "SIGPLAN Conference on Programming Language
                  Design and Implementation (PLDI'03)",
  year =         "2003",
  key =          "PLDI '03"
}

--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
