Interpreters and indirect-branch prediction

Liste des GroupesRevenir à c arch 
Sujet : Interpreters and indirect-branch prediction
De : anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Groupes : comp.arch
Date : 13. Nov 2024, 09:20:27
Autres entêtes
Organisation : Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID : <2024Nov13.092027@mips.complang.tuwien.ac.at>
References : 1 2 3 4 5 6 7 8 9 10
User-Agent : xrn 10.11
mitchalsup@aol.com (MitchAlsup1) writes:
For something like a bytecode interpreter, the prediction accuracy
of the jump predictor is going to be epically bad.

And with "jump predictor", you probably mean the indirect-branch
predictor.

On hardware of up to around 2005 where the indirect branches were at
best predicted by BTBs with 2-bit counters, a "bytecode interpreter"
with one shared indirect branch (as generated by the usual
for...{switch...{...}} idiom in C) results in misprediction rates of
80% [ertl&gregg03jilp].  With one a non-shared indirect branch per VM
instruction implementation (as in typical threaded-code
implementations), we measured 50%-61% mispredictions for BTB with
2-bit counters in simulation [ertl&gregg03jilp] and similar results on
actual CPUs.

We simulated 2-level history-based indirect branch predictors
[ertl&gregg03jilp], and they typically have less than 10%
mispredictions.

Hardware designers improved indirect-branch predictors to deal with
more history than BTBs with 2-bit counters already in Dothan (launched
2004) and it's descendents (in particular, the Core 2 line), judging
by the results of <http://www.complang.tuwien.ac.at/forth/threading/>
(look in the column "switch").

In 2015 Rohou et al. measured the prediction accuracy in Python,
JavaScript, and a CLI (.NET virtual machine, also known under other
names) on Nehalem (2009, Sandy Bridge (2011), Haswell (2013), and a
simulated TAGE1 branch predictor); unfortunately, they gave results in
mispredictions per (kilo-)instructions rather than mispredictions per
prediction and they used different interpreters, so you cannot compare
that work with our earlier work.  In any case, they found good
improvements from Nehalem to Haswell.

Inspired by this work, I did some measurements of Gforth, disabling
various optimizations (which reduce the branch mispredictions) and
posted them here.  You can find the sequence of postings here:
<http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>.
However, I did not measure the prediction accuracy, because the idea
was to check whether the claims made in the paper (e.g., "Modern
predictors no longer need [superinstructions] to capture patterns in
applications." or "[prediction accuracy] can not be considered as an
obstacle for performance anymore.") are true.

In the Haswell results, using an interpreter with non-shared indirect
branches (non-shared --no-dynamic) does indeed often have a similar
prediction accuracy as the interpreter with one shared indirect branch
(shared --no-dynamic), but eliminating the sharing still provides a
small speedup; and dynamic replication (--no-super), which results in
having one indirect branch per instance of the VM instruction in the
interpreted program, i.e., no sharing between different instances of
the same VM instruction, provides a good reduction in branch
mispredictions and in execution time on several benchmarks.  And given
that "non-shared --no-dynamic" and "--no-super" actually execute the
same number and sequence of instructions (only with replication in the
"--no-super" case), any improvement in run-time is only due to the
improved branch prediction; here I only show these two columns, to
avoid confusion; look at
<http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>
for the full data set:

Run time in seconds:

non-shared
     --no-      --no-
   dynamic      super
     2.440      2.276 benchgc 
     1.524      1.064 brainless
     3.416      2.288 brew    
     3.236      2.232 cd16sim 
     2.684      1.484 fcp     
     9.848      9.280 lexex   

For mispredictions the picture was as follows:

 non-shared
      --no-       --no-
    dynamic       super
 17,005,970  12,584,698 benchgc 
178,234,917  51,375,412 brainless
279,851,195  47,928,843 brew    
297,000,355  70,879,605 cd16sim 
293,473,631  33,496,775 fcp     
 12,267,065   8,269,104 lexex   

So the improvment in branch prediction provides speedup factors
between 1.06 and 1.81 on these benchmarks.  Plus, the results for the
interpreters with a shared indirect branch (what the usual way of
writing a switch-based interpreter gives you) are even slower than for
the "non-shared --no-dynamic".  So to paraphrase the title of Rohou et
al.'s paper, don't trust Rohou et al.

However, I expect that the indirect branch predictors have improved
since Haswell, though, and maybe we now see much lower numbers of
mispredictions even for "non-shared --no-dynamic" and for "shared
--no-dynamic", but I would have to measure that.

Anyway, back to Mitch Alsup's original claim: No, for a bytecode
interpreter the prediction accuracy of an indirect-branch predictor is
not necessarily going to be especially bad.  It depends on how good
the indirect-branch predictor is, and on the benchmark.

@Article{ertl&gregg03jilp,
  author = {M. Anton Ertl and David Gregg},
  title = {The Structure and Performance of \emph{Efficient}
                  Interpreters},
  journal = {The Journal of Instruction-Level Parallelism},
  year = {2003},
  volume = {5},
  month = nov,
  url =         {https://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
  url2 = {http://www.jilp.org/vol5/v5paper12.pdf},
  note = {http://www.jilp.org/vol5/},
  abstract = {Interpreters designed for high general-purpose
                  performance typically perform a large number of
                  indirect branches (3.2\%--13\% of all executed
                  instructions in our benchmarks). These branches
                  consume more than half of the run-time in a number
                  of configurations we simulated. We evaluate how
                  accurate various existing and proposed branch
                  prediction schemes are on a number of interpreters,
                  how the mispredictions affect the performance of the
                  interpreters and how two different interpreter
                  implementation techniques perform with various
                  branch predictors. We also suggest various ways in
                  which hardware designers, C compiler writers, and
                  interpreter writers can improve the performance of
                  interpreters.}
}

@InProceedings{rohou+15,
  author = {Erven Rohou and Bharath Narasimha Swamy and Andr\'e Seznec},
  title = {Branch Prediction and the Performance of
                  Interpreters --- Don't Trust Folklore},
  booktitle = {Code Generation and Optimization (CGO)},
  OPTpages = {},
  year = {2015},
  url = {https://hal.inria.fr/hal-01100647/document},
  annote = {Evaluates the indirect branch predictors of Nehalem,
                  Sandy Bridge, and Haswell and the ITTAGE predictor
                  on the interpreters of Python, Spidermonkey
                  (JavaScript), and a (.NET) CLI interpreter running a
                  number of benchmarks.  Haswell and ITTAGE are very
                  good at branch prediction for these benchmarks, and
                  they suggest that switch-based interpreters good
                  enough because of that.  However, my own results
                  \url{news:<2015Sep7.142507@mips.complang.tuwien.ac.at>}
                  show that shared dispatch branches still can produce
                  a significant slowdown.}
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Date Sujet#  Auteur
23 Oct 24 * Reverse engineering of Intel branch predictors34Thomas Koenig
23 Oct 24 +* Re: Reverse engineering of Intel branch predictors24MitchAlsup1
28 Oct 24 i`* Re: Reverse engineering of Intel branch predictors23Stefan Monnier
5 Nov 24 i `* Re: Reverse engineering of Intel branch predictors22MitchAlsup1
11 Nov 24 i  +* Re: Reverse engineering of Intel branch predictors2Thomas Koenig
11 Nov 24 i  i`- Re: Reverse engineering of Intel branch predictors1MitchAlsup1
11 Nov 24 i  `* Re: Reverse engineering of Intel branch predictors19Stefan Monnier
11 Nov 24 i   `* Re: Reverse engineering of Intel branch predictors18MitchAlsup1
12 Nov 24 i    `* Re: Reverse engineering of Intel branch predictors17Stefan Monnier
12 Nov 24 i     `* Re: Reverse engineering of Intel branch predictors16MitchAlsup1
12 Nov 24 i      +* Re: Reverse engineering of Intel branch predictors13Stefan Monnier
12 Nov 24 i      i`* Re: Reverse engineering of Intel branch predictors12MitchAlsup1
13 Nov 24 i      i +* Re: Reverse engineering of Intel branch predictors7Stefan Monnier
13 Nov 24 i      i i`* Re: Reverse engineering of Intel branch predictors6Terje Mathisen
13 Nov 24 i      i i `* Re: Reverse engineering of Intel branch predictors5Stefan Monnier
13 Nov 24 i      i i  `* Re: Reverse engineering of Intel branch predictors4Thomas Koenig
13 Nov 24 i      i i   +* Re: Reverse engineering of Intel branch predictors2Stefan Monnier
14 Nov 24 i      i i   i`- Re: Reverse engineering of Intel branch predictors1Thomas Koenig
14 Nov 24 i      i i   `- Interpreters and indirect-branch prediction (was: Reverse ...)1Anton Ertl
13 Nov 24 i      i `* Interpreters and indirect-branch prediction4Anton Ertl
13 Nov 24 i      i  `* Re: Interpreters and indirect-branch prediction3MitchAlsup1
13 Nov 24 i      i   `* Re: Interpreters and indirect-branch prediction2BGB
14 Nov 24 i      i    `- Re: Interpreters and indirect-branch prediction1BGB
13 Nov 24 i      `* Re: Reverse engineering of Intel branch predictors2Brett
13 Nov 24 i       `- Re: Reverse engineering of Intel branch predictors1MitchAlsup1
1 Nov 24 `* Re: Reverse engineering of Intel branch predictors9Waldek Hebisch
1 Nov 24  +- Re: Reverse engineering of Intel branch predictors1MitchAlsup1
5 Nov 24  `* Re: Reverse engineering of Intel branch predictors7Stefan Monnier
5 Nov 24   +- Re: Reverse engineering of Intel branch predictors1MitchAlsup1
8 Nov 24   `* Re: Reverse engineering of Intel branch predictors5Waldek Hebisch
8 Nov 24    +* Re: Reverse engineering of Intel branch predictors3MitchAlsup1
10 Nov 24    i`* Re: Reverse engineering of Intel branch predictors2Waldek Hebisch
10 Nov 24    i `- Re: Reverse engineering of Intel branch predictors1MitchAlsup1
11 Nov 24    `- Re: Reverse engineering of Intel branch predictors1Stefan Monnier

Haut de la page

Les messages affichés proviennent d'usenet.

NewsPortal