mitchalsup@aol.com (MitchAlsup1) writes:
For something like a bytecode interpreter, the prediction accuracy
of the jump predictor is going to be epically bad.
And with "jump predictor", you probably mean the indirect-branch
predictor.
On hardware of up to around 2005 where the indirect branches were at
best predicted by BTBs with 2-bit counters, a "bytecode interpreter"
with one shared indirect branch (as generated by the usual
for...{switch...{...}} idiom in C) results in misprediction rates of
80% [ertl&gregg03jilp]. With one a non-shared indirect branch per VM
instruction implementation (as in typical threaded-code
implementations), we measured 50%-61% mispredictions for BTB with
2-bit counters in simulation [ertl&gregg03jilp] and similar results on
actual CPUs.
We simulated 2-level history-based indirect branch predictors
[ertl&gregg03jilp], and they typically have less than 10%
mispredictions.
Hardware designers improved indirect-branch predictors to deal with
more history than BTBs with 2-bit counters already in Dothan (launched
2004) and it's descendents (in particular, the Core 2 line), judging
by the results of <
http://www.complang.tuwien.ac.at/forth/threading/>
(look in the column "switch").
In 2015 Rohou et al. measured the prediction accuracy in Python,
JavaScript, and a CLI (.NET virtual machine, also known under other
names) on Nehalem (2009, Sandy Bridge (2011), Haswell (2013), and a
simulated TAGE1 branch predictor); unfortunately, they gave results in
mispredictions per (kilo-)instructions rather than mispredictions per
prediction and they used different interpreters, so you cannot compare
that work with our earlier work. In any case, they found good
improvements from Nehalem to Haswell.
Inspired by this work, I did some measurements of Gforth, disabling
various optimizations (which reduce the branch mispredictions) and
posted them here. You can find the sequence of postings here:
<
http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>.
However, I did not measure the prediction accuracy, because the idea
was to check whether the claims made in the paper (e.g., "Modern
predictors no longer need [superinstructions] to capture patterns in
applications." or "[prediction accuracy] can not be considered as an
obstacle for performance anymore.") are true.
In the Haswell results, using an interpreter with non-shared indirect
branches (non-shared --no-dynamic) does indeed often have a similar
prediction accuracy as the interpreter with one shared indirect branch
(shared --no-dynamic), but eliminating the sharing still provides a
small speedup; and dynamic replication (--no-super), which results in
having one indirect branch per instance of the VM instruction in the
interpreted program, i.e., no sharing between different instances of
the same VM instruction, provides a good reduction in branch
mispredictions and in execution time on several benchmarks. And given
that "non-shared --no-dynamic" and "--no-super" actually execute the
same number and sequence of instructions (only with replication in the
"--no-super" case), any improvement in run-time is only due to the
improved branch prediction; here I only show these two columns, to
avoid confusion; look at
<
http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>
for the full data set:
Run time in seconds:
non-shared
--no- --no-
dynamic super
2.440 2.276 benchgc
1.524 1.064 brainless
3.416 2.288 brew
3.236 2.232 cd16sim
2.684 1.484 fcp
9.848 9.280 lexex
For mispredictions the picture was as follows:
non-shared
--no- --no-
dynamic super
17,005,970 12,584,698 benchgc
178,234,917 51,375,412 brainless
279,851,195 47,928,843 brew
297,000,355 70,879,605 cd16sim
293,473,631 33,496,775 fcp
12,267,065 8,269,104 lexex
So the improvment in branch prediction provides speedup factors
between 1.06 and 1.81 on these benchmarks. Plus, the results for the
interpreters with a shared indirect branch (what the usual way of
writing a switch-based interpreter gives you) are even slower than for
the "non-shared --no-dynamic". So to paraphrase the title of Rohou et
al.'s paper, don't trust Rohou et al.
However, I expect that the indirect branch predictors have improved
since Haswell, though, and maybe we now see much lower numbers of
mispredictions even for "non-shared --no-dynamic" and for "shared
--no-dynamic", but I would have to measure that.
Anyway, back to Mitch Alsup's original claim: No, for a bytecode
interpreter the prediction accuracy of an indirect-branch predictor is
not necessarily going to be especially bad. It depends on how good
the indirect-branch predictor is, and on the benchmark.
@Article{ertl&gregg03jilp,
author = {M. Anton Ertl and David Gregg},
title = {The Structure and Performance of \emph{Efficient}
Interpreters},
journal = {The Journal of Instruction-Level Parallelism},
year = {2003},
volume = {5},
month = nov,
url = {
https://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
url2 = {
http://www.jilp.org/vol5/v5paper12.pdf},
note = {
http://www.jilp.org/vol5/},
abstract = {Interpreters designed for high general-purpose
performance typically perform a large number of
indirect branches (3.2\%--13\% of all executed
instructions in our benchmarks). These branches
consume more than half of the run-time in a number
of configurations we simulated. We evaluate how
accurate various existing and proposed branch
prediction schemes are on a number of interpreters,
how the mispredictions affect the performance of the
interpreters and how two different interpreter
implementation techniques perform with various
branch predictors. We also suggest various ways in
which hardware designers, C compiler writers, and
interpreter writers can improve the performance of
interpreters.}
}
@InProceedings{rohou+15,
author = {Erven Rohou and Bharath Narasimha Swamy and Andr\'e Seznec},
title = {Branch Prediction and the Performance of
Interpreters --- Don't Trust Folklore},
booktitle = {Code Generation and Optimization (CGO)},
OPTpages = {},
year = {2015},
url = {
https://hal.inria.fr/hal-01100647/document},
annote = {Evaluates the indirect branch predictors of Nehalem,
Sandy Bridge, and Haswell and the ITTAGE predictor
on the interpreters of Python, Spidermonkey
(JavaScript), and a (.NET) CLI interpreter running a
number of benchmarks. Haswell and ITTAGE are very
good at branch prediction for these benchmarks, and
they suggest that switch-based interpreters good
enough because of that. However, my own results
\url{news:<
2015Sep7.142507@mips.complang.tuwien.ac.at>}
show that shared dispatch branches still can produce
a significant slowdown.}
}
- anton
-- 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.' Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>