Path: ...!eternal-september.org!feeder2.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Interpreters and indirect-branch prediction
Date: Wed, 13 Nov 2024 08:20:27 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 157
Message-ID: <2024Nov13.092027@mips.complang.tuwien.ac.at>
References: <vfbfn0$256vo$1@dont-email.me> <c517f562a19a0db2f3d945a1c56ee2e6@www.novabbs.org> <jwv1q002k2s.fsf-monnier+comp.arch@gnu.org> <a3d81b5c64ce058ad21f42a8081162cd@www.novabbs.org> <jwvcyj1sefl.fsf-monnier+comp.arch@gnu.org> <abef7481ff0dd5d832cef0b9d3ea087a@www.novabbs.org> <jwv1pzhsahr.fsf-monnier+comp.arch@gnu.org> <8928500a87002966d6282465c037003e@www.novabbs.org> <jwvpln0qpel.fsf-monnier+comp.arch@gnu.org> <18bfd4ce1048f68ad4d39d972c459e99@www.novabbs.org>
Injection-Date: Wed, 13 Nov 2024 11:19:34 +0100 (CET)
Injection-Info: dont-email.me; posting-host="9f14a901b6f5fa8b0359d5be2a9faa7f";
	logging-data="2275087"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+CjlIfsmvxHf9po+HA5h5a"
Cancel-Lock: sha1:QIqEYvsLgnj82peJTaMQvELCOpo=
X-newsreader: xrn 10.11
Bytes: 8936

mitchalsup@aol.com (MitchAlsup1) writes:
>For something like a bytecode interpreter, the prediction accuracy
>of the jump predictor is going to be epically bad.

And with "jump predictor", you probably mean the indirect-branch
predictor.

On hardware of up to around 2005 where the indirect branches were at
best predicted by BTBs with 2-bit counters, a "bytecode interpreter"
with one shared indirect branch (as generated by the usual
for...{switch...{...}} idiom in C) results in misprediction rates of
>80% [ertl&gregg03jilp].  With one a non-shared indirect branch per VM
instruction implementation (as in typical threaded-code
implementations), we measured 50%-61% mispredictions for BTB with
2-bit counters in simulation [ertl&gregg03jilp] and similar results on
actual CPUs.

We simulated 2-level history-based indirect branch predictors
[ertl&gregg03jilp], and they typically have less than 10%
mispredictions.

Hardware designers improved indirect-branch predictors to deal with
more history than BTBs with 2-bit counters already in Dothan (launched
2004) and it's descendents (in particular, the Core 2 line), judging
by the results of <http://www.complang.tuwien.ac.at/forth/threading/>
(look in the column "switch").

In 2015 Rohou et al. measured the prediction accuracy in Python,
JavaScript, and a CLI (.NET virtual machine, also known under other
names) on Nehalem (2009, Sandy Bridge (2011), Haswell (2013), and a
simulated TAGE1 branch predictor); unfortunately, they gave results in
mispredictions per (kilo-)instructions rather than mispredictions per
prediction and they used different interpreters, so you cannot compare
that work with our earlier work.  In any case, they found good
improvements from Nehalem to Haswell.

Inspired by this work, I did some measurements of Gforth, disabling
various optimizations (which reduce the branch mispredictions) and
posted them here.  You can find the sequence of postings here:
<http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>.
However, I did not measure the prediction accuracy, because the idea
was to check whether the claims made in the paper (e.g., "Modern
predictors no longer need [superinstructions] to capture patterns in
applications." or "[prediction accuracy] can not be considered as an
obstacle for performance anymore.") are true.

In the Haswell results, using an interpreter with non-shared indirect
branches (non-shared --no-dynamic) does indeed often have a similar
prediction accuracy as the interpreter with one shared indirect branch
(shared --no-dynamic), but eliminating the sharing still provides a
small speedup; and dynamic replication (--no-super), which results in
having one indirect branch per instance of the VM instruction in the
interpreted program, i.e., no sharing between different instances of
the same VM instruction, provides a good reduction in branch
mispredictions and in execution time on several benchmarks.  And given
that "non-shared --no-dynamic" and "--no-super" actually execute the
same number and sequence of instructions (only with replication in the
"--no-super" case), any improvement in run-time is only due to the
improved branch prediction; here I only show these two columns, to
avoid confusion; look at
<http://www.complang.tuwien.ac.at/anton/interpreter-branch-pred.txt>
for the full data set:

Run time in seconds:

non-shared
     --no-      --no-
   dynamic      super
     2.440      2.276 benchgc  
     1.524      1.064 brainless
     3.416      2.288 brew     
     3.236      2.232 cd16sim  
     2.684      1.484 fcp      
     9.848      9.280 lexex    

For mispredictions the picture was as follows:

 non-shared
      --no-       --no-
    dynamic       super
 17,005,970  12,584,698 benchgc  
178,234,917  51,375,412 brainless
279,851,195  47,928,843 brew     
297,000,355  70,879,605 cd16sim  
293,473,631  33,496,775 fcp      
 12,267,065   8,269,104 lexex    

So the improvment in branch prediction provides speedup factors
between 1.06 and 1.81 on these benchmarks.  Plus, the results for the
interpreters with a shared indirect branch (what the usual way of
writing a switch-based interpreter gives you) are even slower than for
the "non-shared --no-dynamic".  So to paraphrase the title of Rohou et
al.'s paper, don't trust Rohou et al.

However, I expect that the indirect branch predictors have improved
since Haswell, though, and maybe we now see much lower numbers of
mispredictions even for "non-shared --no-dynamic" and for "shared
--no-dynamic", but I would have to measure that.

Anyway, back to Mitch Alsup's original claim: No, for a bytecode
interpreter the prediction accuracy of an indirect-branch predictor is
not necessarily going to be especially bad.  It depends on how good
the indirect-branch predictor is, and on the benchmark.

@Article{ertl&gregg03jilp,
  author =	 {M. Anton Ertl and David Gregg},
  title =	 {The Structure and Performance of \emph{Efficient}
                  Interpreters},
  journal =	 {The Journal of Instruction-Level Parallelism},
  year =	 {2003},
  volume =	 {5},
  month =	 nov,
  url =         {https://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
  url2 =	 {http://www.jilp.org/vol5/v5paper12.pdf},
  note =	 {http://www.jilp.org/vol5/},
  abstract =	 {Interpreters designed for high general-purpose
                  performance typically perform a large number of
                  indirect branches (3.2\%--13\% of all executed
                  instructions in our benchmarks). These branches
                  consume more than half of the run-time in a number
                  of configurations we simulated. We evaluate how
                  accurate various existing and proposed branch
                  prediction schemes are on a number of interpreters,
                  how the mispredictions affect the performance of the
                  interpreters and how two different interpreter
                  implementation techniques perform with various
                  branch predictors. We also suggest various ways in
                  which hardware designers, C compiler writers, and
                  interpreter writers can improve the performance of
                  interpreters.}
}

@InProceedings{rohou+15,
  author = 	 {Erven Rohou and Bharath Narasimha Swamy and Andr\'e Seznec},
  title = 	 {Branch Prediction and the Performance of
                  Interpreters --- Don't Trust Folklore},
  booktitle =	 {Code Generation and Optimization (CGO)},
  OPTpages = 	 {},
  year =	 {2015},
  url = 	 {https://hal.inria.fr/hal-01100647/document},
  annote =	 {Evaluates the indirect branch predictors of Nehalem,
                  Sandy Bridge, and Haswell and the ITTAGE predictor
                  on the interpreters of Python, Spidermonkey
                  (JavaScript), and a (.NET) CLI interpreter running a
                  number of benchmarks.  Haswell and ITTAGE are very
                  good at branch prediction for these benchmarks, and
                  they suggest that switch-based interpreters good
                  enough because of that.  However, my own results
                  \url{news:<2015Sep7.142507@mips.complang.tuwien.ac.at>}
                  show that shared dispatch branches still can produce
                  a significant slowdown.}
}

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>