Path: ...!eternal-september.org!feeder2.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail From: anton@mips.complang.tuwien.ac.at (Anton Ertl) Newsgroups: comp.arch Subject: Interpreters and indirect-branch prediction Date: Wed, 13 Nov 2024 08:20:27 GMT Organization: Institut fuer Computersprachen, Technische Universitaet Wien Lines: 157 Message-ID: <2024Nov13.092027@mips.complang.tuwien.ac.at> References: <8928500a87002966d6282465c037003e@www.novabbs.org> <18bfd4ce1048f68ad4d39d972c459e99@www.novabbs.org> Injection-Date: Wed, 13 Nov 2024 11:19:34 +0100 (CET) Injection-Info: dont-email.me; posting-host="9f14a901b6f5fa8b0359d5be2a9faa7f"; logging-data="2275087"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+CjlIfsmvxHf9po+HA5h5a" Cancel-Lock: sha1:QIqEYvsLgnj82peJTaMQvELCOpo= X-newsreader: xrn 10.11 Bytes: 8936 mitchalsup@aol.com (MitchAlsup1) writes: >For something like a bytecode interpreter, the prediction accuracy >of the jump predictor is going to be epically bad. And with "jump predictor", you probably mean the indirect-branch predictor. On hardware of up to around 2005 where the indirect branches were at best predicted by BTBs with 2-bit counters, a "bytecode interpreter" with one shared indirect branch (as generated by the usual for...{switch...{...}} idiom in C) results in misprediction rates of >80% [ertl&gregg03jilp]. With one a non-shared indirect branch per VM instruction implementation (as in typical threaded-code implementations), we measured 50%-61% mispredictions for BTB with 2-bit counters in simulation [ertl&gregg03jilp] and similar results on actual CPUs. We simulated 2-level history-based indirect branch predictors [ertl&gregg03jilp], and they typically have less than 10% mispredictions. Hardware designers improved indirect-branch predictors to deal with more history than BTBs with 2-bit counters already in Dothan (launched 2004) and it's descendents (in particular, the Core 2 line), judging by the results of (look in the column "switch"). In 2015 Rohou et al. measured the prediction accuracy in Python, JavaScript, and a CLI (.NET virtual machine, also known under other names) on Nehalem (2009, Sandy Bridge (2011), Haswell (2013), and a simulated TAGE1 branch predictor); unfortunately, they gave results in mispredictions per (kilo-)instructions rather than mispredictions per prediction and they used different interpreters, so you cannot compare that work with our earlier work. In any case, they found good improvements from Nehalem to Haswell. Inspired by this work, I did some measurements of Gforth, disabling various optimizations (which reduce the branch mispredictions) and posted them here. You can find the sequence of postings here: . However, I did not measure the prediction accuracy, because the idea was to check whether the claims made in the paper (e.g., "Modern predictors no longer need [superinstructions] to capture patterns in applications." or "[prediction accuracy] can not be considered as an obstacle for performance anymore.") are true. In the Haswell results, using an interpreter with non-shared indirect branches (non-shared --no-dynamic) does indeed often have a similar prediction accuracy as the interpreter with one shared indirect branch (shared --no-dynamic), but eliminating the sharing still provides a small speedup; and dynamic replication (--no-super), which results in having one indirect branch per instance of the VM instruction in the interpreted program, i.e., no sharing between different instances of the same VM instruction, provides a good reduction in branch mispredictions and in execution time on several benchmarks. And given that "non-shared --no-dynamic" and "--no-super" actually execute the same number and sequence of instructions (only with replication in the "--no-super" case), any improvement in run-time is only due to the improved branch prediction; here I only show these two columns, to avoid confusion; look at for the full data set: Run time in seconds: non-shared --no- --no- dynamic super 2.440 2.276 benchgc 1.524 1.064 brainless 3.416 2.288 brew 3.236 2.232 cd16sim 2.684 1.484 fcp 9.848 9.280 lexex For mispredictions the picture was as follows: non-shared --no- --no- dynamic super 17,005,970 12,584,698 benchgc 178,234,917 51,375,412 brainless 279,851,195 47,928,843 brew 297,000,355 70,879,605 cd16sim 293,473,631 33,496,775 fcp 12,267,065 8,269,104 lexex So the improvment in branch prediction provides speedup factors between 1.06 and 1.81 on these benchmarks. Plus, the results for the interpreters with a shared indirect branch (what the usual way of writing a switch-based interpreter gives you) are even slower than for the "non-shared --no-dynamic". So to paraphrase the title of Rohou et al.'s paper, don't trust Rohou et al. However, I expect that the indirect branch predictors have improved since Haswell, though, and maybe we now see much lower numbers of mispredictions even for "non-shared --no-dynamic" and for "shared --no-dynamic", but I would have to measure that. Anyway, back to Mitch Alsup's original claim: No, for a bytecode interpreter the prediction accuracy of an indirect-branch predictor is not necessarily going to be especially bad. It depends on how good the indirect-branch predictor is, and on the benchmark. @Article{ertl&gregg03jilp, author = {M. Anton Ertl and David Gregg}, title = {The Structure and Performance of \emph{Efficient} Interpreters}, journal = {The Journal of Instruction-Level Parallelism}, year = {2003}, volume = {5}, month = nov, url = {https://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz}, url2 = {http://www.jilp.org/vol5/v5paper12.pdf}, note = {http://www.jilp.org/vol5/}, abstract = {Interpreters designed for high general-purpose performance typically perform a large number of indirect branches (3.2\%--13\% of all executed instructions in our benchmarks). These branches consume more than half of the run-time in a number of configurations we simulated. We evaluate how accurate various existing and proposed branch prediction schemes are on a number of interpreters, how the mispredictions affect the performance of the interpreters and how two different interpreter implementation techniques perform with various branch predictors. We also suggest various ways in which hardware designers, C compiler writers, and interpreter writers can improve the performance of interpreters.} } @InProceedings{rohou+15, author = {Erven Rohou and Bharath Narasimha Swamy and Andr\'e Seznec}, title = {Branch Prediction and the Performance of Interpreters --- Don't Trust Folklore}, booktitle = {Code Generation and Optimization (CGO)}, OPTpages = {}, year = {2015}, url = {https://hal.inria.fr/hal-01100647/document}, annote = {Evaluates the indirect branch predictors of Nehalem, Sandy Bridge, and Haswell and the ITTAGE predictor on the interpreters of Python, Spidermonkey (JavaScript), and a (.NET) CLI interpreter running a number of benchmarks. Haswell and ITTAGE are very good at branch prediction for these benchmarks, and they suggest that switch-based interpreters good enough because of that. However, my own results \url{news:<2015Sep7.142507@mips.complang.tuwien.ac.at>} show that shared dispatch branches still can produce a significant slowdown.} } - anton -- 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.' Mitch Alsup,