From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Retpoline cost
Date: Sat, 20 Mar 2021 22:26:23 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2021Mar20.232623@mips.complang.tuwien.ac.at>

Retpolines are a workaround for Spectre v2: they replace indirect
branches by returns to the same address, so they are predicted not by
the indirect-branch predictor, but by the return predictor stack; this
always results in a misprediction, but it avoids the scenario where
the attacker trains the predictor to predict a jump to code (a
"gadget") that can reveal interesting data through side channels.

The nice thing is that we can get our indirect branches replaced with
retpolines with very little effort these days, by using the gcc
options -mindirect-branch=thunk and -mfunction-return=thunk.  Other
values can be given instead of "thunk", but their effect is unclear
from the documentation.  The function-return option works around a
shortcoming of Skylake, where running out of return stack falls back
to the indirect branch predictor.

So I wanted to see how much performance retpolines cost.
I built gforth with the default settings (no retpolines) and with

  ../configure CC="gcc -mindirect-branch=thunk -mfunction-return=thunk"

The results (times in seconds) are as follows:

 sieve bubble matrix   fib   fft
 0.095  0.089  0.039 0.063 0.023 gforth-fast no retpolines on Ryzen 3900X
 0.780  0.663  0.647 0.923 0.416 gforth-fast with retpolines on Ryzen 3900X
 0.092  0.124  0.052 0.080 0.032 gforth-fast no retpolines on Pentium G4560
 1.376  1.288  1.272 1.736 0.784 gforth-fast with retpolines on Pentium G4560
 0.492  0.556  0.424 0.700 0.396 gforth-fast no retpolines on Intel Atom 330

I cannot do runs on the Atom 330 at the moment, so the results on the
Atom 330 are older.  Ryzen 3900X is a Zen 2, Pentium G4560 is a Skylake
(actually a Kaby Lake).

In any case, we see that retpolines slow Gforth down a lot, sometimes
by more than a factor of 20 (e.g., matrix on the Pentium G4560: 1.272s
vs. 0.052s), and as a result Gforth with retpolines on these CPUs is
always slower than Gforth on the in-order Atom 330 (which needs no
such workaround).

Admittedly, Gforth is an extreme case: It has a very high proportion
of indirect branches [ertl&gregg03jilp].  Some other interpreters are
not far behind, though, and there are also object-oriented programs
that do a lot of indirect branching, so such slowdowns, while probably
less extreme, would affect many programs if people actually use
retpolines.

@Article{ertl&gregg03jilp,
  author =       {M. Anton Ertl and David Gregg},
  title =        {The Structure and Performance of \emph{Efficient} Interpreters},
  journal =      {The Journal of Instruction-Level Parallelism},
  year =         {2003},
  volume =       {5},
  month =        nov,
  url =          {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
  url2 =         {http://www.jilp.org/vol5/v5paper12.pdf},
  note =         {http://www.jilp.org/vol5/},
  abstract =     {Interpreters designed for high general-purpose
                  performance typically perform a large number of
                  indirect branches (3.2\%--13\% of all executed
                  instructions in our benchmarks). These branches
                  consume more than half of the run-time in a number
                  of configurations we simulated.
                  We evaluate how accurate various existing and
                  proposed branch prediction schemes are on a number
                  of interpreters, how the mispredictions affect the
                  performance of the interpreters and how two
                  different interpreter implementation techniques
                  perform with various branch predictors. We also
                  suggest various ways in which hardware designers, C
                  compiler writers, and interpreter writers can
                  improve the performance of interpreters.}
}

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup,