Path: ...!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Retpoline cost
Date: Sat, 20 Mar 2021 22:26:23 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 74
Message-ID: <2021Mar20.232623@mips.complang.tuwien.ac.at>
Injection-Info: reader02.eternal-september.org; posting-host="fb6eec1f2ee117b2cc0bba2859b93fff";
logging-data="16198"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18FfND7k1zlWHzn0eTgNK4Z"
Cancel-Lock: sha1:SjHR+EmyBeJt+kqPlZ4OEyiBXZc=
X-newsreader: xrn 10.00-beta-3
Bytes: 4632
Retpolines are a workaround for Spectre v2: they replace indirect
branches with returns to the same address, so the branches are
predicted not by the indirect-branch predictor, but by the return
predictor stack; this always results in a misprediction, but it avoids
the scenario where the attacker trains the predictor to predict a jump
to code (a "gadget") that can reveal interesting data through side
channels.
The nice thing is that we can get our indirect branches replaced with
retpolines with very little effort these days, by using the gcc
options -mindirect-branch=thunk and -mfunction-return=thunk. Other
values can be given instead of "thunk", but their effect is unclear
from the documentation. The function-return option works around a
shortcoming of Skylake, where running out of return-stack entries
makes the hardware fall back to the indirect branch predictor.
So I wanted to see how much performance retpolines cost. I built
gforth with the default settings (no retpolines) and with
../configure CC="gcc -mindirect-branch=thunk -mfunction-return=thunk"
The results (times in seconds) are as follows:
sieve bubble matrix   fib   fft
0.095  0.089  0.039 0.063 0.023 gforth-fast no retpolines   on Ryzen 3900X
0.780  0.663  0.647 0.923 0.416 gforth-fast with retpolines on Ryzen 3900X
0.092  0.124  0.052 0.080 0.032 gforth-fast no retpolines   on Pentium G4560
1.376  1.288  1.272 1.736 0.784 gforth-fast with retpolines on Pentium G4560
0.492  0.556  0.424 0.700 0.396 gforth-fast no retpolines   on Intel Atom 330
I cannot do runs on the Atom 330 at the moment, so the results on Atom
330 are older. Ryzen 3900X is a Zen 2, Pentium G4560 is a Skylake
(actually a Kaby Lake).
In any case, we see that retpolines slow Gforth down a lot, sometimes
by more than a factor of 20, and as a result it is always slower than
on the in-order Atom 330 (which needs no such workaround). Admittedly,
Gforth is an extreme case: It has a very high proportion of indirect
branches [ertl&gregg03jilp]. Some other interpreters are not far
behind, though, and there are also object-oriented programs that do a
lot of indirect branching, so such slowdowns, while probably less
extreme, will affect many programs if people actually use retpolines.
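As a rough check of the factors mentioned above, the slowdowns can be
recomputed from the table. The following is only a minimal Python
sketch; the names PLAIN, RETPOLINE, ATOM_330, slowdown and
slower_than_atom are made up for illustration, and the numbers are
copied from the table in this posting.

```python
# Times in seconds, copied from the table in this posting.
BENCHMARKS = ["sieve", "bubble", "matrix", "fib", "fft"]
PLAIN = {  # gforth-fast built without retpolines
    "Ryzen 3900X":   [0.095, 0.089, 0.039, 0.063, 0.023],
    "Pentium G4560": [0.092, 0.124, 0.052, 0.080, 0.032],
}
RETPOLINE = {  # gforth-fast built with retpolines
    "Ryzen 3900X":   [0.780, 0.663, 0.647, 0.923, 0.416],
    "Pentium G4560": [1.376, 1.288, 1.272, 1.736, 0.784],
}
ATOM_330 = [0.492, 0.556, 0.424, 0.700, 0.396]  # no retpolines needed


def slowdown(cpu):
    """Per-benchmark factor: time with retpolines / time without."""
    return [r / p for p, r in zip(PLAIN[cpu], RETPOLINE[cpu])]


def slower_than_atom(cpu):
    """True if every retpoline run on cpu takes longer than on the Atom 330."""
    return all(r > a for r, a in zip(RETPOLINE[cpu], ATOM_330))


if __name__ == "__main__":
    for cpu in PLAIN:
        factors = " ".join(
            f"{b}={f:.1f}x" for b, f in zip(BENCHMARKS, slowdown(cpu))
        )
        print(f"{cpu}: {factors}; slower than Atom 330: {slower_than_atom(cpu)}")
```

With these numbers the factors range from roughly 7 to 25, and the
largest ones (above 20) occur on the Pentium G4560.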
@Article{ertl&gregg03jilp,
author = {M. Anton Ertl and David Gregg},
title = {The Structure and Performance of \emph{Efficient}
Interpreters},
journal = {The Journal of Instruction-Level Parallelism},
year = {2003},
volume = {5},
month = nov,
url = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
url2 = {http://www.jilp.org/vol5/v5paper12.pdf},
note = {http://www.jilp.org/vol5/},
abstract = {Interpreters designed for high general-purpose
performance typically perform a large number of
indirect branches (3.2\%--13\% of all executed
instructions in our benchmarks). These branches
consume more than half of the run-time in a number
of configurations we simulated. We evaluate how
accurate various existing and proposed branch
prediction schemes are on a number of interpreters,
how the mispredictions affect the performance of the
interpreters and how two different interpreter
implementation techniques perform with various
branch predictors. We also suggest various ways in
which hardware designers, C compiler writers, and
interpreter writers can improve the performance of
interpreters.}
}
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup,