Path: ...!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Retpoline cost
Date: Sat, 20 Mar 2021 22:26:23 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 74
Message-ID: <2021Mar20.232623@mips.complang.tuwien.ac.at>
Injection-Info: reader02.eternal-september.org; posting-host="fb6eec1f2ee117b2cc0bba2859b93fff";
	logging-data="16198"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18FfND7k1zlWHzn0eTgNK4Z"
Cancel-Lock: sha1:SjHR+EmyBeJt+kqPlZ4OEyiBXZc=
X-newsreader: xrn 10.00-beta-3
Bytes: 4632

Retpolines are a workaround for Spectre v2: they replace indirect
branches by returns to the same address, so they are not predicted by
indirect-branch predictors, but by the return predictor stack; this
always results in a misprediction, but it avoids the scenario where
the attacker trains the predictor to predict a jump to code (a
"gadget") that can reveal interesting data through side channels.

The nice thing is that we can get our indirect branches replaced with
retpolines with very little effort these days, by using the gcc
options -mindirect-branch=thunk and -mfunction-return=thunk.  Other
values can be given instead of "thunk", but their effect is unclear
from the documentation.  The function-return option works around a
shortcoming of Skylake, where running out of return-stack entries
falls back to the indirect branch predictor.

So I wanted to see how much performance retpolines cost.  I built
gforth with the default options (no retpolines) and with

../configure CC="gcc -mindirect-branch=thunk -mfunction-return=thunk"

The results (times in seconds) are as follows:

 sieve bubble matrix   fib   fft
 0.095  0.089  0.039 0.063 0.023 gforth-fast   no retpolines on Ryzen 3900X
 0.780  0.663  0.647 0.923 0.416 gforth-fast with retpolines on Ryzen 3900X
 0.092  0.124  0.052 0.080 0.032 gforth-fast   no retpolines on Pentium G4560
 1.376  1.288  1.272 1.736 0.784 gforth-fast with retpolines on Pentium G4560
 0.492  0.556  0.424 0.700 0.396 gforth-fast   no retpolines auf Intel Atom 330

I cannot do runs on the Atom 330 at the moment, so the results on Atom
330 are older.  Ryzen 3900X is a Zen 2, Pentium G4560 is a Skylake
(actually a Kaby Lake).

In any case, we see that retpolines slow Gforth down a lot, sometimes
by more than a factor of 20, and as a result it is always slower than
on the in-order Atom 330 (which needs no such workaround).  Admittedly,
Gforth is an extreme case: It has a very high proportion of indirect
branches [ertl&gregg03jilp].  Some other interpreters are not far
behind, though, and there are also object-oriented programs that do a
lot of indirect branching, so such slowdowns, while probably less
extreme, will affect many programs if people actually use retpolines.
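
For those who want to check the factors, here is a small Python
sketch that recomputes the per-benchmark slowdowns from the times in
the table and compares the retpoline runs with the Atom 330.  The
dictionary layout and the function names (slowdowns,
slower_than_atom) are only chosen for this illustration; the numbers
are copied from the table.

```python
# Times in seconds, copied from the table in this posting.
benchmarks = ["sieve", "bubble", "matrix", "fib", "fft"]
times = {
    "Ryzen 3900X": {
        "plain":     [0.095, 0.089, 0.039, 0.063, 0.023],
        "retpoline": [0.780, 0.663, 0.647, 0.923, 0.416],
    },
    "Pentium G4560": {
        "plain":     [0.092, 0.124, 0.052, 0.080, 0.032],
        "retpoline": [1.376, 1.288, 1.272, 1.736, 0.784],
    },
}
atom_330 = [0.492, 0.556, 0.424, 0.700, 0.396]  # no retpolines


def slowdowns(machine):
    """Return retpoline time / plain time for each benchmark (illustrative helper)."""
    t = times[machine]
    return [r / p for p, r in zip(t["plain"], t["retpoline"])]


def slower_than_atom(machine):
    """True if every retpoline run on this machine is slower than the Atom 330 run."""
    return all(r > a for r, a in zip(times[machine]["retpoline"], atom_330))


for machine in times:
    factors = ", ".join(
        f"{name} {f:.1f}x" for name, f in zip(benchmarks, slowdowns(machine))
    )
    print(f"{machine}: {factors}")
    print(f"  always slower than Atom 330: {slower_than_atom(machine)}")
```

This only reproduces the arithmetic behind the claims above; it says
nothing about run-to-run variation, which the table does not report.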

@Article{ertl&gregg03jilp,
  author =	 {M. Anton Ertl and David Gregg},
  title =	 {The Structure and Performance of \emph{Efficient}
                  Interpreters},
  journal =	 {The Journal of Instruction-Level Parallelism},
  year =	 {2003},
  volume =	 {5},
  month =	 nov,
  url =         {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
  url2 =	 {http://www.jilp.org/vol5/v5paper12.pdf},
  note =	 {http://www.jilp.org/vol5/},
  abstract =	 {Interpreters designed for high general-purpose
                  performance typically perform a large number of
                  indirect branches (3.2\%--13\% of all executed
                  instructions in our benchmarks). These branches
                  consume more than half of the run-time in a number
                  of configurations we simulated. We evaluate how
                  accurate various existing and proposed branch
                  prediction schemes are on a number of interpreters,
                  how the mispredictions affect the performance of the
                  interpreters and how two different interpreter
                  implementation techniques perform with various
                  branch predictors. We also suggest various ways in
                  which hardware designers, C compiler writers, and
                  interpreter writers can improve the performance of
                  interpreters.}
}

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>