Path: ...!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Retpoline cost
Date: Sun, 21 Mar 2021 16:00:39 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 46
Message-ID: <2021Mar21.170039@mips.complang.tuwien.ac.at>
References: <2021Mar20.232623@mips.complang.tuwien.ac.at>
Injection-Info: reader02.eternal-september.org; posting-host="fb6eec1f2ee117b2cc0bba2859b93fff";
	logging-data="19472"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+NM1/QM18L2Ud3x3u3wiBk"
Cancel-Lock: sha1:rr9OAQiuI1jiCPhZae5afF1vFe0=
X-newsreader: xrn 10.00-beta-3
Bytes: 3043

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>The nice thing is that we can get our indirect branches replaced with
>retpolines with very little effort these days, by using the gcc
>options -mindirect-branch=thunk and -mfunction-return=thunk.  There
>are different options instead of "thunk", but the effect is unclear
>from the documentation.

I tried the "thunk-inline" option instead:

../configure CC="gcc -mindirect-branch=thunk-inline -mfunction-return=thunk-inline"

It gives significantly faster results than thunk (times in seconds):

sieve bubble matrix   fib   fft
0.095  0.089  0.039 0.063 0.023 gforth-fast no retpolines Ryzen 3900x
0.230  0.210  0.081 0.370 0.175 gforth-fast thunk-inline  Ryzen 3900x
0.769  0.674  0.649 0.939 0.423 gforth-fast --no-dynamic thunk-inline 3900x
0.780  0.663  0.647 0.923 0.416 gforth-fast thunk         Ryzen 3900x
0.092  0.124  0.052 0.080 0.032 gforth-fast no retpolines Pentium G4560
0.384  0.352  0.120 0.624 0.304 gforth-fast thunk-inline  Pentium G4560
1.376  1.288  1.272 1.736 0.784 gforth-fast thunk         Pentium G4560
0.492  0.556  0.424 0.700 0.396 gforth-fast no retpolines Intel Atom 330
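
To make the table easier to read, here is a small sketch that turns the
times into slowdown factors relative to the "no retpolines" runs on the
same machine.  The numbers are copied from the table; the names TIMES
and slowdown() are only illustrative.  On the 3900x this gives roughly
2x to 8x for thunk-inline and roughly 7x to 18x for thunk.

```python
# Slowdown factors relative to "no retpolines", computed from the table above.
# Times are in seconds, copied from the post; all names here are illustrative.
BENCHMARKS = ["sieve", "bubble", "matrix", "fib", "fft"]

TIMES = {
    ("Ryzen 3900x", "no retpolines"):   [0.095, 0.089, 0.039, 0.063, 0.023],
    ("Ryzen 3900x", "thunk-inline"):    [0.230, 0.210, 0.081, 0.370, 0.175],
    ("Ryzen 3900x", "thunk"):           [0.780, 0.663, 0.647, 0.923, 0.416],
    ("Pentium G4560", "no retpolines"): [0.092, 0.124, 0.052, 0.080, 0.032],
    ("Pentium G4560", "thunk-inline"):  [0.384, 0.352, 0.120, 0.624, 0.304],
    ("Pentium G4560", "thunk"):         [1.376, 1.288, 1.272, 1.736, 0.784],
}


def slowdown(machine, config):
    """Per-benchmark time ratio of `config` to the no-retpoline baseline."""
    base = TIMES[(machine, "no retpolines")]
    return [t / b for t, b in zip(TIMES[(machine, config)], base)]


# Print one row of factors per machine/configuration pair.
for machine in ("Ryzen 3900x", "Pentium G4560"):
    for config in ("thunk-inline", "thunk"):
        row = " ".join(
            f"{name}={factor:.1f}x"
            for name, factor in zip(BENCHMARKS, slowdown(machine, config))
        )
        print(f"{machine:14s} {config:13s} {row}")
```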

The reason for the performance difference between thunk-inline and
thunk is that thunk disables the dynamic superinstruction optimization
of Gforth, while thunk-inline does not; dynamic superinstructions
reduce the number of indirect branches performed by Gforth, typically
by a factor of 3, but in the case of matrix quite a bit more.  By
disabling dynamic superinstructions with the Gforth command-line
option --no-dynamic, we see that thunk-inline has a per-indirect-branch
cost that's similar to thunk.
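
As a rough check of that claim, the following sketch compares the three
retpoline rows for the 3900x from the table.  The variable names are
only illustrative.  The first ratio (thunk-inline with --no-dynamic
versus thunk) comes out within 2% of 1 for all five benchmarks.  The
second ratio (thunk-inline with --no-dynamic versus plain thunk-inline)
is a ratio of run times, not of indirect-branch counts, so it is only
an indirect indication of the factor mentioned above; it is largest for
matrix.

```python
# Times in seconds on the Ryzen 3900x, copied from the table in the post.
# All names are illustrative.
BENCHMARKS = ["sieve", "bubble", "matrix", "fib", "fft"]
THUNK_INLINE            = [0.230, 0.210, 0.081, 0.370, 0.175]
THUNK_INLINE_NO_DYNAMIC = [0.769, 0.674, 0.649, 0.939, 0.423]
THUNK                   = [0.780, 0.663, 0.647, 0.923, 0.416]


def ratios(a, b):
    """Element-wise a/b for two equally long lists of times."""
    return [x / y for x, y in zip(a, b)]


# Close to 1 if both variants pay a similar cost per indirect branch.
no_dynamic_vs_thunk = ratios(THUNK_INLINE_NO_DYNAMIC, THUNK)

# Run-time ratio with and without dynamic superinstructions (not a
# branch count; non-branch work is included in both times).
no_dynamic_vs_dynamic = ratios(THUNK_INLINE_NO_DYNAMIC, THUNK_INLINE)

print("bench   nodyn/thunk  nodyn/dyn")
for name, r1, r2 in zip(BENCHMARKS, no_dynamic_vs_thunk, no_dynamic_vs_dynamic):
    print(f"{name:7s} {r1:11.3f} {r2:10.2f}")
```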

A typical example of a retpoline generated with these two options (for
an indirect branch to the address in %rcx) is:

0x000055acfcb19b87:      callq  0x55acfcb19b93
0x000055acfcb19b8c:      pause  
0x000055acfcb19b8e:      lfence 
0x000055acfcb19b91:      jmp    0x55acfcb19b8c
0x000055acfcb19b93:      mov    %rcx,(%rsp)
0x000055acfcb19b97:      retq   

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>