From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Retpoline cost
Date: Sun, 21 Mar 2021 16:00:39 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2021Mar21.170039@mips.complang.tuwien.ac.at>
References: <2021Mar20.232623@mips.complang.tuwien.ac.at>

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>The nice thing is that we can get our indirect branches replaced with
>retpolines with very little effort these days, by using the gcc
>options -mindirect-branch=thunk and -mfunction-return=thunk.  There
>are different options instead of "thunk", but the effect is unclear
>from the documentation.
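
For anyone wanting to repeat the comparison, the builds differ only in the CC string handed to configure. A minimal sketch in Python that assembles those command lines; the helper name `configure_command` and the `VARIANTS` list are made up for illustration, and using `keep` (gcc's default for both options) to stand for the build without retpolines is an assumption, not something stated in this thread.

```python
import shlex

# Values gcc accepts for -mindirect-branch= and -mfunction-return= on x86.
# "keep" is gcc's default and leaves indirect branches and returns unmodified;
# using it for the "no retpolines" build is an assumption of this sketch.
VARIANTS = ["keep", "thunk", "thunk-inline"]


def configure_command(variant, configure="../configure"):
    """Return a shell command line that configures a build for one variant."""
    cc = f"gcc -mindirect-branch={variant} -mfunction-return={variant}"
    # shlex.quote wraps the CC value in single quotes because it contains spaces.
    return f"{configure} CC={shlex.quote(cc)}"


if __name__ == "__main__":
    for variant in VARIANTS:
        print(configure_command(variant))
```

Each printed line can be run from a separate build directory, so that the resulting binaries can be timed against each other.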

I tried the "thunk-inline" option instead:

  ../configure CC="gcc -mindirect-branch=thunk-inline -mfunction-return=thunk-inline"

It gives significantly faster results (times in seconds):

 sieve bubble matrix   fib   fft
 0.095  0.089  0.039 0.063 0.023 gforth-fast no retpolines Ryzen 3900x
 0.230  0.210  0.081 0.370 0.175 gforth-fast thunk-inline Ryzen 3900x
 0.769  0.674  0.649 0.939 0.423 gforth-fast --no-dynamic thunk-inline 3900x
 0.780  0.663  0.647 0.923 0.416 gforth-fast thunk Ryzen 3900x
 0.092  0.124  0.052 0.080 0.032 gforth-fast no retpolines Pentium G4560
 0.384  0.352  0.120 0.624 0.304 gforth-fast thunk-inline Pentium G4560
 1.376  1.288  1.272 1.736 0.784 gforth-fast thunk Pentium G4560
 0.492  0.556  0.424 0.700 0.396 gforth-fast no retpolines Intel Atom 330

The reason for the performance difference between thunk-inline and
thunk is that thunk disables the dynamic superinstruction optimization
of Gforth, while thunk-inline does not; dynamic superinstructions
reduce the number of indirect branches performed by Gforth, typically
by a factor of 3, but in the case of matrix quite a bit more.  By
disabling dynamic superinstructions with the Gforth command-line
option --no-dynamic, we see that thunk-inline has a
per-indirect-branch cost that's similar to thunk.

A typical example of a retpoline from using these two options (for a
branch to the address in %rcx) is:

  0x000055acfcb19b87:  callq  0x55acfcb19b93
  0x000055acfcb19b8c:  pause
  0x000055acfcb19b8e:  lfence
  0x000055acfcb19b91:  jmp    0x55acfcb19b8c
  0x000055acfcb19b93:  mov    %rcx,(%rsp)
  0x000055acfcb19b97:  retq

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup,