From: Brett <ggtgp@yahoo.com>
Newsgroups: comp.arch
Subject: Re: An execution time puzzle
Date: Mon, 10 Mar 2025 16:09:28 -0000 (UTC)
Organization: A noiseless patient Spider
Message-ID: <vqn2ro$1e8tr$1@dont-email.me>
References: <2025Mar10.083318@mips.complang.tuwien.ac.at> <2025Mar10.095420@mips.complang.tuwien.ac.at>

Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
> anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> I have the sequence
>>
>>  1   add    $0x8,%rbx
>>  2   sub    $0x8,%r13
>>  3   mov    %rbx,0x0(%r13)
>>  4   mov    %rdx,%rbx
>>  5   mov    (%rbx),%rax
>>  6   jmp    *%rax
>>  7   mov    %r8,%r15
>>  8   add    $0x10,%rbx
>>  9   mov    0x0(%r13),%rbx
>> 10   mov    -0x10(%r15),%rax
>> 11   mov    %r15,%rdx
>> 12   add    $0x8,%r13
>> 13   sub    $0x8,%rbx
>> 14   jmp    *%rax
>>
>> The contents of the registers and memory are such that the first jmp
>> continues at the next instruction in the sequence and the second jmp
>> continues at the top of the sequence.  I measure this sequence with
>> perf stat on a Zen4, terminating it with Ctrl-C, and get output like:
>>
>>     21969657501      cycles
>>     27996663866      instructions    #    1.27  insn per cycle
>>
>> I.e., about 11 cycles for the whole sequence of 14 instructions.  In
>> trying to understand where these 11 cycles come from, I asked
>> llvm-mca with
>>
>> cat xxx.s|llvm-mca-16 -mcpu=znver4 --iterations=1000
>>
>> and it tells me that it thinks that 1000 iterations take 2342 cycles:
>>
>> Iterations:        1000
>> Instructions:      14000
>> Total Cycles:      2342
>> Total uOps:        14000
>>
>> Dispatch Width:    6
>> uOps Per Cycle:    5.98
>> IPC:               5.98
>> Block RThroughput: 2.3
>>
>> So llvm-mca does not predict the actual performance correctly in this
>> case and I still have no explanation for the 11 cycles.
>
> Even more puzzling: In order to experiment with removing instructions
> I recreated this in assembly language:
>
>         .text
>         .globl main
> main:
>         mov $threaded, %rdx
>         mov $0, %rbx
>         mov $(returnstack+8),%r13
>         mov %rdx, %r8
> docol:
>         add    $0x8,%rbx
>         sub    $0x8,%r13
>         mov    %rbx,0x0(%r13)
>         mov    %rdx,%rbx
>         mov    (%rbx),%rax
>         jmp    *%rax
> outout:
>         mov    %r8,%r15
>         add    $0x10,%rbx
>         mov    0x0(%r13),%rbx
>         mov    -0x10(%r15),%rax
>         mov    %r15,%rdx
>         add    $0x8,%r13
>         sub    $0x8,%rbx
>         jmp    *%rax
>
>         .data
>         .quad docol
>         .quad 0
> threaded:
>         .quad outout
> returnstack:
>         .zero 16,0
>
> I assembled and linked this with:
>
> gcc xxx.s -Wl,-no-pie
>
> I ran the result with
>
> perf stat -e cycles -e instructions a.out
>
> terminated it with Ctrl-C and the result is:
>
>     10764822288      cycles
>     64556841216      instructions    #    6.00  insn per cycle
>
> I.e., as predicted by llvm-mca.  The main difference AFAICS is that in
> the slow version docol and outout are not adjacent, but far from each
> other, and returnstack is also not close to threaded (and the two
> 64-bit words before it that also belong to threaded).
>
> It looks like I have found a microarchitectural pitfall, but it's not
> clear what it is.
>
> - anton

How about giving us the original source code for the function? My x86
is rusty, and it helps to plug the source into Compiler Explorer to see
what different compilers do.
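
For reference, the per-iteration figures implied by the counter values quoted above can be checked with a few lines of Python. This is only an arithmetic sketch: the helper name cycles_per_iteration is made up for illustration, and it assumes that essentially all counted instructions belong to the 14-instruction loop.

```python
# Hypothetical helper (not from the thread): convert perf stat totals into
# cycles per pass over the 14-instruction sequence.
def cycles_per_iteration(cycles: int, instructions: int, insns_per_iter: int = 14) -> float:
    # Assumes nearly all counted instructions come from the loop itself.
    return cycles / instructions * insns_per_iter

# Counter values copied from the two perf stat runs quoted above.
slow = cycles_per_iteration(21_969_657_501, 27_996_663_866)  # original code
fast = cycles_per_iteration(10_764_822_288, 64_556_841_216)  # standalone xxx.s
mca = 2342 / 1000  # llvm-mca: total cycles / iterations

print(f"original sequence: {slow:.2f} cycles per iteration")
print(f"standalone xxx.s:  {fast:.2f} cycles per iteration")
print(f"llvm-mca estimate: {mca:.2f} cycles per iteration")
print(f"slowdown factor:   {slow / fast:.1f}x")
```

The standalone version lands close to the llvm-mca estimate, while the original sequence comes out at roughly 11 cycles, which matches the figures stated in the quoted post.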