From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: An execution time puzzle
Date: Mon, 10 Mar 2025 08:54:20 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Mar10.095420@mips.complang.tuwien.ac.at>
References: <2025Mar10.083318@mips.complang.tuwien.ac.at>

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>I have the sequence
>
> 1   add    $0x8,%rbx
> 2   sub    $0x8,%r13
> 3   mov    %rbx,0x0(%r13)
> 4   mov    %rdx,%rbx
> 5   mov    (%rbx),%rax
> 6   jmp    *%rax
> 7   mov    %r8,%r15
> 8   add    $0x10,%rbx
> 9   mov    0x0(%r13),%rbx
>10   mov    -0x10(%r15),%rax
>11   mov    %r15,%rdx
>12   add    $0x8,%r13
>13   sub    $0x8,%rbx
>14   jmp    *%rax
>
>The contents of the registers and memory are such that the first jmp
>continues at the next instruction in the sequence and the second jmp
>continues at the top of the sequence.  I measure this sequence with
>perf stat on a Zen4, terminating it with Ctrl-C, and get output like:
>
>    21969657501      cycles
>    27996663866      instructions   #  1.27  insn per cycle
>
>I.e., about 11 cycles for the whole sequence of 14 instructions.  In
>trying to understand where these 11 cycles come from, I asked
>llvm-mca with
>
>cat xxx.s|llvm-mca-16 -mcpu=znver4 --iterations=1000
>
>and it tells me that it thinks that 1000 iterations take 2342 cycles:
>
>Iterations:        1000
>Instructions:      14000
>Total Cycles:      2342
>Total uOps:        14000
>
>Dispatch Width:    6
>uOps Per Cycle:    5.98
>IPC:               5.98
>Block RThroughput: 2.3
>
>So llvm-mca does not predict the actual performance correctly in this
>case and I still have no explanation for the 11 cycles.

Even more puzzling: In order to experiment with removing instructions,
I recreated this in assembly language:

        .text
        .globl main
main:
        mov $threaded, %rdx
        mov $0, %rbx
        mov $(returnstack+8),%r13
        mov %rdx, %r8
docol:
        add $0x8,%rbx
        sub $0x8,%r13
        mov %rbx,0x0(%r13)
        mov %rdx,%rbx
        mov (%rbx),%rax
        jmp *%rax
outout:
        mov %r8,%r15
        add $0x10,%rbx
        mov 0x0(%r13),%rbx
        mov -0x10(%r15),%rax
        mov %r15,%rdx
        add $0x8,%r13
        sub $0x8,%rbx
        jmp *%rax
        .data
        .quad docol
        .quad 0
threaded:
        .quad outout
returnstack:
        .zero 16,0

I assembled and linked this with:

gcc xxx.s -Wl,-no-pie

I ran the result with

perf stat -e cycles -e instructions a.out

terminated it with Ctrl-C, and the result is:

    10764822288      cycles
    64556841216      instructions   #  6.00  insn per cycle

I.e., about 2.33 cycles for the 14-instruction sequence, as predicted
by llvm-mca.  The main difference AFAICS is that in the slow version
docol and outout are not adjacent, but far from each other, and
returnstack is also not close to threaded (and the two 64-bit words
before it that also belong to threaded).

It looks like I have found a microarchitectural pitfall, but it's not
clear what it is.

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
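One way to probe the layout difference named in the post might be to take
the same test program and add padding, so that docol and outout are no
longer adjacent and returnstack no longer sits right behind threaded.
The following is only a sketch under assumptions: the symbols CODE_GAP
and DATA_GAP are hypothetical names introduced here, and their values are
arbitrary, not the distances of the slow case (which the post does not
give).  Whether either gap reproduces the 11-cycle behaviour is exactly
what such an experiment would have to show.

```asm
        # Assumption: gap sizes are arbitrary illustration values.
        .set CODE_GAP, 4096     # hypothetical: bytes between docol and outout
        .set DATA_GAP, 4096     # hypothetical: bytes between threaded and returnstack

        .text
        .globl main
main:
        mov $threaded, %rdx
        mov $0, %rbx
        mov $(returnstack+8),%r13
        mov %rdx, %r8
docol:
        add $0x8,%rbx
        sub $0x8,%r13
        mov %rbx,0x0(%r13)      # store to returnstack
        mov %rdx,%rbx
        mov (%rbx),%rax         # load from threaded, yields outout
        jmp *%rax
        .skip CODE_GAP          # padding, never executed: control leaves via the jmp above
outout:
        mov %r8,%r15
        add $0x10,%rbx
        mov 0x0(%r13),%rbx      # load from returnstack
        mov -0x10(%r15),%rax    # load from threaded-16, yields docol
        mov %r15,%rdx
        add $0x8,%r13
        sub $0x8,%rbx
        jmp *%rax

        .data
        .quad docol
        .quad 0
threaded:
        .quad outout
        .skip DATA_GAP          # padding between threaded and returnstack
returnstack:
        .zero 16,0
```

This should build and run with the same gcc and perf stat commands as the
original test program.  Setting one gap to 0 at a time would separate the
effect of code distance from that of data distance.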