Article <vqn2ro$1e8tr$1@dont-email.me>

Path: news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Brett <ggtgp@yahoo.com>
Newsgroups: comp.arch
Subject: Re: An execution time puzzle
Date: Mon, 10 Mar 2025 16:09:28 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 110
Message-ID: <vqn2ro$1e8tr$1@dont-email.me>
References: <2025Mar10.083318@mips.complang.tuwien.ac.at>
 <2025Mar10.095420@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 10 Mar 2025 17:09:28 +0100 (CET)
Injection-Info: dont-email.me; posting-host="5bc61ddf582efb826098dbb44e548c43";
	logging-data="1516475"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+52Mr/59DBhc0YbKrNvvdd"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:6BzZWFU02Yq9SeKt51VVjY9xlwQ=
	sha1:/Eu2Y1u/z2O6p8/0BHCpCmJ/6QY=

Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
> anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>> I have the sequence
>> 
>>         add    $0x8,%rbx
>>         sub    $0x8,%r13
>>         mov    %rbx,0x0(%r13)
>>         mov    %rdx,%rbx
>>         mov    (%rbx),%rax
>>         jmp    *%rax
>>         mov    %r8,%r15
>>         add    $0x10,%rbx
>>         mov    0x0(%r13),%rbx
>>         mov    -0x10(%r15),%rax
>>         mov    %r15,%rdx
>>         add    $0x8,%r13
>>         sub    $0x8,%rbx
>>         jmp    *%rax
>> 
>> The contents of the registers and memory are such that the first jmp
>> continues at the next instruction in the sequence and the second jmp
>> continues at the top of the sequence.  I measure this sequence with
>> perf stat on a Zen4, terminating it with Ctrl-C, and get output like:
>> 
>> 21969657501      cycles
>> 27996663866      instructions  #    1.27  insn per cycle
>> 
>> I.e., about 11 cycles for the whole sequence of 14 instructions.  In
>> trying to understand where these 11 cycles come from, I asked
>> llvm-mca with
>> 
>> cat xxx.s|llvm-mca-16 -mcpu=znver4 --iterations=1000
>> 
>> and it tells me that it thinks that 1000 iterations take 2342 cycles:
>> 
>> Iterations:        1000
>> Instructions:      14000
>> Total Cycles:      2342
>> Total uOps:        14000
>> 
>> Dispatch Width:    6
>> uOps Per Cycle:    5.98
>> IPC:               5.98
>> Block RThroughput: 2.3
>> 
>> So llvm-mca does not predict the actual performance correctly in this
>> case and I still have no explanation for the 11 cycles.
> 
> Even more puzzling: In order to experiment with removing instructions
> I recreated this in assembly language:
> 
>         .text
>         .globl main
> main:
>         mov $threaded, %rdx
>         mov $0, %rbx
>         mov $(returnstack+8),%r13
>         mov %rdx, %r8
> docol:   
>         add    $0x8,%rbx
>         sub    $0x8,%r13
>         mov    %rbx,0x0(%r13)
>         mov    %rdx,%rbx
>         mov    (%rbx),%rax
>         jmp    *%rax
> outout:
>         mov    %r8,%r15
>         add    $0x10,%rbx
>         mov    0x0(%r13),%rbx
>         mov    -0x10(%r15),%rax
>         mov    %r15,%rdx
>         add    $0x8,%r13
>         sub    $0x8,%rbx
>         jmp    *%rax
> 
>         .data
>         .quad docol
>         .quad 0
> threaded:
>         .quad outout
> returnstack:
>         .zero 16,0
> 
> I assembled and linked this with:
> 
> gcc xxx.s -Wl,-no-pie
> 
> I ran the result with
> 
> perf stat -e cycles -e instructions a.out
> 
> terminated it with Ctrl-C and the result is:
> 
> 10764822288      cycles
> 64556841216      instructions #    6.00  insn per cycle 
> 
> I.e., as predicted by llvm-mca.  The main difference AFAICS is that in
> the slow version docol and outout are not adjacent, but far from each
> other, and returnstack is also not close to threaded (and the two
> 64-bit words before it that also belong to threaded).
> 
> It looks like I have found a microarchitectural pitfall, but it's not
> clear what it is.
> 
> - anton

Could you give us the original source code of the function? My x86 is
rusty, and it helps to plug the source into Compiler Explorer to see what
different compilers do.
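
For anyone who wants to check the per-iteration figures quoted above, here
is a minimal sketch of the arithmetic in Python. The counts are copied from
the two perf stat outputs and the llvm-mca summary; the function and
variable names are mine, and it assumes that essentially all counted
instructions belong to the 14-instruction loop.

```python
# Counts copied from the perf stat and llvm-mca output quoted in the thread.
INSNS_PER_ITERATION = 14  # instructions in one pass through the sequence


def cycles_per_iteration(cycles, instructions,
                         insns_per_iter=INSNS_PER_ITERATION):
    """Estimate cycles per loop iteration from raw perf stat counters.

    Assumes startup and teardown instructions are negligible next to the loop.
    """
    iterations = instructions / insns_per_iter
    return cycles / iterations


# Original (slow) run: docol and outout far apart.
slow = cycles_per_iteration(21969657501, 27996663866)

# Hand-written assembly (fast) run: docol and outout adjacent.
fast = cycles_per_iteration(10764822288, 64556841216)

# llvm-mca prediction: 2342 total cycles for 1000 iterations.
mca = 2342 / 1000

print(f"slow: {slow:.2f} cycles/iteration")
print(f"fast: {fast:.2f} cycles/iteration")
print(f"llvm-mca: {mca:.2f} cycles/iteration")
```

This only restates the numbers already in the thread: roughly 11 cycles per
iteration for the slow case, and a fast case that lands close to the
llvm-mca prediction. It says nothing about where the extra cycles come from.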