Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: An execution time puzzle
Date: Tue, 11 Mar 2025 08:18:17 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 136
Message-ID: <2025Mar11.091817@mips.complang.tuwien.ac.at>
References: <2025Mar10.083318@mips.complang.tuwien.ac.at> <2025Mar10.095420@mips.complang.tuwien.ac.at> <2025Mar10.181427@mips.complang.tuwien.ac.at>
Injection-Date: Tue, 11 Mar 2025 10:46:05 +0100 (CET)
Injection-Info: dont-email.me; posting-host="c8a9ea49f90bdbf4e8fcda9a2dd6bc1f";
	logging-data="2016925"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18lxY7UsOpvMbwd7JLdGPA4"
Cancel-Lock: sha1:eIXgGW97R7GcUiLv46X0i8XavRc=
X-newsreader: xrn 10.11
Bytes: 6603

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>And here are measurements with the gcc-10 build on various other
>microarchitectures (IPC=14/(c/it)); lower c/it numbers are better.
>
>cyc/it
>gf   as
> 8   2.3  Zen4
> 8   3    Zen3
> 4   3    Zen2
> 9   9    Zen
> 2.4 2.4  Golden Cove
> 3        Rocket Lake
> 6   3    Gracemont
>10.6      Tremont
>
>It's interesting that several microarchitectures show a difference
>between the version of the code produced by gforth-fast (gf) and my
>assembly-language variant (as) that executes the same instruction
>sequences.

Given that I have trouble reproducing the slowness of gforth-fast in
assembly language, I took another approach.  The Forth source code is:

: foo dup execute-exit ;

So I added a primitive for the combination of DUP and EXECUTE-;S.
This allows exploring the difference between dynamically-generated and
static native code in Gforth.  Here are the different code sequences:

In all versions, the same static docol sequence is used:

add    $0x8,%rbx
sub    $0x8,%r14
mov    %rbx,(%r14)
mov    %rdx,%rbx
mov    (%rbx),%rax
jmp    *%rax

For FOO, there are the following different sequences:

1) dynamic code for "dup execute-exit" (sequence)
2) dynamic code for "dup-execute-exit" (primitive)
3) static code for  "dup-execute-exit" (primitive)

dynamic sequence       dynamic primitive     static primitive
mov %r8,%r15          
add $0x10,%rbx         add $0x8,%rbx      
mov (%r14),%rbx        mov (%r14),%rbx       mov (%r14),%rbx    
mov -0x10(%r15),%rax   mov -0x10(%r8),%rax   mov -0x10(%r8),%rax
mov %r15,%rdx          mov %r8,%rdx          mov %r8,%rdx       
add $0x8,%r14          add $0x8,%r14         add $0x8,%r14      
sub $0x8,%rbx          sub $0x8,%rbx         sub $0x8,%rbx      
jmp *%rax              jmp *%rax             jmp *%rax          

To eliminate the difference between the dynamic and static primitive
variants, I also measured a variant where I manually arranged the
dynamic code to not execute the "add" at the start:

4) static-like dynamic code for "dup-execute-exit" (primitive)

I measured this on a Zen3, which shows a gap between the Gforth code
and the assembly-language code similar to that of the Zen4.  The
results are:

c/it
8    1) dynamic sequence
8    2) dynamic primitive
2    3) static primitive
8    4) static-like dynamic primitive
3    5) 4) with dynamic docol (see below)
2    6) 5) with aligned dynamic docol (see below)

So apparently the difference between static code and dynamic code
causes the slowdown on Zen3 (and probably on Zen4).

5) One reason could be that the dynamic code is far away in the address
space from the static code of the docol.  E.g., in one execution of 4)
the code for docol starts at 0x00005558a3b5eac3 and the code for the
dup-execute-exit starts at 0x00007f937beae764.  In order to test this
theory, I copied the docol code right behind the dup-execute-exit code
and made the pointer to docol point to it.  And indeed, the speed
increased to 3 cycles/iteration.

So the distance plays a role on Zen3 and probably others; I guess they
do not store the full target address in the L1 BTB, so such a far
branch is never promoted to the L1 BTB; it has to use the L2 BTB
instead and takes several cycles.
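
The two addresses quoted for 4) can be compared directly.  The
following Python sketch only does the address arithmetic; the idea
that an L1 BTB entry keeps just the low bits of a target is a guess
used for illustration, not documented Zen3 behaviour, and the helper
names are made up.

```python
# Addresses quoted above for one run of variant 4).
DOCOL = 0x00005558A3B5EAC3             # static docol code
DUP_EXECUTE_EXIT = 0x00007F937BEAE764  # dynamically generated code


def distance(a: int, b: int) -> int:
    """Absolute distance in bytes between two code addresses."""
    return abs(a - b)


def differing_bits(a: int, b: int) -> int:
    """Number of low-order bits needed to cover every bit position
    in which a and b differ (0 if the addresses are equal)."""
    return (a ^ b).bit_length()


if __name__ == "__main__":
    print(hex(distance(DOCOL, DUP_EXECUTE_EXIT)))
    print(differing_bits(DOCOL, DUP_EXECUTE_EXIT))
```

For these two addresses the distance is 0x2a3ad834fca1 bytes and the
addresses differ up to bit 45, so a hypothetical entry holding fewer
than 46 target bits could not describe this jump.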

6) There is still one cycle/iteration of difference between 3) and 5),
but I guess this can be explained by the usual sources of variation,
such as code alignment.  I tested this theory by aligning the copied
docol code to a 32-byte boundary.  And that indeed produced 2
cycles/iteration.
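
As a sketch of the alignment step, here is the usual round-up
computation in Python.  It assumes that "32 naligned" in 6) rounds an
address up to the next multiple of 32; the function name and the
example address are hypothetical.

```python
def align_up(addr: int, alignment: int) -> int:
    """Round addr up to the next multiple of alignment (a power of two)."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a positive power of two")
    return (addr + alignment - 1) & ~(alignment - 1)


if __name__ == "__main__":
    # Hypothetical address a little past the dynamic code quoted above.
    print(hex(align_up(0x7F937BEAE77C, 32)))
```

An address that is already a multiple of 32 is returned unchanged, so
the copy moves by at most 31 bytes.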

Another open issue is that the gcc-12 build of gforth-fast (using r13
instead of r14) is 3 cycles slower than the gcc-10 build.  I don't see
an extension of my BTB theory that would explain this.  So either my
BTB theory is wrong or there is another effect at work.

Here's how you can reproduce this:

For adding the primitive, I added

dup-execute-;s ( xt R:w -- xt )	gforth-internal dup_execute_semis
SET_IP((Xt *)w);
SUPER_END;
VM_JUMP(EXEC1(xt));

to the file prim in Gforth (commit
d96c5dba9343e2b331e183b0594b6ee1622808f7) and rebuilt it (with
gcc-10.2.1).

The measurements were then done on a Ryzen 5800X with:

1) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup execute-;s ; ' foo foo"

2) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo foo"

3) perf stat -e cycles -e instructions ./gforth-fast --no-dynamic -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo foo"

4) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo foo"

5) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo -2 cells + @ ' foo cell+ @ tuck 20 move ' foo -2 cells + ! ' foo foo"

6) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo -2 cells + @ ' foo cell+ @ 32 naligned tuck 20 move ' foo -2 cells + ! ' foo foo"

This code always ends in an endless loop, so I pressed Ctrl-C after a
second or so, and then computed the cycles per iteration (c/it) as

(cycles/instructions)*(instructions/iteration)

where instructions/iteration is 14 for 1), 13 for 2) and 12 for the others.
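
The computation can be sketched in Python as follows.  The
per-variant instruction counts are the ones given above; the function
name and the counter values in the example are made up and are not
measurements from these runs.

```python
# Instructions per iteration for each variant, as stated above.
INSNS_PER_ITERATION = {1: 14, 2: 13, 3: 12, 4: 12, 5: 12, 6: 12}


def cycles_per_iteration(cycles: int, instructions: int, variant: int) -> float:
    """(cycles/instructions) * (instructions/iteration), with the
    multiplication done first to limit rounding error."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles * INSNS_PER_ITERATION[variant] / instructions


if __name__ == "__main__":
    # Made-up counter values, not measurements from the runs above.
    print(cycles_per_iteration(8_000_000_000, 14_000_000_000, 1))
```

The cycles and instructions arguments are the two totals that perf
stat prints when the run is interrupted.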

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>