Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: An execution time puzzle
Date: Tue, 11 Mar 2025 08:18:17 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 136
Message-ID: <2025Mar11.091817@mips.complang.tuwien.ac.at>
References: <2025Mar10.083318@mips.complang.tuwien.ac.at> <2025Mar10.095420@mips.complang.tuwien.ac.at> <2025Mar10.181427@mips.complang.tuwien.ac.at>
Injection-Date: Tue, 11 Mar 2025 10:46:05 +0100 (CET)
Injection-Info: dont-email.me; posting-host="c8a9ea49f90bdbf4e8fcda9a2dd6bc1f"; logging-data="2016925"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18lxY7UsOpvMbwd7JLdGPA4"
Cancel-Lock: sha1:eIXgGW97R7GcUiLv46X0i8XavRc=
X-newsreader: xrn 10.11
Bytes: 6603

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>And here are measurements with the gcc-10 build on various other
>microarchitectures (IPC=14/(c/it)); lower c/it numbers are better.
>
>cyc/it
>gf as
> 8 2.3 Zen4
> 8 3 Zen3
> 4 3 Zen2
> 9 9 Zen
> 2.4 2.4 Golden Cove
> 3 Rocket Lake
> 6 3 Gracemont
>10.6 Tremont
>
>It's interesting that several microarchitectures show a difference
>between the version of the code produced by gforth-fast (gf) and my
>assembly-language variant (as) that executes the same instruction
>sequences.

Given that I have troubles reproducing the slowness in gforth-fast
with assembly language, I took another approach: The Forth source code
is:

: foo dup execute-exit ;

So I added a primitive for the combination of DUP and EXECUTE-;S.
This allows exploring the difference between dynamically-generated and
static native code in Gforth.

Here are the different code sequences:

In all versions, the same static docol sequence is used:

add $0x8,%rbx
sub $0x8,%r14
mov %rbx,(%r14)
mov %rdx,%rbx
mov (%rbx),%rax
jmp *%rax

For FOO, there are the following different sequences:

1) dynamic code for "dup execute-exit" (sequence)
2) dynamic code for "dup-execute-exit" (primitive)
3) static code for "dup-execute-exit" (primitive)

dynamic sequence       dynamic primitive      static primitive
mov %r8,%r15
add $0x10,%rbx         add $0x8,%rbx
mov (%r14),%rbx        mov (%r14),%rbx        mov (%r14),%rbx
mov -0x10(%r15),%rax   mov -0x10(%r8),%rax    mov -0x10(%r8),%rax
mov %r15,%rdx          mov %r8,%rdx           mov %r8,%rdx
add $0x8,%r14          add $0x8,%r14          add $0x8,%r14
sub $0x8,%rbx          sub $0x8,%rbx          sub $0x8,%rbx
jmp *%rax              jmp *%rax              jmp *%rax

To eliminate the difference between the dynamic and static primitive
variants, I also measured a variant where I manually arranged the
dynamic code to not execute the "add" at the start:

4) static-like dynamic code for "dup-execute-exit" (primitive)

I measured this on a Zen3, which has a similar difference between the
Gforth code and the assembly-language code as the Zen4. The results
are:

c/it
 8  1) dynamic sequence
 8  2) dynamic primitive
 2  3) static primitive
 8  4) static-like dynamic primitive
 3  5) 4) with dynamic docol (see below)
 2  6) 5) with aligned dynamic docol (see below)

So apparently the difference between static code and dynamic code
causes the slowdown on Zen3 (and probably on Zen4).

5) One reason could be that the dynamic code is far away in the
address space from the static code of the docol. E.g., in one
execution of 4) the code for docol starts at 0x00005558a3b5eac3 and
the code for the dup-execute-exit starts at 0x00007f937beae764. In
order to test this theory, I copied the docol code right behind the
dup-execute-exit code and made the pointer to docol point to it. And
indeed, the speed increased to 3 cycles/iteration.

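For a sense of scale, the distance between the two addresses quoted above can be worked out with a few lines of Python. This is only a sketch of the arithmetic: the function names are hypothetical, and the 32-bit offset width used in the last check is an illustrative assumption, since the post does not say how many target bits the Zen3 L1 BTB stores.

```python
# Addresses quoted in the post for one execution of variant 4).
DOCOL_ADDR = 0x00005558A3B5EAC3  # start of the static docol code
PRIM_ADDR = 0x00007F937BEAE764   # start of the dynamic dup-execute-exit code


def branch_distance(src: int, dst: int) -> int:
    """Absolute distance in bytes between two code addresses."""
    return abs(dst - src)


def bits_needed(distance: int) -> int:
    """Number of bits needed to hold the distance as an unsigned value."""
    return distance.bit_length()


def fits_signed_offset(distance: int, width: int) -> bool:
    """True if the distance fits in a signed offset of `width` bits.

    `width` is a hypothetical parameter; the real BTB entry format
    is not given in the post.
    """
    return distance < (1 << (width - 1))


if __name__ == "__main__":
    d = branch_distance(DOCOL_ADDR, PRIM_ADDR)
    print(f"distance: {d:#x} bytes, needs {bits_needed(d)} bits")
    print("fits in an assumed signed 32-bit offset:", fits_signed_offset(d, 32))
```

Copying docol right behind the dup-execute-exit code, as in variant 5), shrinks this distance to a few tens of bytes, which would fit in any plausible offset width.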
So the distance plays a role in Zen3 and probably others; I guess they
do not store the full length of the target in the L1 BTB, and such a
far branch therefore is never promoted to the L1 BTB; the branch
therefore uses the L2 BTB and takes several cycles.

6) There is still one cycle/iteration of difference between 3) and 5),
but I guess this can be explained with the usual sources of
variations, such as code alignment variations. I tried this theory by
aligning the copied docol code to a 32-byte boundary. And that indeed
produced 2 cycles/iteration.

Another open issue is that the gcc-12 build of gforth-fast (using r13
instead of r14) is 3 cycles slower than the gcc-10 build. I don't see
an extension of my BTB theory that would explain this. So either my
BTB theory is wrong or there is another effect at work.

Here's how you can reproduce this:

For adding the primitive, I added

dup-execute-;s ( xt R:w -- xt ) gforth-internal dup_execute_semis
SET_IP((Xt *)w);
SUPER_END;
VM_JUMP(EXEC1(xt));

to the file prim in Gforth (commit
d96c5dba9343e2b331e183b0594b6ee1622808f7) and rebuilt it (with
gcc-10.2.1). The measurements were then done on a Ryzen 5800X with:

1) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup execute-;s ; ' foo foo"

2) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo foo"

3) perf stat -e cycles -e instructions ./gforth-fast --no-dynamic -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo foo"

4) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo foo"

5) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo -2 cells + @ ' foo cell+ @ tuck 20 move ' foo -2 cells + ! ' foo foo"

6) perf stat -e cycles -e instructions ./gforth-fast -e "' disasm-gdb is discode : foo dup-execute-;s ; ' foo @ 4 + ' foo ! ' foo -2 cells + @ ' foo cell+ @ 32 naligned tuck 20 move ' foo -2 cells + ! ' foo foo"

This code always ends in an endless loop, so I pressed Ctrl-C after a
second or so, and then computed

(cycles/instructions)*(instructions/iteration)

where instructions/iteration is 14 for 1), 13 for 2) and 12 for the
others.

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
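
The cycles/iteration computation described in the post can be sketched in Python as follows. The function name and the counter values in the usage part are hypothetical and are not measurements from the post; only the formula and the instructions/iteration figures (14, 13, 12) come from the text. The expression is rearranged to multiply before dividing, which is algebraically the same as (cycles/instructions)*(instructions/iteration).

```python
# Instructions per iteration for each variant, as stated in the post.
INSTS_PER_ITER = {1: 14, 2: 13, 3: 12, 4: 12, 5: 12, 6: 12}


def cycles_per_iteration(cycles: int, instructions: int, insts_per_iter: int) -> float:
    """Compute (cycles/instructions) * (instructions/iteration)."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles * insts_per_iter / instructions


if __name__ == "__main__":
    # Hypothetical perf stat counter values, chosen only to show the arithmetic.
    cycles, instructions = 8_000_000, 12_000_000
    variant = 4
    c_it = cycles_per_iteration(cycles, instructions, INSTS_PER_ITER[variant])
    print(f"variant {variant}: {c_it:.1f} cycles/iteration")
```

Because the loop never terminates and is interrupted by hand, only the ratio of the two counters matters; how long the run lasted before Ctrl-C drops out of the result.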