From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: number of registers
Date: Wed, 21 Aug 2024 10:13:12 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Aug21.121312@mips.complang.tuwien.ac.at>
References: <v98asi$rulo$1@dont-email.me>
 <38055f09c5d32ab77b9e3f1c7b979fb4@www.novabbs.org>
 <v991kh$vu8g$1@dont-email.me>
 <e4352bad7240a6276e453226136ea0b3@www.novabbs.org>
 <va049n$2vnr7$1@dont-email.me>
 <a566ca0c8b5c41f402b60e8bac445e24@www.novabbs.org>
 <2024Aug20.090149@mips.complang.tuwien.ac.at>
 <a3a57791722f7c21c4218f5be6226e97@www.novabbs.org>
 <20240820204050.00003d56@yahoo.com>
 <48438024ccdbcc373e4cfa51d18066f5@www.novabbs.org>

mitchalsup@aol.com (MitchAlsup1) writes:
>The point is that the cost of not getting allocated into a register
>is vastly lower--the count of instructions remains 1 while the
>latency increases. That increase in latency does not hurt those
>use once/seldom variables.

Latency is not the issue in modern high-performance AMD64 cores, which
have zero-cycle store-to-load forwarding
<http://www.complang.tuwien.ac.at/anton/memdep/>.

And yet, putting variables in registers gives a significant speedup:
On a Rocket Lake, numbers are times in seconds:

sieve bubble matrix   fib   fft
0.075  0.070  0.036 0.049 0.017 TOS in reg, RP in reg, IP in reg
0.100  0.149  0.054 0.106 0.037 TOS in mem, RP in mem, IP write-through to mem

In the first line, I used gforth-fast and tried to disable all
optimizations except those that keep certain variables in registers:

gforth-fast --ss-states=1 --ss-number=31 --opt-ip-updates=0 onebench.fs

I could not reduce the static superinstructions below 31 and still get
a result; I will have to investigate why, but that probably does not
make that much of a difference for several of these benchmarks.

In the second line I used gforth, an engine that keeps the top of
stack in memory, the return-stack pointer in memory, stores IP to
memory after every change, and does not use static superinstructions,
all for better identifying where an error happened.

>The the examples cited, the lack of register allocation triples
>the instruction count due to lack of LD-OP and LD-OP-ST. The
>register count I stated is how many registers would a
>non-LD-OP machine need to break even on the instruction count.

What makes you think that instruction count is particularly relevant?
Yes, you may save some decoding resources if you use LD-OP-ST on an
architecture that supports it, but you first had to invest in a more
complex decoder. And in the OoO engine the difference may be gone (at
least on Intel CPUs).
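
As a side note on the timing table above, the per-benchmark slowdown of the memory-based engine can be worked out in a few lines of Python. This is only an illustrative sketch: the names BENCHMARKS, TIMES_REG, TIMES_MEM and slowdowns are hypothetical, and the numbers are copied from the table.

```python
# Times in seconds, copied from the Rocket Lake table above.
BENCHMARKS = ["sieve", "bubble", "matrix", "fib", "fft"]
TIMES_REG = [0.075, 0.070, 0.036, 0.049, 0.017]  # gforth-fast: TOS, RP, IP in registers
TIMES_MEM = [0.100, 0.149, 0.054, 0.106, 0.037]  # gforth: TOS, RP in memory, IP written through


def slowdowns(mem_times, reg_times):
    """Return the mem/reg time ratio for each benchmark (hypothetical helper)."""
    return [m / r for m, r in zip(mem_times, reg_times)]


if __name__ == "__main__":
    for name, ratio in zip(BENCHMARKS, slowdowns(TIMES_MEM, TIMES_REG)):
        print(f"{name:8s} {ratio:.2f}x")
```

The ratios fall between roughly 1.3 and 2.2. They compare the two engines as measured, so they also include whatever the remaining 31 static superinstructions contribute.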

Consider the Forth program

: squared dup * ;

This results in the following code sequences for the two engines
mentioned above:

dup 1->1                    dup 0->0
                            mov     $50[r13],r15
add     rbx,$08             add     r15,$08
mov     $00[r13],r8         mov     rax,[r14]
sub     r13,$08             sub     r14,$08
                            mov     [r14],rax
* 1->1                      * 0->0
                            mov     $50[r13],r15
add     rbx,$08             add     r15,$08
                            mov     rax,$08[r14]
imul    r8,$08[r13]         imul    rax,[r14]
add     r13,$08             add     r14,$08
                            mov     [r14],rax
;s 1->1                     ;s 0->0
                            mov     $50[r13],r15
                            mov     rax,$58[r13]
mov     rbx,[r14]           mov     r10,[rax]
add     r14,$08             add     rax,$08
                            mov     $58[r13],rax
                            mov     r15,r10
mov     rax,[rbx]           mov     rcx,[r15]
jmp     rax                 jmp     rcx

TOS=r8, RP=r14, IP=rbx      TOS=[r14], RP=$58[r13], IP=r15/$50[r13]

The registers are allocated differently in the two engines; for the
three things where the memory/register allocation differed, I have
shown the allocation.

One interesting case is the sequence

7FA02A77133D:   mov     rax,$58[r13]
7FA02A771341:   mov     r10,[rax]
7FA02A771344:   add     rax,$08
7FA02A771348:   mov     $58[r13],rax

Sure you could use a load-op-store instruction for adding 8 to
$58[r13], but the mov in 7FA02A771341 still needs the value in a
register, so apparently gcc (which produced the code snippets for the
individual Forth words above) decided that it's better to save
execution resources rather than reduce the number of instructions (at
a higher execution resource cost) by writing the code as

mov     rax,$58[r13]
add     $58[r13], $8
mov     r10,[rax]

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>