Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Stephen Fuld <sfuld@alumni.cmu.edu.invalid>
Newsgroups: comp.arch
Subject: Re: number of registers
Date: Wed, 21 Aug 2024 08:49:10 -0700
Organization: A noiseless patient Spider
Lines: 53
Message-ID: <va529m$1uo39$1@dont-email.me>
References: <v98asi$rulo$1@dont-email.me>
 <38055f09c5d32ab77b9e3f1c7b979fb4@www.novabbs.org>
 <v991kh$vu8g$1@dont-email.me>
 <e4352bad7240a6276e453226136ea0b3@www.novabbs.org>
 <va049n$2vnr7$1@dont-email.me>
 <a566ca0c8b5c41f402b60e8bac445e24@www.novabbs.org>
 <2024Aug20.090149@mips.complang.tuwien.ac.at>
 <a3a57791722f7c21c4218f5be6226e97@www.novabbs.org>
 <20240820204050.00003d56@yahoo.com>
 <48438024ccdbcc373e4cfa51d18066f5@www.novabbs.org>
 <2024Aug21.121312@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 21 Aug 2024 17:49:11 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="3e6a1e730eaf8f568c20cfe9a6f00305"; logging-data="2056297"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+KbSJoqzfj35rcdnoS2gOeZ/0vCa6kyjA="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:RGzCvBISD4vuA6H5/rRzpdKlIRw=
In-Reply-To: <2024Aug21.121312@mips.complang.tuwien.ac.at>
Content-Language: en-US
Bytes: 3878

On 8/21/2024 3:13 AM, Anton Ertl wrote:
> mitchalsup@aol.com (MitchAlsup1) writes:
>> The point is that the cost of not getting allocated into a register
>> is vastly lower--the count of instructions remains 1 while the
>> latency increases. That increase in latency does not hurt those
>> use once/seldom variables.
>
> Latency is not the issue in modern high-performance AMD64 cores, which
> have zero-cycle store-to-load forwarding
> <http://www.complang.tuwien.ac.at/anton/memdep/>.
>
> And yet, putting variables in registers gives a significant speedup:
> On a Rocket Lake, numbers are times in seconds:
>
>  sieve bubble matrix    fib    fft
>  0.075  0.070  0.036  0.049  0.017 TOS in reg, RP in reg, IP in reg
>  0.100  0.149  0.054  0.106  0.037 TOS in mem, RP in mem, IP write-through to mem
>
> In the first line, I used gforth-fast and tried to disable all
> optimizations except those that keep certain variables in registers:
>
> gforth-fast --ss-states=1 --ss-number=31 --opt-ip-updates=0 onebench.fs
>
> I could not reduce the static superinstructions below 31 and still get
> a result; I will have to investigate why, but that probably does not
> make that much of a difference for several of these benchmarks.
>
> In the second line I used gforth, an engine that keeps the top of
> stack in memory, the return-stack pointer in memory, stores IP to
> memory after every change, and does not use static superinstructions,
> all for better identifying where an error happened.
>
>> The the examples cited, the lack of register allocation triples
>> the instruction count due to lack of LD-OP and LD-OP-ST. The
>> register count I stated is how many registers would a
>> non-LD-OP machine need to break even on the instruction count.
>
> What makes you think that instruction count is particularly relevant?
> Yes, you may save some decoding resources if you use LD-OP-ST on an
> architecture that supports it, but you first had to invest into a more
> complex decoder. And in the OoO engine the difference may be gone (at
> least on Intel CPUs).

There are also some savings in reduced I-cache usage (possibly leading
to a higher I-cache hit rate), reduced I-fetch memory bandwidth
required, etc., though these may be modest at best.

-- 
  - Stephen Fuld
(e-mail address disguised to prevent spam)
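
For a rough sense of scale, the quoted Rocket Lake times can be turned into per-benchmark slowdown ratios. This is only a sketch: the numbers are copied from the quoted table, the variable and function names are made up for illustration, and the two runs differ in more than register allocation (the gforth engine also does not use static superinstructions), so the ratios are an upper bound on what register allocation alone accounts for.

```python
# Times in seconds, copied from the quoted table (Rocket Lake).
# Names below are illustrative only; they do not come from Gforth.
benchmarks = ["sieve", "bubble", "matrix", "fib", "fft"]
in_reg = [0.075, 0.070, 0.036, 0.049, 0.017]  # gforth-fast: TOS, RP, IP in registers
in_mem = [0.100, 0.149, 0.054, 0.106, 0.037]  # gforth: TOS, RP in memory, IP written through


def slowdowns(reg_times, mem_times):
    """Return the mem/reg time ratio for each benchmark."""
    return [m / r for r, m in zip(reg_times, mem_times)]


ratios = slowdowns(in_reg, in_mem)
for name, ratio in zip(benchmarks, ratios):
    print(f"{name:>6}: {ratio:.2f}x")
```

On these numbers the memory-resident engine is between roughly 1.3x and 2.2x slower, with sieve and matrix at the low end and bubble, fib and fft all above 2x.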