Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: My 66000 and High word facility
Date: Mon, 12 Aug 2024 14:27:22 -0500
Organization: A noiseless patient Spider
Lines: 327
Message-ID: <v9dnmv$3efnj$1@dont-email.me>
References: <v98asi$rulo$1@dont-email.me>
 <38055f09c5d32ab77b9e3f1c7b979fb4@www.novabbs.org>
 <v991kh$vu8g$1@dont-email.me>
 <2024Aug11.163333@mips.complang.tuwien.ac.at>
 <v9ath5$2qgnb$1@dont-email.me>
 <2024Aug12.082936@mips.complang.tuwien.ac.at>
 <130df049c4c97984986767736b5b037a@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 12 Aug 2024 21:27:28 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="fdd5012966402c1d824c86f16771acca";
 logging-data="3620595"; mail-complaints-to="abuse@eternal-september.org";
 posting-account="U2FsdGVkX1+v55FKqq9dEMrFVMWRZK7Hirx6BuF3U9E="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:42hCswvVtoR/d5cozArM6I+4QSs=
In-Reply-To: <130df049c4c97984986767736b5b037a@www.novabbs.org>
Content-Language: en-US
Bytes: 10511

On 8/12/2024 12:36 PM, MitchAlsup1 wrote:
> On Mon, 12 Aug 2024 6:29:36 +0000, Anton Ertl wrote:
>
>> Brett <ggtgp@yahoo.com> writes:
>>> Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>>>> Brett <ggtgp@yahoo.com> writes:
>>>>> The lack of CPU’s with 64 registers is what makes for a market,
>>>>> that 4% that could benefit have no options to pick from.
>>>>
>>>> They had:
>>>>
>>>> SPARC: Ok, only 32 GPRs available at a time, but more in hardware
>>>> through the Window mechanism.
>>>>
>>>> AMD29K: IIRC a 128-register stack and 64 additional registers
>>>>
>>>> IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
>>>> files to make good use of them.
>>>
>>> All antiques no longer available.
>>
>> SPARC is still available: <https://en.wikipedia.org/wiki/SPARC> says:
>>
>> |Fujitsu will also discontinue their SPARC production [...] end-of-sale
>> |in 2029, of UNIX servers and a year later for their mainframe.
>>
>> No word of when Oracle will discontinue (or has discontinued) sales,
>> but both companies introduced their last SPARC CPUs in 2017.
>>
>> In any case, my point still stands: these architectures were
>> available, and the large number of registers failed to give them a
>> decisive advantage. Maybe it even gave them a decisive disadvantage:
>> AMD29K and IA-64 never had OoO implementations, and SPARC got them
>> only with the Fujitsu SPARC64 V in 2002 and the Oracle SPARC T4 in
>> 2011, years after Intel, MIPS, HP switched to OoO in 1995/1996 and
>> Power and Alpha switched in 1998 (POWER3, 21264).
>>
>>>> Where is your 4% number coming from?
>>>
>>> The 4% number is poor memory and a guess.
>>> Here is an antique paper on the issue:
>>>
>>> https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf
>>
>> Interesting. I only skimmed the paper, but I read a lot about
>> inlining and interprocedural register allocation. SPARC's register
>> windows and AMD29K's and IA-64's register stacks were intended to be
>> useful for that, but somehow the other architectures did not suffer a
>> big-enough disadvantage to make them adopt one of these concepts, and
>> that's despite register windows/stacks working even for indirect calls
>> (e.g., method calls in the general case), where interprocedural
>> register allocation or inlining don't help.
>>
>> It seems to me that with OoO the cycle cost of spilling and refilling
>> on call boundaries was lowered: the spills can be delayed until the
>> computation is complete, and the refills can start early because the
>> stack pointer tends to be available early.
>>
>> And recent OoO CPUs even have zero-cycle store-to-load forwarding, so
>> even if the called function is short, the spilling and refilling
>> around it (if any) does not increase the latency of the value that's
>> spilled and refilled. But that consideration is only relevant for
>> Intel APX, ARM A64 and RISC-V went for 32 registers several years
>> before zero-cycle store-to-load-forwarding was implemented.
>>
>> One other optimization that they use the additional registers for is
>> "register promotion", i.e., putting values from memory into registers
>> for a while (if absence of aliasing can be proven). One interesting
>> aspect here is that register promotion with 64 or 256 registers (RP-64
>> and RP-256) is usually not much better (if better at all) than
>> register promotion with 32 registers (RP-32); see Figure 1. So
>> register promotion does not make a strong case for more registers,
>> either, at least in this paper.
>
> With full access to constants, there is even less need to promote
> addresses or immediates into registers as you can simply poof them
> up anything you want one.

There are still tradeoffs, if constants need space to encode...

Inline is still better than a memory load, granted.

It may make sense to consolidate multiple uses of a value into a register rather than try encoding it as an immediate each time.

....

For example, I was working on adding the code to display HDR pixels to the screen (this needs a conversion to RGB555).

First attempt:

TKGDI_CopyPixelSpan_GetRGB24H:
  MOVU.L    (R4), R6
  PSHUF.W   R5, 0x00, R20             //word shuffle
  PSHUF.W   R5, 0x55, R21
  PLDCM8UH  R6, R16                   //FP8U to Binary16
  PMUL.H    R16, R20, R16             //Scale
  PADD.H    R16, R21, R16             //Bias
  MOV       0x3C003C003C003C00, R17   // 4x 1.0
  PADD.H    R16, R17, R18             // Map to 1.0 .. 1.999
  TSTQ      0x0000C000C000C000, R18
  BF        .L1
.L0:
  MOV       0xFFFF000000000000, R7    //alpha ones
  PCVTH2UW  R18, R19                  //Convert to packed word
  OR        R19, R7, R5               //Set alpha all ones
  RGB5PCK64 R5, R2                    //convert to RGB555
  RTS
.L1:
  TSTQ      0x00000000C000, R18
  AND?F     0xFFFFFFFF0000, R18
  OR?F      0x000000003BFF, R18
  TSTQ      0x0000C0000000, R18
  AND?F     0xFFFF0000FFFF, R18
  OR?F      0x00003BFF0000, R18
  TSTQ      0xC00000000000, R18
  AND?F     0x0000FFFFFFFF, R18
  OR?F      0x3BFF00000000, R18
  BRA       .L0

This was valid ASM in my case, but the constants are still bulky.

The unsigned SIMD convert works over the 1.0 to 1.999 range or so.

The RGB555 converter needs alpha set so that it knows the pixel is opaque (otherwise, it may try to use the alpha encoding and reduce color fidelity). For now, it assumes opaque images for screen and window framebuffers.

Then I noted that I already had a few instructions for the purpose of range clamping, so it became:

TKGDI_CopyPixelSpan_GetRGB24H:
  MOVU.L    (R4), R6
  PSHUF.W   R5, 0x00, R20
  PSHUF.W   R5, 0x55, R21
  PLDCM8UH  R6, R16
  MOV       0x3C003C003C003C00, R17
  PMUL.H    R16, R20, R16
  MOV       0x3FFF3FFF3FFF3FFF, R22
  PADD.H    R16, R21, R16
  PADD.H    R16, R17, R18
  PCMPGT.H  R22, R18
  PCSELT.W  R22, R18, R18
  PCMPGT.H  R18, R17
  PCSELT.W  R17, R18, R18
  MOV       0xFFFF000000000000, R7
  PCVTH2UW  R18, R19
  OR        R19, R7, R5
  RGB5PCK64 R5, R2
  RTS

This worked a little better... But, still, not very fast.

========== REMAINDER OF ARTICLE TRUNCATED ==========
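
For reference, the range clamp in the second listing can be sketched in plain C roughly as below. This is only an illustration under assumptions: the function names are made up, and it assumes the intent is to force each of the four Binary16 lanes into the bit-pattern range 0x3C00..0x3FFF (1.0 to about 1.999) before the unsigned convert. It does not model the exact operand semantics of PCMPGT.H or PCSELT.W.

```c
#include <stdint.h>

/* Hypothetical helper: clamp one Binary16 bit pattern to 0x3C00..0x3FFF,
 * i.e. 1.0 up to just under 2.0. For non-negative halves, the bit
 * patterns sort in the same order as the values they encode. */
static uint16_t clamp_half_1_to_2(uint16_t h)
{
    if (h & 0x8000u) return 0x3C00u;  /* sign bit set: below the range */
    if (h > 0x3FFFu) return 0x3FFFu;  /* 2.0 or more (also Inf/NaN)    */
    if (h < 0x3C00u) return 0x3C00u;  /* positive but below 1.0        */
    return h;
}

/* Hypothetical scalar stand-in for the packed clamp: apply the
 * per-lane clamp to all four 16-bit lanes of a 64-bit register value. */
uint64_t clamp_lanes_1_to_2(uint64_t v)
{
    uint64_t out = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint16_t h = (uint16_t)(v >> (16 * lane));
        out |= (uint64_t)clamp_half_1_to_2(h) << (16 * lane);
    }
    return out;
}
```

The two constants are the same ones loaded into R17 and R22 above. A scalar loop like this is obviously not how one would want to do it in the pixel path, it just spells out what the packed compare/select pairs are meant to achieve.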