From: BGB-Alt <bohannonindustriesllc@gmail.com>
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Tue, 26 Mar 2024 16:59:57 -0500
Organization: A noiseless patient Spider
Message-ID: <utvggu$2cgkl$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
 <utsrft$1b76a$1@dont-email.me>
 <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
 <uttfk3$1j3o3$1@dont-email.me>
 <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
In-Reply-To: <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>

On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
> BGB wrote:
>
>> On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
>>> BGB-Alt wrote:
>>>
>
>> Say, "we have an instruction, but it is a boat anchor" isn't an ideal
>> situation (unless to be a placeholder for if/when it is not a boat
>> anchor).
>
> If the boat anchor is a required unit of functionality, and I believe
> IDIV and FPDIV is, it should be defined in ISA and if you can't afford
> it find some way to trap rapidly so you can fix it up without excessive
> overhead. Like a MIPS TLB reload. If you can't get trap and emulate at
> sufficient performance, then add the HW to perform the instruction.
>

Though, 32-bit ARM managed OK without integer divide.

In my case, I ended up supporting it mostly for sake of the RV64 'M'
extension, but it is in this case a little faster than a pure software
solution (unlike on the K10 and Piledriver).

Still costs around 1.5 kLUTs though for 64-bit MUL/DIV support, and a
little more to route FDIV through it.

Cheapest FPU approach is still the "ADD/SUB/MUL only" route.

>>>> again. Might also make sense to add an architectural zero register,
>>>> and eliminate some number of encodings which exist merely because of
>>>> the lack of a zero register (though, encodings are comparably cheap,
>>>> as the
>>>
>>> I got an effective zero register without having to waste a register
>>> name to "get it". My 66000 gives you 32 registers of 64-bits each and
>>> you can put any bit pattern in any register and treat it as you like.
>>> Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
>>> available.
>>>
>
>> I guess offloading this to the compiler can also make sense.
>
>> Least common denominator would be, say, not providing things like NEG
>> instructions and similar (pretending as-if one had a zero register),
>> and if a program needs to do a NEG or similar, it can load 0 into a
>> register itself.
>
>> In the extreme case (say, one also lacks a designated "load immediate"
>> instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy
>> to zero a register...
>
> MOV Rd,#imm16
>
> Cost 1 instruction of 32-bits in size and can be performed in 0 cycles
>

Though, RV had skipped this:
  ADD Xd, Zero, Imm12s
Or:
  LUI Xd, ImmHi20
  ADD Xd, Xd, ImmLo12s

One can argue for this on the basis of not needing an immediate-load
instruction (nor a MOV instruction, nor NEG, nor ...).
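
As a rough illustration of the LUI+ADD pair (not taken from any particular
toolchain): the low 12 bits are sign-extended by the ADD, so the upper 20
bits have to be bumped by one whenever bit 11 of the constant is set. The
sketch below models this with 32-bit wraparound, which is an assumption on
my part (RV32 behavior, or what LUI followed by ADDIW gives on RV64). The
function names are made up for the example.

```python
def split_lui_addi(value):
    """Split a signed 32-bit constant into (hi20, lo12) for a LUI + ADD pair.

    Hypothetical helper for illustration only.
    """
    if not -(1 << 31) <= value < (1 << 31):
        raise ValueError("constant does not fit in 32 bits")
    # The 12-bit immediate is sign-extended, so interpret the low bits as signed.
    lo12 = ((value & 0xFFF) ^ 0x800) - 0x800
    # The remainder is a multiple of 4096; keep it as a 20-bit field.
    hi20 = ((value - lo12) >> 12) & 0xFFFFF
    return hi20, lo12


def materialize(hi20, lo12):
    """Model LUI then ADD with 32-bit wraparound; returns a signed 32-bit value."""
    upper = (hi20 << 12) & 0xFFFFFFFF       # LUI: immediate goes into bits 31..12
    result = (upper + lo12) & 0xFFFFFFFF    # ADD of the sign-extended low part
    return (result ^ 0x80000000) - 0x80000000


if __name__ == "__main__":
    for v in (0x12345678, 0x12345FFF, 262144, -1):
        hi, lo = split_lui_addi(v)
        print(f"{v:#x}: LUI {hi:#x}; ADD {lo}")
```

For 0x12345678 this gives the 0x12345 / 0x678 split; for 0x12345FFF the
upper part becomes 0x12346 and the low part is -1.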

Though, yeah, in my case I ended up with more variety here:
  LDIZ   Imm10u, Rn  //10-bit, zero-extend, Imm12u (XG2)
  LDIN   Imm10n, Rn  //10-bit, one-extend, Imm12n (XG2)
  LDIMIZ Imm10u, Rn  //Rn=Imm10u<<16 (newish)
  LDIMIN Imm10n, Rn  //Rn=Imm10n<<16 (newish)
  LDIHI  Imm10u, Rn  //Rn=Imm10u<<22
  LDIQHI Imm10u, Rn  //Rn=Imm10u<<54

  LDIZ   Imm16u, Rn  //16-bit, zero-extend
  LDIN   Imm16n, Rn  //16-bit, one-extend

Then 64-bit jumbo forms:
  LDI    Imm33s, Rn  //33-bit, sign-extend
  LDIHI  Imm33s, Rn  //Rn=Imm33s<<16
  LDIQHI Imm33s, Rn  //Rn=Imm33s<<32

Then, 96 bit:
  LDI    Imm64, Rn   //64-bit, sign-extend

And, some special cases:
  FLDCH  Imm16u, Rn  //Binary16->Binary64

One could argue though that this is wild extravagance...

The recent addition of LDIMIx was mostly because otherwise one needed a
64-bit encoding to load constants like 262144 or similar (and a lot of
bit-masks).

At one point I did evaluate a more ARM32-like approach (effectively
using a small value and a rotate). But, this cost more than the other
options (would have required the great evil of effectively being able to
feed two immediate values into the integer-shift unit, whereas many of
the others could be routed through logic I already have for other ops).

Though, one can argue that the drawback is that one does end up with
more instructions in the ISA listing.

>> Say:
>>   XOR R14, R14, R14  //Designate R14 as pseudo-zero...
>>   ...
>>   ADD R14, 0x123, R8 //Load 0x123 into R8
>
>> Though, likely still makes sense in this case to provide some
>> "convenience" instructions.
>
>
>>>> internal uArch has a zero register, and effectively treats immediate
>>>> values as a special register as well, ...). Some of the debate is
>>>> more related to the logic cost of dealing with some things in the
>>>> decoder.
>>>
>>> The problem is universal constants. RISCs being notably poor in their
>>> support--however this is better than addressing modes which require
>>> µCode.
>>>
>
>> Yeah.
>
>> I ended up with jumbo-prefixes.
>> Still not perfect, and not perfectly
>> orthogonal, but mostly works.
>
>> Allows, say:
>>   ADD R4, 0x12345678, R6
>
>> To be performed in potentially 1 clock-cycle and with a 64-bit
>> encoding, which is better than, say:
>>   LUI X8, 0x12345
>>   ADD X8, X8, 0x678
>>   ADD X12, X10, X8
>
> This strategy completely fails when the constant contains more than 32-bits
>
>     FDIV R9,#3.141592653589247,R17
>
> When you have universal constants (including 5-bit immediates), you rarely
> need a register containing 0.
>

The jumbo prefixes at least allow for a 64-bit constant load, but as-is
not for 64-bit immediate values to 3RI ops. The latter could be done,
but would require 128-bit fetch and decode, which doesn't seem worth it.

There is the limbo feature of allowing for 57-bit immediate values, but
this is optional.

OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.

Typical GCC response on RV64 seems to be to turn nearly all of the
big-constant cases into memory loads, which kinda sucks.

Even something like a "LI Xd, Imm17s" instruction, would notably reduce
the number of constants loaded from memory (as GCC seemingly prefers to
use a LHU or LW or similar rather than encode it using LUI+ADD).

I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or
S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
them enabled in the CPU core (they involved the non-zero cost of
repacking them into Binary16 in ID1 and then throwing a
Binary16->Binary64 converter into the ID2 stage).

========== REMAINDER OF ARTICLE TRUNCATED ==========