From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Wed, 27 Mar 2024 13:21:04 -0500
Message-ID: <uu1o2p$30cnr$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
 <utsrft$1b76a$1@dont-email.me>
 <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
 <uttfk3$1j3o3$1@dont-email.me>
 <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
 <utvggu$2cgkl$1@dont-email.me>
 <3dd12c0fe2471bf4b9fcaffaed8256ab@www.novabbs.org>
In-Reply-To: <3dd12c0fe2471bf4b9fcaffaed8256ab@www.novabbs.org>

On 3/26/2024 7:02 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
>
>> On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> I ended up with jumbo-prefixes. Still not perfect, and not perfectly
>>>> orthogonal, but mostly works.
>>>
>>>> Allows, say:
>>>>    ADD R4, 0x12345678, R6
>>>
>>>> To be performed in potentially 1 clock-cycle and with a 64-bit
>>>> encoding, which is better than, say:
>>>>    LUI  X8, 0x12345
>>>>    ADD  X8, X8, 0x678
>>>>    ADD  X12, X10, X8
>>>
>>> This strategy completely fails when the constant contains more than
>>> 32 bits:
>>>
>>>    FDIV R9,#3.141592653589247,R17
>>>
>>> When you have universal constants (including 5-bit immediates), you
>>> rarely need a register containing 0.
>
>> The jumbo prefixes at least allow for a 64-bit constant load, but
>> as-is not for 64-bit immediate values to 3RI ops. The latter could be
>> done, but would require 128-bit fetch and decode, which doesn't seem
>> worth it.
>
>> There is the jumbo-prefix feature of allowing for 57-bit immediate
>> values, but this is optional.
>
>> OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with
>> Zbb), or 6 instructions (without Zbb), to encode a 64-bit constant
>> inline.
>
> Which the LLVM compiler for RISC-V does not do; instead it uses an AUIPC
> and a LD to get the value from data memory within ±2GB of IP. This takes
> 3 instructions and 2 words in memory, where universal constants do this
> in 1 instruction and 2 words in the code stream.

I was mostly testing with GCC for RV64, but, yeah, it just does memory
loads.

>> Typical GCC response on RV64 seems to be to turn nearly all of the
>> big-constant cases into memory loads, which kinda sucks.
>
> This is typical when the underlying architecture is not very extensible
> to 64-bit virtual address spaces; they have to waste a portion of the
> 32-bit space to get access to all the 64-bit space. Universal constants
> make this problem vanish.

Yeah. It at least seems worthwhile to have non-suck fallback strategies.

>> Even something like a "LI Xd, Imm17s" instruction would notably
>> reduce the number of constants loaded from memory (as GCC seemingly
>> prefers to use a LHU or LW or similar rather than encode it using
>> LUI+ADD).
>
> Reduced when compared to RISC-V, but increased when compared to My 66000.
>
> My 66000, at the 99% level, uses no instructions to fetch or create
> constants, nor does it waste any register (or registers) to hold
> use-once constants.

Yeah, but the issue here is mostly with RISC-V and its lack of a constant
load; it burns lots of encoding space (on LUI and AUIPC) but still lacks
good general-purpose options.

FWIW, if they had:
   LI    Xd, Imm17s
   SHORI Xd, Imm16u   //Xd=(Xd<<16)|Imm16u

This would have still allowed a 32-bit constant in 2 ops, but would also
have allowed 64-bit in 4 ops (within the limits of fixed-length 32-bit
instructions), while also needing significantly less encoding space
(could fit both of them into the remaining space in the OP-IMM-32 block).
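To make the composition concrete, here is a rough C sketch (my own
illustration, not anything BGBCC or GCC actually emits; "reg" stands in
for Xd) of how one LI plus three SHORI steps would rebuild a full 64-bit
constant; a 32-bit constant would simply stop after the first SHORI:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t k = 0x123456789ABCDEF0ull;        /* arbitrary 64-bit constant */

    /* LI Xd, Imm17s: sign-extended 17-bit load; here the top 16 bits of k
     * (sign-extended), which always fits in a signed 17-bit immediate.    */
    int64_t  imm17 = (int64_t)k >> 48;
    uint64_t reg   = (uint64_t)imm17;          /* LI    Xd, Imm17s */

    /* SHORI Xd, Imm16u: Xd = (Xd << 16) | Imm16u, three times.            */
    reg = (reg << 16) | ((k >> 32) & 0xFFFF);  /* SHORI Xd, Imm16u */
    reg = (reg << 16) | ((k >> 16) & 0xFFFF);  /* SHORI Xd, Imm16u */
    reg = (reg << 16) | ( k        & 0xFFFF);  /* SHORI Xd, Imm16u */

    printf("%016llX\n", (unsigned long long)reg);   /* 123456789ABCDEF0 */
    return 0;
}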
>> I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or
>> S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping
>> them enabled in the CPU core (they involved the non-zero cost of
>> repacking them into Binary16 in ID1 and then throwing a
>> Binary16->Binary64 converter into the ID2 stage).
>
>> Generally, the "FLDCH Imm16, Rn" instruction works well enough here
>> (and can leverage a more generic Binary16->Binary64 converter path).
>
> Sometimes I see a::
>
>     CVTSD   R2,#5
>
> Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed
> in register R2 so it can be accessed as an argument in the subroutine
> call that happens a few instructions later.

I had looked into, say:
   FADD Rm, Imm5fp, Rn
Where, despite Imm5fp being severely limited, it had an OK hit rate.

Unpacking imm5fp to Binary16 being, essentially:
   aee.fff -> 0.aAAee.fff0000000

OTOH, can note that a majority of typical floating-point constants can be
represented exactly in Binary16 (well, excluding "0.1" or similar), so it
works OK as an immediate format. This allows a single 32-bit op to be
used for constant loads (nevermind if one needs a 96-bit encoding for
0.1, or PI, or ...).

IIRC, I originally had it in the CONV2 path, which would give it a
2-cycle latency and only allow it in Lane 1. I later migrated the logic
to the "MOV_IR" path, which also deals with "Rn=(Rn<<16)|Imm" and
similar, and currently allows 1-cycle Binary16 immediate loads in all 3
lanes.

Though, BGBCC still assumes it is Lane-1 only unless the FPU Immediate
extension is enabled (as with the other FP converters).
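As an aside, the Binary16->Binary64 widening that a FLDCH-style immediate
leans on is little more than a field remap plus an exponent re-bias,
which is part of why it stays cheap. A rough C model of such a converter
(my own sketch, not the actual hardware converter described above; the
subnormal branch is only there for completeness):

#include <stdint.h>
#include <string.h>
#include <stdio.h>

static double fp16_to_fp64(uint16_t h)
{
    uint64_t sign = (uint64_t)(h >> 15) << 63;
    uint32_t exp  = (h >> 10) & 0x1F;      /* 5-bit exponent, bias 15   */
    uint64_t frac = h & 0x3FF;             /* 10-bit fraction           */
    uint64_t bits;

    if (exp == 0x1F) {                     /* Inf / NaN                 */
        bits = sign | (0x7FFull << 52) | (frac << 42);
    } else if (exp == 0) {
        if (frac == 0) {                   /* +/- zero                  */
            bits = sign;
        } else {                           /* subnormal: renormalize    */
            int e = -14;
            while (!(frac & 0x400)) { frac <<= 1; e--; }
            frac &= 0x3FF;
            bits = sign | ((uint64_t)(e + 1023) << 52) | (frac << 42);
        }
    } else {                               /* normal: re-bias exponent  */
        bits = sign | ((uint64_t)(exp - 15 + 1023) << 52) | (frac << 42);
    }

    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

int main(void)
{
    printf("%g %g %g\n", fp16_to_fp64(0x3C00),    /* 1.0   */
                         fp16_to_fp64(0xC500),    /* -5.0  */
                         fp16_to_fp64(0x3555));   /* ~1/3  */
    return 0;
}

The normal-number case is the whole fast path: shift the 10-bit fraction
up by 42 bits and add (1023-15) to the exponent.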
> Mostly, a floating point immediate is available from a 32-bit constant
> container. When accessed in a float calculation it is used as IEEE32;
> when accessed by a double calculation, IEEE32->IEEE64 promotion is
> performed in the constant delivery path. So, one can use almost any
> floating point constant that is representable in float as a double
> without eating cycles and while saving code footprint.

Don't currently have the encoding space for this. Could in theory pull
off a truncated-Binary32 Imm29s form, but it is not likely worth it.
Would also require putting a converter in the ID2 stage, so not free.

In this case, the issue is more one of LUT cost to support these cases.

>> For FPU compare with zero, can almost leverage the integer compare
>> ops, apart from the annoying edge cases of -0.0 and NaN leading to
>> "not strictly equivalent" behavior (though, an ASM programmer could
>> more easily get away with this). But, not common enough to justify
>> adding FPU-specific ops for this.
>
> Actually, the edge/noise cases are not that many gates.
> a) once you are separating out NaNs, infinities are free !!
> b) once you are checking denorms for zero, infinities become free !!
>
> Having structured a Compare-to-zero circuit based on the fields in
> double, you can compose the terms to do all signed and unsigned
> integers and get a gate count; then the number of gates you add to
> cover all 10 cases of floating point is a 12% gate-count increase over
> the simple integer version. Also note:: this circuit is about 10% of
> the gate count of an integer adder.

I could add them, but, is it worth it?...
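For what it's worth, a small C illustration (my own, not either ISA's
actual logic) of where the integer-compare shortcut diverges from IEEE
behavior for compares against zero: -0.0 has a nonzero bit pattern yet
compares equal to 0.0, and a NaN with the sign bit set looks "negative"
to a signed integer compare:

#include <stdint.h>
#include <string.h>
#include <stdio.h>

static uint64_t bits64(double d) { uint64_t u; memcpy(&u, &d, sizeof u); return u; }

/* Integer-style tests done directly on the raw bit pattern. */
static int ieq0(double d) { return bits64(d) == 0; }           /* "== 0.0"? */
static int ilt0(double d) { return (int64_t)bits64(d) < 0; }   /* "<  0.0"? */

int main(void)
{
    double nz = -0.0, nn;
    uint64_t nnbits = 0xFFF8000000000000ull;   /* quiet NaN, sign bit set */
    memcpy(&nn, &nnbits, sizeof nn);

    /* -0.0: IEEE says equal to 0.0 and not below it; the raw pattern
     * (0x8000000000000000) says "nonzero" and "negative".               */
    printf("-0.0: IEEE ==0:%d <0:%d | int ==0:%d <0:%d\n",
           nz == 0.0, nz < 0.0, ieq0(nz), ilt0(nz));   /* 1 0 | 0 1 */

    /* -NaN: IEEE compares are all false (unordered); the signed-integer
     * view reports "negative".                                          */
    printf("-NaN: IEEE ==0:%d <0:%d | int ==0:%d <0:%d\n",
           nn == 0.0, nn < 0.0, ieq0(nn), ilt0(nn));   /* 0 0 | 0 1 */
    return 0;
}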