Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Wed, 27 Mar 2024 13:21:04 -0500
Organization: A noiseless patient Spider
Lines: 343
Message-ID: <uu1o2p$30cnr$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
 <utsrft$1b76a$1@dont-email.me>
 <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
 <uttfk3$1j3o3$1@dont-email.me>
 <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
 <utvggu$2cgkl$1@dont-email.me>
 <3dd12c0fe2471bf4b9fcaffaed8256ab@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 27 Mar 2024 18:21:14 +0100 (CET)
Injection-Info: dont-email.me; posting-host="1724f6b222b7d6e5637447713cb23601";
	logging-data="3158779"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+rG3KEYjMp0HKOhQuK3ce4plqkH/NZkN4="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:S0vCQiaILZ2kZhi23n1GrJmkUxc=
Content-Language: en-US
In-Reply-To: <3dd12c0fe2471bf4b9fcaffaed8256ab@www.novabbs.org>
Bytes: 13973

On 3/26/2024 7:02 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
> 
>> On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> I ended up with jumbo-prefixes. Still not perfect, and not perfectly 
>>>> orthogonal, but mostly works.
>>>
>>>> Allows, say:
>>>>    ADD R4, 0x12345678, R6
>>>
>>>> To be performed in potentially 1 clock-cycle and with a 64-bit 
>>>> encoding, which is better than, say:
>>>>    LUI X8, 0x12345
>>>>    ADDI X8, X8, 0x678
>>>>    ADD X12, X10, X8
>>>
>>> This strategy completely fails when the constant contains more than
>>> 32 bits
>>>
>>>      FDIV   R9,#3.141592653589247,R17
>>>
>>> When you have universal constants (including 5-bit immediates), you 
>>> rarely
>>> need a register containing 0.
>>>
> 
>> The jumbo prefixes at least allow for a 64-bit constant load, but 
>> as-is not for 64-bit immediate values to 3RI ops. The latter could be 
>> done, but would require 128-bit fetch and decode, which doesn't seem 
>> worth it.
> 
>> There is the limbo feature of allowing for 57-bit immediate values, 
>> but this is optional.
> 
> 
>> OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with 
>> Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.
> 
> Which the LLVM compiler for RISC-V does not do; instead it uses an AUIPC
> and an LD to get the value from data memory within ±2GB of IP. This takes
> 3 instructions and 2 words in memory, where universal constants do it in
> 1 instruction and 2 words in the code stream.
> 

I was mostly testing with GCC for RV64, but, yeah, it just does memory 
loads.


>> Typical GCC response on RV64 seems to be to turn nearly all of the 
>> big-constant cases into memory loads, which kinda sucks.
> 
> This is typical when the underlying architecture is not very extensible 
> to 64-bit virtual address spaces; they have to waste a portion of the 
> 32-bit space to get access to all the 64-bit space. Universal constants
> make this problem vanish.
> 

Yeah.

It at least seems worthwhile to have non-suck fallback strategies.



>> Even something like a "LI Xd, Imm17s" instruction would notably 
>> reduce the number of constants loaded from memory (as GCC seemingly 
>> prefers to use a LHU or LW or similar rather than encode it using 
>> LUI+ADD).
> 
> Reduced when compared to RISC-V, but increased when compared to My 66000.
> My 66000 (at the 99% level) uses no instructions to fetch or create
> constants, nor does it waste any register (or registers) to hold
> use-once constants.
> 

Yeah, but the issue here is mostly with RISC-V and its lack of a constant 
load.

Or, rather, it burns lots of encoding space (on LUI and AUIPC) while still 
lacking good general-purpose options.

FWIW, if they had:
   LI     Xd, Imm17s
   SHORI  Xd, Imm16u   //Xd=(Xd<<16)|Imm16u

This would still have allowed a 32-bit constant in 2 ops, but would also 
have allowed a 64-bit constant in 4 ops (within the limits of fixed-length 
32-bit instructions), while needing significantly less encoding space 
(both could fit into the remaining space in the OP-IMM-32 block).
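
As a rough C sketch (my own illustration, not anything from an actual 
compiler) of how a 64-bit constant would then split into one LI plus 
three SHORI ops, using the hypothetical mnemonics above:

   #include <stdint.h>
   #include <stdio.h>

   /* Emit the hypothetical LI/SHORI sequence for an arbitrary 64-bit
      value: LI loads the top 16 bits (which always fit in Imm17s), and
      each SHORI does Xd=(Xd<<16)|Imm16u, so 4 ops cover all 64 bits. */
   static void emit_const64(uint64_t v)
   {
       printf("   LI     Xd, 0x%04X\n", (unsigned)((v >> 48) & 0xFFFF));
       printf("   SHORI  Xd, 0x%04X\n", (unsigned)((v >> 32) & 0xFFFF));
       printf("   SHORI  Xd, 0x%04X\n", (unsigned)((v >> 16) & 0xFFFF));
       printf("   SHORI  Xd, 0x%04X\n", (unsigned)( v        & 0xFFFF));
   }

A 32-bit constant would need only the first two of those (an LI of the 
high half, then one SHORI for the low half).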


>> I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or 
>> S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping 
>> them enabled in the CPU core (they involved the non-zero cost of 
>> repacking them into Binary16 in ID1 and then throwing a 
>> Binary16->Binary64 converter into the ID2 stage).
> 
>> Generally, the "FLDCH Imm16, Rn" instruction works well enough here 
>> (and can leverage a more generic Binary16->Binary64 converter path).
> 
> Sometimes I see a::
> 
>      CVTSD     R2,#5
> 
> Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed
> in register R2 so it can be accessed as an argument in the subroutine call
> that happens a few instructions later.
> 

I had looked into, say:
   FADD Rm, Imm5fp, Rn
Where, despite Imm5fp being severely limited, it had an OK hit rate.

Unpacking imm5fp to Binary16 is, essentially:
   aee.fff -> 0.aAAee.fff0000000
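
As a rough C sketch of that repacking (going by the bit layout exactly as 
written above, with 'A' being the inverse of 'a'; purely illustrative):

   #include <stdint.h>

   /* Expand the small FP immediate into a Binary16 bit pattern:
      imm = a|ee|fff  ->  sign=0, exp5 = a,~a,~a,e,e, frac = fff0000000 */
   static uint16_t fpimm_to_binary16(uint8_t imm)
   {
       uint16_t a  = (imm >> 5) & 1;   /* top exponent bit      */
       uint16_t ee = (imm >> 3) & 3;   /* low exponent bits     */
       uint16_t f  =  imm       & 7;   /* fraction bits         */
       uint16_t A  = a ^ 1;            /* inverted copy of 'a'  */
       uint16_t exp5 = (a << 4) | (A << 3) | (A << 2) | ee;
       return (uint16_t)((exp5 << 10) | (f << 7));
   }

The a/AA pattern is just the usual exponent re-bias trick when widening 
a narrow exponent field into a larger one.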



OTOH, can note that a majority of typical floating point constants can 
be represented exactly in Binary16 (well, excluding "0.1" or similar), 
so it works OK as an immediate format.

This allows a single 32-bit op to be used for constant loads (never mind 
if one needs a 96-bit encoding for 0.1, or PI, or ...).
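
FWIW, a quick way of checking that claim for any given constant (a rough 
C sketch; this only tests exact representability for Binary16 normals, 
ignoring subnormals, infinities, and NaN):

   #include <math.h>
   #include <stdbool.h>

   /* Exactly representable in Binary16 if the exponent fits the normal
      range and no significand bits fall below the 10-bit fraction. */
   static bool fits_binary16(double x)
   {
       if (x == 0.0) return true;
       int e;
       double m = frexp(fabs(x), &e);   /* x = m * 2^e, 0.5 <= m < 1 */
       if (e < -13 || e > 16) return false;
       double s = ldexp(m, 11);         /* 11 significand bits total */
       return s == floor(s);
   }

E.g., fits_binary16(0.5), fits_binary16(3.0), and fits_binary16(0.375) 
come out true, while fits_binary16(0.1) and fits_binary16(3.14159265) 
do not.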


IIRC, I originally had it in the CONV2 path, which would give it a 
2-cycle latency and only allow it in Lane 1.

I later migrated the logic to the "MOV_IR" path which also deals with 
"Rn=(Rn<<16)|Imm" and similar, and currently allows 1-cycle Binary16 
immediate-loads in all 3 lanes.

Though, BGBCC still assumes it is Lane-1 only unless the FPU Immediate 
extension is enabled (as with the other FP converters).


> Mostly, a floating point immediate is available from a 32-bit constant
> container. When accessed in a float calculation it is used as IEEE32;
> when accessed by a double calculation, IEEE32->IEEE64 promotion is
> performed in the constant delivery path. So, one can use almost any
> floating point constant that is representable in float as a double
> without eating cycles and while saving code footprint.
> 

Don't currently have the encoding space for this.

Could in theory pull off truncated Binary32 as an Imm29s form, but not 
likely worth it. Would also require putting a converter in the ID2 
stage, so not free.

In this case, the issue is more one of LUT cost to support these cases.
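
For clarity, the truncated-Binary32 idea would be something like this 
rough C sketch (my own guess at the layout: the Imm29s field holds the 
top 29 bits of a Binary32 pattern, low 3 fraction bits dropped):

   #include <stdint.h>
   #include <string.h>

   /* Re-expand a 29-bit truncated-float immediate: restore the dropped
      low fraction bits as zero, then promote Binary32 -> Binary64. */
   static double imm29_to_double(uint32_t imm29)
   {
       uint32_t bits = (imm29 & 0x1FFFFFFFu) << 3;
       float f;
       memcpy(&f, &bits, sizeof f);
       return (double)f;
   }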


>> For FPU compare with zero, can almost leverage the integer compare 
>> ops, apart from the annoying edge cases of -0.0 and NaN leading to 
>> "not strictly equivalent" behavior (though, an ASM programmer could 
>> more easily get away with this). But, not common enough to justify 
>> adding FPU specific ops for this.
> 
> Actually, the edge/noise cases are not that many gates.
> a) once you are separating out NaNs, infinities are free !!
> b) once you are checking denorms for zero, infinities become free !!
> 
> Having structured a Compare-to-zero circuit based on the fields in double,
> you can compose the terms to do all signed and unsigned integers and get
> a gate count, then the number of gates you add to cover all 10 cases of 
> floating point is 12% gate count over the simple integer version. Also
> note:: this circuit is about 10% of the gate count of an integer adder.
> 

I could add them, but is it worth it?...
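
For reference, the two edge cases in question, as a rough C sketch over 
the raw bits (illustration only; not claiming this is how either ISA 
handles it):

   #include <stdint.h>
   #include <string.h>
   #include <stdbool.h>

   /* A plain integer compare against 0 misfires on -0.0, whose bit
      pattern is 0x8000000000000000; shifting out the sign bit fixes it. */
   static bool fp_is_zero(double x)
   {
       uint64_t b;
       memcpy(&b, &x, sizeof b);
       return (b << 1) == 0;
   }

   /* NaN: exponent all ones, fraction nonzero; must compare as
      unordered rather than as an ordinary large integer. */
   static bool fp_is_nan(double x)
   {
       uint64_t b;
       memcpy(&b, &x, sizeof b);
       return ((b >> 52) & 0x7FF) == 0x7FF && (b << 12) != 0;
   }
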
========== REMAINDER OF ARTICLE TRUNCATED ==========