Article <utvggu$2cgkl$1@dont-email.me>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB-Alt <bohannonindustriesllc@gmail.com>
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Tue, 26 Mar 2024 16:59:57 -0500
Organization: A noiseless patient Spider
Lines: 431
Message-ID: <utvggu$2cgkl$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
 <utsrft$1b76a$1@dont-email.me>
 <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
 <uttfk3$1j3o3$1@dont-email.me>
 <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 26 Mar 2024 21:59:59 +0100 (CET)
Injection-Info: dont-email.me; posting-host="9e0688b238126de83c4cd3272be1498b";
	logging-data="2507413"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18KKWTXyCMsqJiD8XSHFnIQxZT2Aw1I74I="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:l35LhN2DlM7ntTYKtkskbThoVJM=
In-Reply-To: <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org>
Content-Language: en-US
Bytes: 16743

On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
> BGB wrote:
> 
>> On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
>>> BGB-Alt wrote:
>>>
> 
>> Say, "we have an instruction, but it is a boat anchor" isn't an ideal 
>> situation (unless to be a placeholder for if/when it is not a boat 
>> anchor).
> 
> If the boat anchor is a required unit of functionality, and I believe
> IDIV and FPDIV is, it should be defined in ISA and if you can't afford
> it find some way to trap rapidly so you can fix it up without excessive
> overhead. Like a MIPS TLB reload. If you can't get trap and emulate at
> sufficient performance, then add the HW to perform the instruction.
> 

Though, 32-bit ARM managed OK without integer divide.

In my case, I ended up supporting it mostly for the sake of the RV64 'M' 
extension, and in this case it is a little faster than a pure software 
solution (unlike on the K10 and Piledriver).


Still costs around 1.5 kLUTs though for 64-bit MUL/DIV support, and a 
little more to route FDIV through it.

Cheapest FPU approach is still the "ADD/SUB/MUL only" route.


>>>> again. Might also make sense to add an architectural zero register, 
>>>> and eliminate some number of encodings which exist merely because of 
>>>> the lack of a zero register (though, encodings are comparably cheap, 
>>>> as the 
>>>
>>> I got an effective zero register without having to waste a register 
>>> name to "get it". My 66000 gives you 32 registers of 64-bits each and 
>>> you can put any bit pattern in any register and treat it as you like.
>>> Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
>>> available.
>>>
> 
>> I guess offloading this to the compiler can also make sense.
> 
>> Least common denominator would be, say, not providing things like NEG 
>> instructions and similar (pretending as-if one had a zero register), 
>> and if a program needs to do a NEG or similar, it can load 0 into a 
>> register itself.
> 
>> In the extreme case (say, one also lacks a designated "load immediate" 
>> instruction or similar), there is still the "XOR Rn, Rn, Rn" strategy 
>> to zero a register...
> 
>      MOV   Rd,#imm16
> 
> Cost 1 instruction of 32-bits in size and can be performed in 0 cycles
> 

Though, RV had skipped this (no dedicated load-immediate), using instead:
   ADD Xd, Zero, Imm12s
Or:
   LUI Xd, ImmHi20
   ADD Xd, Xd, ImmLo12s

One can argue for this on the basis of not needing an immediate-load 
instruction (nor a MOV instruction, nor NEG, nor ...).
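
As a rough sketch of what the LUI+ADD pair implies for whoever emits it: 
since the low 12 bits are sign-extended, the high 20 bits need to be 
bumped by one whenever bit 11 of the constant is set. The helper names 
below are hypothetical, and this only models the 32-bit case (not what 
any particular assembler does):

```python
def split_hi20_lo12(value):
    """Split a 32-bit constant into (hi20, lo12s) for a LUI + ADD pair.

    lo12s is the sign-extended low part, so hi20 absorbs the borrow
    whenever bit 11 of the constant is set.
    """
    value &= 0xFFFFFFFF
    lo = value & 0xFFF
    if lo >= 0x800:
        lo -= 0x1000              # the ADD sign-extends its 12-bit immediate
    hi = ((value - lo) >> 12) & 0xFFFFF
    return hi, lo


def rebuild(hi, lo):
    """Model LUI (hi << 12) followed by ADD of the signed low part, mod 2**32."""
    return ((hi << 12) + lo) & 0xFFFFFFFF


hi, lo = split_hi20_lo12(0x12345FFF)
print(hex(hi), lo)
```

With 0x12345678 the split is the obvious one (0x12345, 0x678); with 
0x12345FFF the low part becomes -1 and the high part becomes 0x12346.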


Though, yeah, in my case I ended up with more variety here:
   LDIZ   Imm10u, Rn  //10-bit, zero-extend, Imm12u (XG2)
   LDIN   Imm10n, Rn  //10-bit, one-extend, Imm12n (XG2)
   LDIMIZ Imm10u, Rn  //Rn=Imm10u<<16 (newish)
   LDIMIN Imm10n, Rn  //Rn=Imm10n<<16 (newish)
   LDIHI  Imm10u, Rn  //Rn=Imm10u<<22
   LDIQHI Imm10u, Rn  //Rn=Imm10u<<54

   LDIZ   Imm16u, Rn  //16-bit, zero-extend
   LDIN   Imm16n, Rn  //16-bit, one-extend

Then 64-bit jumbo forms:
   LDI    Imm33s, Rn  //33-bit, sign-extend
   LDIHI  Imm33s, Rn  //Rn=Imm33s<<16
   LDIQHI Imm33s, Rn  //Rn=Imm33s<<32

Then, 96 bit:
   LDI    Imm64, Rn  //64-bit, sign-extend

And, some special cases:
   FLDCH  Imm16u, Rn //Binary16->Binary64

One could argue though that this is wild extravagance...

The recent addition of LDIMIx was mostly because otherwise one needed a 
64-bit encoding to load constants like 262144 (0x40000, i.e. 4<<16) or 
similar (and a lot of bit-masks).
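
A minimal sketch of how a compiler might check whether a constant fits 
one of the 32-bit forms above. This assumes the semantics are exactly as 
the comments in the listing read ("one-extend" meaning all bits above 
the immediate are 1, and the shifted forms leaving the low bits zero); 
the function name and the order of checks are made up for illustration, 
and the 10-bit LDIZ/LDIN forms are left out since the 16-bit ones cover 
them:

```python
MASK64 = (1 << 64) - 1


def pick_32bit_load(value):
    """Return the name of a 32-bit load-immediate form that can produce
    `value` in a 64-bit register, or None if a jumbo form is needed."""
    v = value & MASK64
    if v < (1 << 16):
        return "LDIZ"                       # Imm16u, zero-extended
    if (v >> 16) == (1 << 48) - 1:
        return "LDIN"                       # Imm16n, one-extended
    if (v & 0xFFFF) == 0:
        if (v >> 16) < (1 << 10):
            return "LDIMIZ"                 # Imm10u << 16
        if (v >> 26) == (1 << 38) - 1:
            return "LDIMIN"                 # Imm10n << 16
    if (v & ((1 << 22) - 1)) == 0 and (v >> 22) < (1 << 10):
        return "LDIHI"                      # Imm10u << 22
    if (v & ((1 << 54) - 1)) == 0:
        return "LDIQHI"                     # Imm10u << 54
    return None


for c in (0x123, -1, 262144, 1 << 30, 1 << 63, 0x12345678):
    print(hex(c & MASK64), pick_32bit_load(c))
```

Here 262144 lands on LDIMIZ, whereas something like 0x12345678 falls 
through and would need one of the jumbo forms.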


At one point I did evaluate a more ARM32-like approach (effectively 
using a small value and a rotate). But, this cost more than the other 
options (would have required the great evil of effectively being able to 
feed two immediate values into the integer-shift unit, whereas many of 
the others could be routed through logic I already have for other ops).
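
For reference, the ARM32 data-processing immediate is an 8-bit value 
rotated right by twice a 4-bit rotate field. A small sketch of testing 
whether a constant is encodable that way (function names are 
hypothetical; this only models the encoding, not the hardware cost 
discussed above):

```python
def ror32(x, n):
    """Rotate a 32-bit value right by n bits."""
    n %= 32
    return ((x >> n) | (x << (32 - n))) & 0xFFFFFFFF


def arm32_encode_imm(value):
    """Return (imm8, rot) such that ror32(imm8, 2 * rot) == value,
    or None if the value is not encodable as an ARM32-style immediate."""
    value &= 0xFFFFFFFF
    for rot in range(16):
        # Rotating left by 2*rot undoes the rotate-right applied on decode.
        imm8 = ror32(value, (32 - 2 * rot) % 32)
        if imm8 < 256:
            return imm8, rot
    return None


print(arm32_encode_imm(0xFF000000))
print(arm32_encode_imm(0x12345678))
```

Note that both the 8-bit value and the rotate amount come from the 
instruction word, which is the "two immediate values into the shift 
unit" problem.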


Though, one can argue that the drawback is that one does end up with 
more instructions in the ISA listing.


>> Say:
>>    XOR R14, R14, R14  //Designate R14 as pseudo-zero...
>>    ...
>>    ADD R14, 0x123, R8  //Load 0x123 into R8
> 
>> Though, likely still makes sense in this case to provide some 
>> "convenience" instructions.
> 
> 
>>>> internal uArch has a zero register, and effectively treats immediate 
>>>> values as a special register as well, ...). Some of the debate is 
>>>> more related to the logic cost of dealing with some things in the 
>>>> decoder.
>>>
>>> The problem is universal constants. RISCs being notably poor in their
>>> support--however this is better than addressing modes which require
>>> µCode.
>>>
> 
>> Yeah.
> 
>> I ended up with jumbo-prefixes. Still not perfect, and not perfectly 
>> orthogonal, but mostly works.
> 
>> Allows, say:
>>    ADD R4, 0x12345678, R6
> 
>> To be performed in potentially 1 clock-cycle and with a 64-bit 
>> encoding, which is better than, say:
>>    LUI X8, 0x12345
>>    ADD X8, X8, 0x678
>>    ADD X12, X10, X8
> 
> This strategy completely fails when the constant contains more than 32-bits
> 
>      FDIV   R9,#3.141592653589247,R17
> 
> When you have universal constants (including 5-bit immediates), you rarely
> need a register containing 0.
> 

The jumbo prefixes at least allow for a 64-bit constant load but, as-is, 
not for 64-bit immediate values in 3RI ops. The latter could be done, 
but would require 128-bit fetch and decode, which doesn't seem worth it.

There is the limbo feature of allowing for 57-bit immediate values, but 
this is optional.


OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with 
Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.

Typical GCC response on RV64 seems to be to turn nearly all of the 
big-constant cases into memory loads, which kinda sucks.

Even something like a "LI Xd, Imm17s" instruction would notably reduce 
the number of constants loaded from memory (as GCC seemingly prefers to 
use a LHU or LW or similar rather than encode it using LUI+ADD).


I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or 
S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping 
them enabled in the CPU core (they involved the non-zero cost of 
repacking them into Binary16 in ID1 and then throwing a 
Binary16->Binary64 converter into the ID2 stage).
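
A sketch of what the repacking amounts to for the S.E5.F4 case, under 
the assumption (not spelled out above) that Imm10fp shares Binary16's 
sign bit and 5-bit exponent with the same bias, and that its 4 fraction 
bits are the top 4 bits of Binary16's 10-bit fraction. In that case the 
repack is just a left shift by 6. The function names are hypothetical, 
and Python's struct module stands in for the Binary16->Binary64 
converter:

```python
import struct


def imm10fp_to_binary16_bits(imm10):
    """Repack an assumed S.E5.F4 immediate into Binary16 (S.E5.F10) bits
    by padding the fraction with six zero bits."""
    if not 0 <= imm10 < (1 << 10):
        raise ValueError("immediate must fit in 10 bits")
    return imm10 << 6


def binary16_bits_to_float(bits):
    """Widen Binary16 bits to a Python float (which is Binary64)."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


for imm in (0x0F0, 0x0F8, 0x300):
    print(hex(imm), binary16_bits_to_float(imm10fp_to_binary16_bits(imm)))
```

The E3.F2 case would additionally need an exponent re-bias, which 
depends on a bias choice not given here, so it is left out.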

========== REMAINDER OF ARTICLE TRUNCATED ==========