From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Wed, 27 Mar 2024 00:02:05 +0000
Organization: Rocksolid Light
Message-ID: <3dd12c0fe2471bf4b9fcaffaed8256ab@www.novabbs.org>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com> <utsrft$1b76a$1@dont-email.me> <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org> <uttfk3$1j3o3$1@dont-email.me> <c3c8546c4792f1aadff23fd25ef8113b@www.novabbs.org> <utvggu$2cgkl$1@dont-email.me>

BGB-Alt wrote:

> On 3/26/2024 2:16 PM, MitchAlsup1 wrote:
>> BGB wrote:
>> 
>>> I ended up with jumbo-prefixes. Still not perfect, and not perfectly 
>>> orthogonal, but mostly works.
>> 
>>> Allows, say:
>>>    ADD R4, 0x12345678, R6
>> 
>>> To be performed in potentially 1 clock-cycle and with a 64-bit 
>>> encoding, which is better than, say:
>>>    LUI X8, 0x12345
>>>    ADD X8, X8, 0x678
>>>    ADD X12, X10, X8
>> 
>> This strategy completely fails when the constant contains more than 32 bits.
>> 
>>      FDIV   R9,#3.141592653589247,R17
>> 
>> When you have universal constants (including 5-bit immediates), you rarely
>> need a register containing 0.
>> 

> The jumbo prefixes at least allow for a 64-bit constant load, but as-is 
> not for 64-bit immediate values to 3RI ops. The latter could be done, 
> but would require 128-bit fetch and decode, which doesn't seem worth it.

> There is the jumbo feature of allowing for 57-bit immediate values, but 
> this is optional.


> OTOH, on the RISC-V side, one needs a minimum of 5 instructions (with 
> Zbb), or 6 instructions (without Zbb) to encode a 64-bit constant inline.

Which the LLVM compiler for RISC-V does not do; instead it uses an AUIPC
and an LD to get the value from data memory within ±2GB of the IP. This
takes 3 instructions and 2 words in memory, whereas universal constants do
it in 1 instruction and 2 words in the code stream.
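
To make this concrete, here is a minimal C sketch of the peel-12-bits
recursion an RV64 compiler can use to build a 64-bit constant inline (a
simplified stand-in for LLVM's RISCVMatInt; it lacks the shift-folding
and Zbb tricks that get the worst case down to the 5-6 instructions
mentioned above, so it can emit up to 8; x8 is an arbitrary register):

  #include <stdio.h>
  #include <stdint.h>
  #include <inttypes.h>

  /* print a sequence that builds v in x8, return its length
     (overflow at the INT64_MAX edge is ignored in this sketch) */
  static int mat64(int64_t v)
  {
      if (v == (int32_t)v) {                     /* the LUI+ADDI base case */
          int32_t hi = (int32_t)((v + 0x800) >> 12) & 0xFFFFF;
          int32_t lo = (int32_t)((uint32_t)v << 20) >> 20; /* low 12, signed */
          int n = 0;
          if (hi != 0) {
              printf("  LUI   x8, 0x%X\n", (unsigned)hi); n++;
              if (lo != 0) { printf("  ADDIW x8, x8, %d\n", lo); n++; }
          } else {
              printf("  ADDI  x8, x0, %d\n", lo); n++;
          }
          return n;
      }
      int64_t lo12 = (int64_t)((uint64_t)v << 52) >> 52;   /* low 12, signed */
      int n = mat64((v - lo12) >> 12);           /* build the upper bits */
      printf("  SLLI  x8, x8, 12\n"); n++;
      if (lo12 != 0) { printf("  ADDI  x8, x8, %" PRId64 "\n", lo12); n++; }
      return n;
  }

  int main(void)
  {
      printf("%d instructions\n", mat64(0x123456789ABCDEF0LL));
      return 0;
  }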

> Typical GCC response on RV64 seems to be to turn nearly all of the 
> big-constant cases into memory loads, which kinda sucks.

This is typical when the underlying architecture is not very extensible 
to 64-bit virtual address spaces; they have to waste a portion of the 
32-bit space to get access to all of the 64-bit space. Universal constants
make this problem vanish.

> Even something like a "LI Xd, Imm17s" instruction, would notably reduce 
> the number of constants loaded from memory (as GCC seemingly prefers to 
> use a LHU or LW or similar rather than encode it using LUI+ADD).

Reduced when compared to RISC-V, but increased when compared to My 66000.
My 66000 (at the 99% level) uses no instructions to fetch or create 
constants, nor does it waste any register (or registers) to hold use-once
constants.

> I experimented with FPU immediate values, generally E3.F2 (Imm5fp) or 
> S.E5.F4 (Imm10fp), but the gains didn't seem enough to justify keeping 
> them enabled in the CPU core (they involved the non-zero cost of 
> repacking them into Binary16 in ID1 and then throwing a 
> Binary16->Binary64 converter into the ID2 stage).

> Generally, the "FLDCH Imm16, Rn" instruction works well enough here (and 
> can leverage a more generic Binary16->Binary64 converter path).
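
Every Binary16 value is exactly representable in Binary64, so that
converter path is pure field rearrangement plus a denormal normalize.
A minimal C sketch of such a converter (a restatement of the idea, not
BGB's actual ID-stage logic):

  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>

  static double half_to_double(uint16_t h)
  {
      uint64_t sign = (uint64_t)(h >> 15) << 63;
      int      exp  = (h >> 10) & 0x1F;
      uint64_t frac = h & 0x3FF;
      uint64_t bits;

      if (exp == 0x1F) {                   /* Inf or NaN: top exponent */
          bits = sign | (0x7FFULL << 52) | (frac << 42);
      } else if (exp == 0 && frac == 0) {  /* +-0 */
          bits = sign;
      } else {
          int e = exp;
          if (e == 0) {                    /* denormal: normalize it */
              e = 1;
              while (!(frac & 0x400)) { frac <<= 1; e--; }
              frac &= 0x3FF;               /* drop the now-explicit 1 */
          }
          bits = sign | ((uint64_t)(e - 15 + 1023) << 52) | (frac << 42);
      }
      double d;
      memcpy(&d, &bits, sizeof d);         /* bit pattern -> double */
      return d;
  }

  int main(void)
  {
      printf("%g %g %g\n", half_to_double(0x3C00),  /* 1.0 */
                           half_to_double(0xC000),  /* -2.0 */
                           half_to_double(0x7BFF)); /* 65504, max half */
      return 0;
  }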

Sometimes I see::

     CVTSD     R2,#5

Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed in 
register R2 so it can be accessed as an argument in the subroutine call
that happens a few instructions later.
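
The source pattern behind such a convert is just an integer literal in
a double argument position; a hedged sketch (subr is a stand-in name):

  /* the compiler turns the immediate into 5.0 at the call site
     instead of loading a double constant from memory */
  extern void subr(double x);

  void caller(void)
  {
      subr(5);       /* 5 -> 5.0D0 a few instructions before the call */
  }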

Mostly, a floating point immediate is available from a 32-bit constant
container. When accessed in a float calculation it is used as IEEE32;
when accessed by a double calculation, an IEEE32->IEEE64 promotion is
performed in the constant delivery path. So, one can use almost any
floating point constant that is representable in float as a double,
without eating cycles and while saving code footprint.
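
A compiler can test whether a double constant fits that 32-bit container
with a narrow-and-widen round trip; a minimal C sketch (conservative:
NaNs report as not fitting):

  #include <stdio.h>

  static int fits_ieee32(double d)
  {
      return (double)(float)d == d;              /* exact round trip? */
  }

  int main(void)
  {
      printf("0.5  : %d\n", fits_ieee32(0.5));   /* 1: exact in IEEE32 */
      printf("0.1  : %d\n", fits_ieee32(0.1));   /* 0: needs 52-bit frac */
      printf("1e-50: %d\n", fits_ieee32(1e-50)); /* 0: underflows float */
      return 0;
  }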

> For FPU compare with zero, can almost leverage the integer compare ops, 
> apart from the annoying edge cases of -0.0 and NaN leading to "not 
> strictly equivalent" behavior (though, an ASM programmer could more 
> easily get away with this). But, not common enough to justify adding FPU 
> specific ops for this.

Actually, the edge/noise cases are not that many gates.
a) once you are separating out NaNs, infinities are free !!
b) once you are checking denorms for zero, infinities become free !!

Having structured a compare-to-zero circuit based on the fields in a double,
you can compose the terms to do all signed and unsigned integers and get
a gate count; the number of gates you then add to cover all 10 cases of 
floating point is 12% over the simple integer version. Also
note:: this circuit is about 10% of the gate count of an integer adder.
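
Written out in C rather than gates, the field tests are (a software
restatement, not the circuit itself; the 10 cases being +-zero,
+-denormal, +-normal, +-infinity, and signaling/quiet NaN):

  #include <stdio.h>
  #include <stdint.h>
  #include <string.h>
  #include <math.h>

  static const char *classify(double d)
  {
      uint64_t b; memcpy(&b, &d, sizeof b);
      int      neg  = (int)(b >> 63);
      int      exp  = (int)((b >> 52) & 0x7FF);
      uint64_t frac = b & 0x000FFFFFFFFFFFFFULL;

      if (exp == 0x7FF)                  /* top exponent: Inf or NaN */
          return frac == 0 ? (neg ? "-Inf" : "+Inf")
                           : ((frac >> 51) ? "qNaN" : "sNaN");
      if (exp == 0)                      /* bottom exponent: zero/denorm */
          return frac == 0 ? (neg ? "-0" : "+0")
                           : (neg ? "-denorm" : "+denorm");
      return neg ? "-normal" : "+normal";
  }

  int main(void)
  {
      printf("%s %s %s %s\n", classify(-0.0), classify(1.0),
             classify(INFINITY), classify(NAN)); /* -0 +normal +Inf qNaN */
      return 0;
  }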

-----------------------

> Seems that generally 0 still isn't quite common enough to justify having 
> one register fewer for variables though (or to have a designated zero 
> register), but otherwise it seems there is not much to justify trying to 
> exclude the "implicit zero" ops from the ISA listing.


It is common enough, but there are lots of ways to get a zero where you
want it for a return.

>> 
>>>>> Though, would likely still make a few decisions differently from 
>>>>> those in RISC-V. Things like indexed load/store,
>>>>
>>>> Absolutely
>>>>
>>>>>                                            predicated ops (with a 
>>>>> designated flag bit), 
>>>>
>>>> Predicated then and else clauses which are branch free.
>>>> {{Also good for constant time crypto in need of flow control...}}
>>>>
>> 
>>> I have per instruction predication:
>>>    CMPxx ...
>>>    OP?T  //if-true
>>>    OP?F  //if-false
>>> Or:
>>>    OP?T | OP?F  //both in parallel, subject to encoding and ISA rules
>> 
>>      CMP  Rt,Ra,#whatever
>>      PLE  Rt,TTTTTEEE
>>      // This begins the then-clause 5Ts -> 5 instructions
>>      OP1
>>      OP2
>>      OP3
>>      OP4
>>      OP5
>>      // this begins the else-clause 3Es -> 3 instructions
>>      OP6
>>      OP7
>>      OP8
>>      // we are now back join point.
>> 
>> Notice no internal flow control instructions.
>> 

> It can be similar in my case, with the ?T / ?F encoding scheme.

Except you eat that/those bits in OpCode encoding.

> While poking at it, did go and add a check to exclude large struct-copy 
> operations from predication, as it is slower to turn a large struct copy 
> into NOPs than to branch over it.

> Did end up leaving struct-copies where sz<=64 as allowed though (where a 
> 64 byte copy at least has the merit of achieving full pipeline 
> saturation and being roughly break-even with a branch-miss, whereas a 
> 128 byte copy would cost roughly twice as much as a branch miss).
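
As a rough sketch of that break-even arithmetic (the 8-cycle miss and 8
bytes per cycle are illustrative assumptions, not measured BJX2 numbers):

  #include <stdio.h>

  int main(void)
  {
      int miss_cycles = 8, bytes_per_cycle = 8;   /* assumed costs */

      for (int sz = 16; sz <= 256; sz *= 2) {
          int dead = sz / bytes_per_cycle;        /* cycles burned as NOPs */
          printf("%3dB copy: %2d cycles predicated-false vs %d-cycle miss"
                 " -> %s\n", sz, dead, miss_cycles,
                 dead <= miss_cycles ? "predicate" : "branch");
      }
      return 0;
  }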

I decided to bite the bullet and have LDM, STM and MM so the compiler does
not have to do any analysis. This puts the onus on the memory unit designer
to process these at least as fast as a series of LDs and STs. Done right,
this saves ~40% of the power of the caches, avoiding ~70% of tag accesses
and 90% of TLB accesses. You access the tag only when/after crossing a line
boundary, and you access the TLB only after crossing a page boundary.
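
A back-of-envelope version of those savings, with 64B lines and 4KB
pages assumed for a single aligned 512-byte MM (one ideal move, so the
counts land higher than the workload-wide ~70%/90% above):

  #include <stdio.h>

  int main(void)
  {
      unsigned bytes = 512, beat = 8;          /* one MM, 8-byte beats */
      unsigned naive = bytes / beat;           /* tag+TLB probe per beat */
      unsigned tags  = bytes / 64;             /* probe per line crossed */
      unsigned tlbs  = (bytes + 4095) / 4096;  /* probe per page crossed */

      printf("naive: %u tag and %u TLB probes\n", naive, naive);
      printf("streaming: %u tag (%.0f%% saved), %u TLB (%.0f%% saved)\n",
             tags, 100.0 * (naive - tags) / naive,
             tlbs, 100.0 * (naive - tlbs) / naive);
      return 0;
  }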

>>> Performance gains are modest, but still noticeable (part of why 
>>> predication ended up as a core ISA feature). Effect on pipeline seems 
>>> to be small in its current form (it is handled along with register 
>>> fetch, mostly turning non-executed instructions into NOPs during the 
>>> EX stages).
>> 
>> The effect is that one uses Predication whenever you will have already
>> fetched instructions at the join point by the time you have determined
>> the predicate value {then, else} clauses. The PARSE and DECODE do the
>> flow control without bothering FETCH.
>> 

> Yeah, though in my pipeline, it is still a tradeoff of the relative cost 
========== REMAINDER OF ARTICLE TRUNCATED ==========