Article <vb002r$156ge$1@dont-email.me>

Path: ...!feeds.phibee-telecom.net!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Sat, 31 Aug 2024 15:56:56 -0500
Organization: A noiseless patient Spider
Lines: 370
Message-ID: <vb002r$156ge$1@dont-email.me>
References: <vajo7i$2s028$1@dont-email.me>
 <memo.20240827205925.19028i@jgd.cix.co.uk> <valki8$35fk2$1@dont-email.me>
 <2644ef96e12b369c5fce9231bfc8030d@www.novabbs.org>
 <vam5qo$3bb7o$1@dont-email.me>
 <2f1a154a34f72709b0a23ac8e750b02b@www.novabbs.org>
 <vaoqcf$3r1u3$1@dont-email.me> <vavgq7$12u29$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 31 Aug 2024 22:57:00 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="d511112154b30627d1940cff53b8d4ab";
	logging-data="1219086"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+oD56APsc4mUxaXzKbCOHVpBhsM0iQqpc="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:srxkVXiAhHygrqMA4+Ct+1orgYI=
Content-Language: en-US
In-Reply-To: <vavgq7$12u29$1@dont-email.me>
Bytes: 15764

On 8/30/2024 7:11 PM, Paul A. Clayton wrote:
> On 8/28/24 11:36 PM, BGB wrote:
>> On 8/28/2024 11:40 AM, MitchAlsup1 wrote:
> [snip]
>>> My 1-wide machines does ENTER and EXIT at 4 registers per cycle.
>>> Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.
>>
>>
>> It likely isn't going to happen because a 1-wide machine isn't going 
>> to have the needed register ports.
> 
> For an in-order implementation, banking could be used for saving
> a contiguous range of registers with no bank conflicts.
> 
> Mitch Alsup chose to provide four read/write ports with the
> typical use being three read, one write instructions. This not
> only facilitates faster register save/restore for function calls
> (and context switches/interrupts) but presents the opportunity of
> limited dual issue ("CoIssue").
> 

I was mostly doing dual-issue with a 4R2W design.

Initially, 6R3W won out mostly because 4R2W does not allow an indexed
store to run in parallel with another op (the store alone needs three
register reads: base, index, and the value being stored), whereas 6R3W
does. This scenario made enough of a difference to seemingly justify the
added cost of a 3-wide design with a 3rd lane that goes mostly unused
(and is mostly limited to register MOVs and basic ALU ops and similar).
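
As a rough illustration of the read-port budget behind this, here is a minimal sketch. The per-op read counts are typical values assumed for illustration (not taken from the BJX2 spec), and `can_bundle` is a hypothetical helper:

```python
# Register reads needed per op. These are typical values assumed for
# illustration, not BJX2 specifics.
READS = {
    "alu_rr": 2,         # two source registers
    "load_indexed": 2,   # base, index
    "store_indexed": 3,  # base, index, value being stored
}

def can_bundle(ops, read_ports):
    """True if the combined register reads of `ops` fit in `read_ports`."""
    return sum(READS[op] for op in ops) <= read_ports

pair = ["store_indexed", "alu_rr"]
print("4R2W:", can_bundle(pair, 4))
print("6R3W:", can_bundle(pair, 6))
```

This only counts read ports; it ignores write ports and any lane restrictions on which op may go where.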


But, then this leads to an annoyance:
As is, I will need to generate different code for 1-wide, 2-wide, and
3-wide (1W, 2W, 3W) configurations.
It is starting to become tempting to generate code resembling that for
the 1W case (albeit still using the shuffling that would be used when
bundling), and then use superscalar, since it turns out that it is not
quite as expensive as I had thought.

With superscalar, I wouldn't have the problem of 2W and 3W cores having
trouble running code built for the other.



Also, on both 2W and 3W configurations, I can have a 128-bit MOV.X 
(load/store pair) instruction, so if one assumes 2-wide as the minimum, 
this instruction can be safely assumed to exist.

I can mostly ignore 1-wide scenarios (2R1W and 3R1W), as I have ended up
mostly deciding to relegate these to RISC-V.

By the time I have stripped down BJX2 enough to fit into a small FPGA, 
it essentially has almost nothing to offer that RV wouldn't offer 
already (and it makes more practical sense to use something like RV32IM 
or similar).



I am not sure how one would efficiently pull off a 4W write operation.



Can note that generally, the GPR part of the register file can be built 
with LUTRAMs, which on Xilinx chips have the property:
   1R1W, 5-bit addr, 3-bit data; comb read, clock-edge write.
   1R1W, 6-bit addr, 2-bit data; comb read, clock-edge write.


This means the number of LUTRAMs needed for N read ports and M write
ports (NRMW) with G registers can be calculated:
   2R1W, 32, Cost=44
   3R1W, 32, Cost=66
   4R2W, 32, Cost=176
   6R3W, 32, Cost=396
   4R4W, 32, Cost=352
   6R4W, 32, Cost=528

   2R1W, 64, Cost=64
   3R1W, 64, Cost=96
   4R2W, 64, Cost=256
   6R3W, 64, Cost=576
   4R4W, 64, Cost=512
   6R4W, 64, Cost=768

   10R5W, 64, Cost=1600
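
The numbers above appear to follow a simple pattern: one 1R1W array per (read port, write port) pair, each array wide enough for a 64-bit register. A minimal sketch of that arithmetic follows; the formula is inferred from the table rather than stated outright, and `lutram_cost` is a hypothetical name:

```python
import math

REG_BITS = 64  # register width, matching the 64-bit registers in this post

# Data bits per LUTRAM for a given register count, from the two LUTRAM
# modes listed above (5-bit addr / 3-bit data, 6-bit addr / 2-bit data).
BITS_PER_LUTRAM = {32: 3, 64: 2}

def lutram_cost(reads, writes, num_regs):
    """LUTRAMs for a register file built out of 1R1W arrays.

    Assumes one array per (read port, write port) pair, each array being
    ceil(REG_BITS / bits-per-LUTRAM) LUTRAMs wide.
    """
    per_array = math.ceil(REG_BITS / BITS_PER_LUTRAM[num_regs])
    return reads * writes * per_array

for r, w in [(2, 1), (3, 1), (4, 2), (6, 3), (4, 4), (6, 4)]:
    print(f"{r}R{w}W  32: {lutram_cost(r, w, 32):4d}   64: {lutram_cost(r, w, 64):4d}")
```

The cost grows with the product of read and write ports, which is why going from 4R2W to 6R3W more than doubles it.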


There is also the MUX logic and similar, but it should follow the same pattern.

There is a bit-array (2b per register) to indicate which of the arrays 
holds each register. This ends up turning into FFs, but doesn't matter 
as much.
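
A behavioral sketch of this arrangement, in Python rather than Verilog: one storage array per write port, plus a small per-register tag recording which array was written last. The class and method names are hypothetical, and two ports writing the same register in the same cycle is not modeled:

```python
class MultiWriteRegFile:
    """One storage array per write port, plus a per-register tag saying
    which array holds the newest value (the bit-array described above)."""

    def __init__(self, num_regs=64, write_ports=3):
        self.banks = [[0] * num_regs for _ in range(write_ports)]
        self.owner = [0] * num_regs  # 2 bits per register covers up to 4 ports

    def write(self, port, reg, value):
        # Each write port only ever touches its own array.
        self.banks[port][reg] = value
        self.owner[reg] = port

    def read(self, reg):
        # A read picks the array that most recently wrote this register.
        return self.banks[self.owner[reg]][reg]

rf = MultiWriteRegFile()
rf.write(0, 5, 111)
rf.write(2, 5, 222)   # a later write to R5 through a different port
rf.write(1, 7, 333)
print(rf.read(5), rf.read(7))
```

In hardware each of these arrays is further duplicated once per read port, which is where the read-times-write cost comes from.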

In the Verilog, one can write it as-if there were only 1 array per write 
port, with the duplication (for the read ports) handled transparently by 
the synthesis stage (convenient), although it still has a steep resource 
cost.



I think Altera uses a different system, IIRC with 4 or 8 bit addresses, 
4-bit data, and read/write need clock-edges (as with Block RAM on 
Xilinx). When I tried experimentally to build for an Altera FPGA, I 
switched over to doing all the GPRs with FF's and state machines, as 
ironically this was cheaper than the code synthesized for LUTRAMs.

The core took up pretty much the whole FPGA when I told it to target a 
DE10 Nano (I don't actually have one, so this was a what if). Though, I 
do remember that (despite the very inefficient resource usage), its 
"Fmax" value was somewhat higher than I am generally running at.


Where, for FF based registers, it was a state machine something like:
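   // One FF-based register cell (one instance per register).
   // Module name and port list are illustrative.
   module RegCellFF(clock,
     regOut, regInA, regIdA, regInB, regIdB, regInC, regIdC,
     regIdSelf, isHold, isFlush);

   input        clock;
   output[63:0] regOut;     // current value of this register
   input[63:0]  regInA;     // write port A: data
   input[6:0]   regIdA;     // write port A: destination register ID
   input[63:0]  regInB;     // write port B
   input[6:0]   regIdB;
   input[63:0]  regInC;     // write port C
   input[6:0]   regIdC;
   input[6:0]   regIdSelf;  // ID of the register this cell holds
   input        isHold;     // blocks the update while set
   input        isFlush;    // suppresses the update while set

   reg[63:0] regVal;
   assign regOut=regVal;

   reg isA;
   reg isB;
   reg isC;
   reg tDoUpd;
   reg[63:0] tValUpd;

   // Select which write port (if any) targets this register.
   always @*
   begin
     isA=regIdA==regIdSelf;
     isB=regIdB==regIdSelf;
     isC=regIdC==regIdSelf;
     tDoUpd=0;
     tValUpd=64'hXXXX_XXXX_XXXX_XXXX;
     // Priority: flush first, then port A over B over C.
     casez({isFlush,isA,isB,isC})
       4'b1zzz: begin end
       4'b01zz: begin tValUpd=regInA; tDoUpd=1; end
       4'b001z: begin tValUpd=regInB; tDoUpd=1; end
       4'b0001: begin tValUpd=regInC; tDoUpd=1; end
       4'b0000: begin end
     endcase
   end

   // Latch the selected value on the clock edge unless held.
   always @(posedge clock)
   begin
     if(tDoUpd && !isHold)
     begin
       regVal <= tValUpd;
     end
   end

   endmodule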

With each read port being a case block:
   case(regIdRs)
      JX2_GR_R2: tRegValRsA0=regValR2;
      JX2_GR_R3: tRegValRsA0=regValR3;
      ...
   case(regIdRt)
      JX2_GR_R2: tRegValRtA0=regValR2;
      JX2_GR_R3: tRegValRtA0=regValR3;
      ...
   ...

This works, but has a fairly steep per-register cost.
Cost in this case seems to be more dominated by the number of read ports
and the number of registers (write ports seem to be comparatively cheap
in this scenario).

Then, there is the forwarding logic, with a cost function mostly 
dependent on the product of the number of read ports and pipeline EX 
stages (and WB).
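
To make the scaling concrete, here is a rough behavioral sketch of forwarding for a single read port, in Python. The function names and the `in_flight` list layout are assumptions for illustration; it ignores whether a result is actually available yet in a given stage (a load still in flight, for example):

```python
def forward_read(reg_id, regfile, in_flight):
    """Return the newest value for `reg_id`.

    `in_flight` lists (dest_reg, value) pairs for results still in the
    pipeline, ordered newest (first EX stage) to oldest (WB). A multi-lane
    core would have one entry per stage per lane.
    """
    for dest, value in in_flight:
        if dest == reg_id:       # one ID comparison per in-flight result
            return value
    return regfile[reg_id]       # nothing in flight: use the register file

def compare_count(read_ports, in_flight_slots):
    """ID comparators needed: every read port checks every in-flight slot."""
    return read_ports * in_flight_slots

regfile = {2: 10, 3: 20}
in_flight = [(3, 99), (3, 55), (4, 7)]   # hypothetical EX1, EX2, WB results
print(forward_read(3, regfile, in_flight))
print(forward_read(2, regfile, in_flight))
print(compare_count(6, len(in_flight)))
```

Every read port repeats the same scan, so the comparator and MUX count grows with read ports times the number of in-flight results.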

========== REMAINDER OF ARTICLE TRUNCATED ==========