From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Sat, 31 Aug 2024 15:56:56 -0500
Organization: A noiseless patient Spider
Message-ID: <vb002r$156ge$1@dont-email.me>
References: <vajo7i$2s028$1@dont-email.me> <memo.20240827205925.19028i@jgd.cix.co.uk> <valki8$35fk2$1@dont-email.me> <2644ef96e12b369c5fce9231bfc8030d@www.novabbs.org> <vam5qo$3bb7o$1@dont-email.me> <2f1a154a34f72709b0a23ac8e750b02b@www.novabbs.org> <vaoqcf$3r1u3$1@dont-email.me> <vavgq7$12u29$1@dont-email.me>
In-Reply-To: <vavgq7$12u29$1@dont-email.me>

On 8/30/2024 7:11 PM, Paul A. Clayton wrote:
> On 8/28/24 11:36 PM, BGB wrote:
>> On 8/28/2024 11:40 AM, MitchAlsup1 wrote:
> [snip]
>>> My 1-wide machines does ENTER and EXIT at 4 registers per cycle.
>>> Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.
>>
>>
>> It likely isn't going to happen because a 1-wide machine isn't going
>> to have the needed register ports.
>
> For an in-order implementation, banking could be used for saving
> a contiguous range of registers with no bank conflicts.
>
> Mitch Alsup chose to provide four read/write ports with the
> typical use being three read, one write instructions.
> This not only facilitates faster register save/restore for function calls
> (and context switches/interrupts) but presents the opportunity of
> limited dual issue ("CoIssue").
>

I was mostly doing dual-issue with a 4R2W design. Initially, 6R3W won out mostly because 4R2W does not allow an indexed store to run in parallel with another op, whereas 6R3W does. This scenario made enough of a difference to seemingly justify the added cost of a 3-wide design with a 3rd lane that goes mostly unused (and is mostly limited to register MOVs, basic ALU ops, and similar).

But then this leads to an annoyance: as is, I need to generate different code for the 1-wide, 2-wide, and 3-wide (1W, 2W, 3W) configurations. It is becoming tempting to generate code resembling that for the 1W case (albeit still using the shuffling that would be used when bundling) and then rely on superscalar, since it turns out this is not quite as expensive as I had thought.

With superscalar, I wouldn't have the problem of 2W and 3W cores having trouble running code built for the other.

Also, on both 2W and 3W configurations, I can have a 128-bit MOV.X (load/store pair) instruction, so if one assumes 2-wide as the minimum, this instruction can safely be assumed to exist.

I can mostly ignore 1-wide scenarios (2R1W and 3R1W), as I have ended up mostly deciding to relegate these to RISC-V. By the time I have stripped down BJX2 enough to fit into a small FPGA, it has almost nothing to offer that RV wouldn't offer already (and it makes more practical sense to use something like RV32IM or similar).

I am not sure how one would efficiently pull off a 4W write operation.

Can note that generally, the GPR part of the register file can be built with LUTRAMs, which on Xilinx chips have the property:
  1R1W, 5-bit addr, 3-bit data; comb read, clock-edge write.
  1R1W, 6-bit addr, 2-bit data; comb read, clock-edge write.
This means the number of LUTRAMs needed for N read ports x M write ports with G registers can be calculated:
  2R1W, 32, Cost=44
  3R1W, 32, Cost=66
  4R2W, 32, Cost=176
  6R3W, 32, Cost=396
  4R4W, 32, Cost=352
  6R4W, 32, Cost=528

  2R1W, 64, Cost=64
  3R1W, 64, Cost=96
  4R2W, 64, Cost=256
  6R3W, 64, Cost=576
  4R4W, 64, Cost=512
  6R4W, 64, Cost=768

  10R5W, 64, Cost=1600

There is also the MUX logic and similar, but it should follow the same pattern. There is a bit-array (2b per register) to indicate which of the arrays holds each register. This ends up turning into FFs, but doesn't matter as much.

In the Verilog, one can write it as if there were only 1 array per write port, with the duplication (for the read ports) handled transparently by the synthesis stage (convenient), although it still has a steep resource cost.

I think Altera uses a different system, IIRC with 4- or 8-bit addresses and 4-bit data, where both read and write need clock edges (as with Block RAM on Xilinx).

When I experimentally tried to build for an Altera FPGA, I switched over to doing all the GPRs with FFs and state machines, as ironically this was cheaper than the logic synthesized for LUTRAMs. The core took up pretty much the whole FPGA when I told it to target a DE10 Nano (I don't actually have one, so this was a what-if). Though I do remember that, despite the very inefficient resource usage, its "Fmax" value was somewhat higher than what I am generally running at.
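Going back to the LUTRAM cost table above: the post does not state the formula, but every entry is consistent with one 1R1W array per (read port, write port) pair, each array built from enough LUTRAMs to cover a 64-bit register. The following is a minimal Python sketch under that assumption; `lutram_cost` and `LUTRAM_SHAPES` are hypothetical names, and the 64-bit register width is inferred from the numbers rather than given in the text.

```python
import math

# LUTRAM shapes as described in the post, keyed by register count:
# (address bits, data bits per LUTRAM). Assumed, not from a datasheet.
LUTRAM_SHAPES = {32: (5, 3), 64: (6, 2)}


def lutram_cost(read_ports, write_ports, num_regs, reg_bits=64):
    """Estimate LUTRAMs for a register file built from 1R1W arrays.

    Assumption: one array per (read port, write port) pair, and each
    array needs ceil(reg_bits / data_bits) LUTRAMs to cover the width.
    """
    _addr_bits, data_bits = LUTRAM_SHAPES[num_regs]
    per_array = math.ceil(reg_bits / data_bits)
    return read_ports * write_ports * per_array


if __name__ == "__main__":
    # Print the same port configurations the post lists.
    for regs in (32, 64):
        for r, w in ((2, 1), (3, 1), (4, 2), (6, 3), (4, 4), (6, 4)):
            print(f"{r}R{w}W, {regs}, Cost={lutram_cost(r, w, regs)}")
    print(f"10R5W, 64, Cost={lutram_cost(10, 5, 64)}")
```

The product of read and write ports is what makes wide configurations expensive: going from 4R2W to 6R3W adds only three ports but more than doubles the array count (8 to 18).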
For the FF-based registers, each register was a state machine something like:
  output[63:0]  regOut;
  input[63:0]   regInA;
  input[6:0]    regIdA;
  input[63:0]   regInB;
  input[6:0]    regIdB;
  input[63:0]   regInC;
  input[6:0]    regIdC;
  input[6:0]    regIdSelf;
  input         isHold;
  input         isFlush;

  reg[63:0]     regVal;
  assign regOut=regVal;

  reg           isA;
  reg           isB;
  reg           isC;
  reg           tDoUpd;
  reg[63:0]     tValUpd;

  always @*
  begin
    // Does each write port target this register?
    isA=regIdA==regIdSelf;
    isB=regIdB==regIdSelf;
    isC=regIdC==regIdSelf;
    tDoUpd=0;
    tValUpd=64'hXXXX_XXXX_XXXX_XXXX;
    // Flush suppresses the write; otherwise priority is A, then B, then C.
    casez({isFlush,isA,isB,isC})
      4'b1zzz: begin end
      4'b01zz: begin tValUpd=regInA; tDoUpd=1; end
      4'b001z: begin tValUpd=regInB; tDoUpd=1; end
      4'b0001: begin tValUpd=regInC; tDoUpd=1; end
      4'b0000: begin end
    endcase
  end

  always @(posedge clock)
  begin
    if(tDoUpd && !isHold)
    begin
      regVal <= tValUpd;
    end
  end

With each read port being a case block:
  case(regIdRs)
    JX2_GR_R2: tRegValRsA0=regValR2;
    JX2_GR_R3: tRegValRsA0=regValR3;
    ...
  case(regIdRt)
    JX2_GR_R2: tRegValRtA0=regValR2;
    JX2_GR_R3: tRegValRtA0=regValR3;
    ...
  ...

This works, but has a fairly steep per-register cost. Cost in this case seems to be dominated more by the number of read ports and the number of registers (write ports seem to be comparatively cheap in this scenario).

Then there is the forwarding logic, with a cost function mostly dependent on the product of the number of read ports and the number of pipeline EX stages (plus WB).

========== REMAINDER OF ARTICLE TRUNCATED ==========
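As a cross-check on the per-register update logic quoted above, here is a minimal behavioral sketch in Python. It only models what the Verilog fragment shows: a flush or hold suppresses the write, and otherwise the first matching write port in A, B, C order wins. The class name `RegCell` and the method `clock` are hypothetical, and this is not a cycle-accurate model of the actual core.

```python
MASK64 = 0xFFFF_FFFF_FFFF_FFFF


class RegCell:
    """Behavioral model of one FF-based register (hypothetical helper)."""

    def __init__(self, self_id, value=0):
        self.self_id = self_id      # plays the role of regIdSelf
        self.value = value & MASK64  # plays the role of regVal

    def clock(self, writes, is_hold=False, is_flush=False):
        """Apply one rising clock edge.

        `writes` is a list of (reg_id, value) pairs in priority order,
        i.e. port A first, then B, then C.
        """
        if is_flush or is_hold:
            return  # no update this cycle
        for reg_id, value in writes:
            if reg_id == self.self_id:
                self.value = value & MASK64
                return  # the first matching port wins


if __name__ == "__main__":
    cell = RegCell(self_id=2)
    # Ports A and C both target register 2; port A takes priority.
    cell.clock([(2, 10), (3, 20), (2, 30)])
    print(cell.value)
```

Each register needs its own copy of the three ID comparisons and the priority select, which fits the post's observation that cost scales with the number of registers while extra write ports stay comparatively cheap.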