From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Reservation stations [was Continuations]
Date: Sun, 21 Jul 2024 19:44:49 +0000

On Sun, 21 Jul 2024 16:28:43 +0000, EricP wrote:

> MitchAlsup1 wrote:
>> On Thu, 18 Jul 2024 0:48:18 +0000, EricP wrote:
>>
>>> MitchAlsup1 wrote:
>>>>
>>>> {Would be an interesting reservation station design, though}
>>>
>>> In what way would the RS be interesting or different?
>>
>> The instruction stream consists of 4 FMAC-bound instructions unrolled
>> as many times as will fit in register file.
>>
>> Your typical reservation station can accept 1 new instruction per cycle
>> from the decoder. So, either the decoder has to spew the instructions
>> across the stations (and remember they are data dependent) or the
>> station has to fire more than one per cycle to the FMAC units.
>>
>> So, instead of 1-in, 1-out per cycle, you need 4-in 4-out per cycle
>> and maybe some kind of exotic routing.
>
> This is where I saw a benefit to using valued reservation stations vs
> valueless ones - when a uArch has multiple similar FU each with its own
> bank of RS that is scheduled for that FU.
>
> Example of horizontal scaling of similar FU each with its own RS bank.
> https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png
>
> With valueless RS, each RS stores only the source register number of
> its operands and each FU has to be able to read all its operands
> when a uOp launches (begins execution). This means the number of
> PRF read ports scales according to the total number of FU operands.
> (One could do read port sharing but then you have to schedule that too
> and could have contention.) Also if an FU is unused on any cycle then
> all its (expensive) operand read ports are unused.

I always had RSs keep track of which FU was delivering the final operand,
so that these could be picked up by the forwarding logic and not need
a RF port. This gets rid of 50%-75% of the RF port needs.

>
> Using the above Oryon as an example, with valueless RS, to launch
> all 14 FU with 3 operands all at once needs 42 read ports.
>
> With valued RS the operand values are stored in each RS and, if ready,
> read at Dispatch (hand-off from the front end to the RS bank) or are
> received from the forwarding network if in-flight at Dispatch time.

Delivering result at dispatch time.

> The number of PRF read ports scales with the number of dispatched uOp
> operands. Since the operand values are stored in each RS, each bank
> can then schedule and launch independently.

The width of the decoder is narrower than the width of the data path.
We used to call this "catch up bandwidth".
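
To put rough numbers on the valueless case quoted above, here is a minimal Python sketch. The function names are made up for illustration, and applying the 50%-75% forwarding savings as a flat fraction of the 42-port worst case is a simplifying assumption, not a model of Oryon or any other real machine.

```python
def valueless_read_ports(num_fu: int, operands_per_fu: int) -> int:
    """Worst-case PRF read ports when every FU reads all operands at launch."""
    return num_fu * operands_per_fu


def ports_after_forwarding(ports: int, forwarded_fraction: float) -> float:
    """Ports still needed if a fraction of operands is picked up from the
    forwarding network instead of being read from the register file.
    Assumption: the savings apply uniformly to the worst-case count."""
    if not 0.0 <= forwarded_fraction <= 1.0:
        raise ValueError("forwarded_fraction must be in [0, 1]")
    return ports * (1.0 - forwarded_fraction)


if __name__ == "__main__":
    # Quoted example: 14 FU, 3 operands each.
    worst = valueless_read_ports(14, 3)
    print(worst)  # 42
    for frac in (0.50, 0.75):
        print(frac, ports_after_forwarding(worst, frac))  # 21.0, then 10.5
```

Under those assumptions the port requirement drops from 42 to somewhere between about 11 and 21, before any read port sharing is considered.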
>
> With valued RS, to Dispatch 6 wide with 3 operands needs 18 read ports,

First, a 6-wide machine is not doing six 3-operand instructions; it is
more like 3 memory ops (2 registers + displacement each), one 3-op,
one general 2-op, and one 1-op (branch), so you only need 12 ports
instead of 18 most of the time.

The penalty is that each RS entry is 5× the size of the value-free RS
designs. These work just fine when the execution window is reasonable
(say 96 instructions) but fail when the window is larger than 150-ish.

> and the read ports are potentially usable for all dispatches.
> Then all 14 FU can launch at once independently.

One should also note that these machines deliver 1-2 I/c RMS regardless
of their Fetch-Decode-FU widths.

>
> Each FU can also have two kinds of valued RS banks,
> a simple one if all the operands are ready at Dispatch as this does
> not need a wake-up matrix entry or need to receive forwarded values,
> and a complex one that monitors the wake-up matrix and forwarding buses.
> If all the operands are ready, the Dispatcher can choose either RS bank
> for the FU, giving preference to the simpler. If all operands are not
> ready then Dispatcher selects from the complex bank.
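
For anyone who wants to check the port arithmetic and the bank-selection rule above, here is a small Python sketch. All names are hypothetical, the instruction mix is just the "typical" 6-wide group described in the reply, and returning None (stall) when no suitable bank has a free entry is my assumption; the thread does not say what happens in that case.

```python
from typing import Optional, Sequence

# Register source operands per uOp for the typical 6-wide group:
# three memory ops (2 registers + displacement), one 3-op,
# one general 2-op, and one 1-op branch.
TYPICAL_MIX = (2, 2, 2, 3, 2, 1)


def dispatch_read_ports(reg_operands_per_uop: Sequence[int]) -> int:
    """PRF read ports needed to dispatch one group with valued RS."""
    return sum(reg_operands_per_uop)


def choose_bank(all_ready: bool, simple_free: bool,
                complex_free: bool) -> Optional[str]:
    """Pick an RS bank for one uOp at Dispatch.

    If all operands are ready, prefer the simple bank but allow the
    complex one. Otherwise only the complex bank (wake-up matrix and
    forwarding buses) will do. None means no entry is available, which
    this sketch treats as a stall (assumption).
    """
    if all_ready and simple_free:
        return "simple"
    if complex_free:
        return "complex"
    return None


if __name__ == "__main__":
    print(dispatch_read_ports([3] * 6))      # worst case: 18
    print(dispatch_read_ports(TYPICAL_MIX))  # typical mix: 12
    print(choose_bank(all_ready=True, simple_free=True, complex_free=True))
    print(choose_bank(all_ready=False, simple_free=True, complex_free=True))
```

The last two calls pick the simple bank for a uOp whose operands are all ready and the complex bank for one still waiting on a forwarded value.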