From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Tonights Tradeoff
Date: Thu, 12 Sep 2024 16:46:04 +0000
Message-ID: <718895dfd5c344865453f710367501ba@www.novabbs.org>
References: <17537125c53e616e22f772e5bcd61943@www.novabbs.org>

On Thu, 12 Sep 2024 3:37:22 +0000, Robert Finch wrote:

> On 2024-09-11 11:48 a.m., Stephen Fuld wrote:
>> On 9/11/2024 6:54 AM, Robert Finch wrote:
>>
>> snip
>>
>>> I have found that there can be a lot of registers available if they
>>> are implemented in BRAMs. BRAMs have lots of depth compared to LUT
>>> RAMs. BRAMs have a one cycle latency but that is just part of the
>>> pipeline. In Q+ about 40k LUTs are being used just to keep track of
>>> registers. (rename mappings and checkpoints).
>>>
>>> Given a lot of available registers I keep considering trying a VLIW
>>> design similar to the Itanium, rotating register and all. But I have a
>>> lot invested in OoO.
>>>
>>> Q+ has seven in-order pipeline stages before things get to the
>>> re-order buffer.
>>
>> Does each of these take a clock cycle?  If so, that seems excessive.
>> What is your cost for a mis-predicted branch?
>>
> Each stage takes one clock cycle. Unconditional branches are detected at
> the second stage and taken then so they do not consume as many clocks.
> There are two extra stages to handle vector instructions. Those two
> stages could be removed if vectors are not needed.
>
> Mis-predicted branches are really expensive. They take about six clocks,
> plus the seven clocks to refill the pipeline, so it is about 13 clocks.
> Seems like it should be possible to reduce the number of clocks of
> processing during the miss, but I have not got around to it yet. There
> is a branch miss state machine that restores the checkpoint. Branches
> need a lot of work yet.

In a machine I did in 1990-92 we would fetch down the alternate path
and put the recovery instructions in a buffer, so when a branch was
mispredicted, the instructions were already present.

So, you can't help the 6 cycles of branch verification latency, but
you can fix the pipeline refill latency.

We got 2.05 i/c on XLISP (SPECint 89), mostly because of the low
backup overhead.

> I am not sure how well the branch prediction works. Instruction runs in
> SIM are not long enough yet. Something in the AGEN/TLB/LSQ is not
> working correctly yet, leading to bad memory cycles.
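
To put rough numbers on the misprediction cost discussed above, here is
a back-of-the-envelope sketch in Python. The 6 and 7 clock figures come
from the thread; the base CPI, branch fraction, and mispredict rate are
made-up placeholders, and the function names are hypothetical. It also
assumes the alternate-path buffer hides the entire refill, which is an
idealization rather than a measured result.

```python
def mispredict_penalty(verify_clocks, refill_clocks, alt_path_buffered):
    """Clocks lost per mispredicted branch.

    Assumption: if the alternate path was already fetched into a
    recovery buffer, the refill clocks are fully hidden and only the
    branch verification latency remains.
    """
    if alt_path_buffered:
        return verify_clocks
    return verify_clocks + refill_clocks


def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty):
    """Base CPI plus the average misprediction stall per instruction."""
    return base_cpi + branch_fraction * mispredict_rate * penalty


# Figures from the thread: ~6 clocks to verify, ~7 to refill.
VERIFY_CLOCKS = 6
REFILL_CLOCKS = 7

# Placeholder workload numbers (assumptions, not measurements).
BASE_CPI = 0.5
BRANCH_FRACTION = 0.2
MISPREDICT_RATE = 0.05

penalty_plain = mispredict_penalty(VERIFY_CLOCKS, REFILL_CLOCKS, False)
penalty_buffered = mispredict_penalty(VERIFY_CLOCKS, REFILL_CLOCKS, True)

cpi_plain = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE,
                          penalty_plain)
cpi_buffered = effective_cpi(BASE_CPI, BRANCH_FRACTION, MISPREDICT_RATE,
                             penalty_buffered)

print(f"penalty without buffer: {penalty_plain} clocks")
print(f"penalty with buffer:    {penalty_buffered} clocks")
print(f"CPI without buffer:     {cpi_plain:.3f}")
print(f"CPI with buffer:        {cpi_buffered:.3f}")
```

The gap between the two CPI figures scales with the mispredict rate, so
how much the buffer is worth depends on how well the predictor does.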