Path: ...!weretis.net!feeder9.news.weretis.net!i2pn.org!i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Tonights Tradeoff
Date: Thu, 12 Sep 2024 16:46:04 +0000
Organization: Rocksolid Light
Message-ID: <718895dfd5c344865453f710367501ba@www.novabbs.org>
References: <vbgdms$152jq$1@dont-email.me> <17537125c53e616e22f772e5bcd61943@www.novabbs.org> <vbj5af$1puhu$1@dont-email.me> <a37e9bd652d7674493750ccc04674759@www.novabbs.org> <vbog6d$2p2rc$1@dont-email.me> <vboqpp$2r5v4$1@dont-email.me> <vbpmqr$30vto$1@dont-email.me> <vbqcds$35l1q$2@dont-email.me> <vbs7ff$3koub$1@dont-email.me> <vbse3j$f01n$2@dont-email.me> <vbtnlj$22nu$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
	logging-data="1804848"; mail-complaints-to="usenet@i2pn2.org";
	posting-account="65wTazMNTleAJDh/pRqmKE7ADni/0wesT78+pyiDW8A";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Site: $2y$10$N.mVhod4q2CWV9qqh/S/muqH5VLxMixvkmjuq/MNIp.FnsG4AlcBG
Bytes: 3441
Lines: 52

On Thu, 12 Sep 2024 3:37:22 +0000, Robert Finch wrote:

> On 2024-09-11 11:48 a.m., Stephen Fuld wrote:
>> On 9/11/2024 6:54 AM, Robert Finch wrote:
>>
>> snip
>>
>>
>>> I have found that there can be a lot of registers available if they
>>> are implemented in BRAMs. BRAMs have lots of depth compared to LUT
>>> RAMs. BRAMs have a one cycle latency but that is just part of the
>>> pipeline. In Q+ about 40k LUTs are being used just to keep track of
>>> registers. (rename mappings and checkpoints).
>>>
>>> Given a lot of available registers I keep considering trying a VLIW
>>> design similar to the Itanium, rotating register and all. But I have a
>>> lot invested in OoO.
>>>
>>>
>>> Q+ has seven in-order pipeline stages before things get to the re-
>>> order buffer.
>>
>> Does each of these take a clock cycle?  If so, that seems excessive.
>> What is your cost for a mis-predicted branch?
>>
>>
>>
>>
> Each stage takes one clock cycle. Unconditional branches are detected at
> the second stage and taken then so they do not consume as many clocks.
> There are two extra stages to handle vector instructions. Those two
> stages could be removed if vectors are not needed.
>
> Mis-predicted branches are really expensive. They take about six clocks,
> plus the seven clocks to refill the pipeline, so it is about 13 clocks.
> Seems like it should be possible to reduce the number of clocks of
> processing during the miss, but I have not got around to it yet. There
> is a branch miss state machine that restores the checkpoint. Branches
> need a lot of work yet.

In a machine I did in 1990-2 we would fetch down the alternate path
and put the recovery instructions in a buffer, so when a branch was
mispredicted, the instructions were already present.

So, you can't help the 6 cycles of branch verification latency,
but you can fix the pipeline refill latency.
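
A rough sketch of that arithmetic, using the 6 verify and 7 refill clocks quoted above. The function names and the simplification that a buffered alternate path hides the whole refill are assumptions for illustration, not a description of either machine:

```python
def mispredict_penalty(verify_cycles, refill_cycles, alt_path_buffered):
    """Clocks lost per mispredicted branch in this toy model."""
    # Assumption: if the alternate path is already sitting in a buffer,
    # none of the front-end refill latency is exposed.
    exposed_refill = 0 if alt_path_buffered else refill_cycles
    return verify_cycles + exposed_refill


def effective_cpi(base_cpi, branch_fraction, mispredict_rate, penalty):
    """Base CPI plus the average mispredict cost per instruction."""
    return base_cpi + branch_fraction * mispredict_rate * penalty


# Numbers from the quoted post: 6 clocks to verify, 7 to refill.
without_buffer = mispredict_penalty(6, 7, alt_path_buffered=False)
with_buffer = mispredict_penalty(6, 7, alt_path_buffered=True)
```

Here without_buffer is the 13 clocks quoted above and with_buffer is the 6 clocks of verification latency that remain. The branch fraction and mispredict rate passed to effective_cpi are whatever the workload gives you; none are implied by the numbers in this thread.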

We got 2.05 i/c on XLISP SPECint 89 mostly because of the low backup
overhead.

> I am not sure how well the branch prediction works. Instruction runs in
> SIM are not long enough yet. Something in the AGEN/TLB/LSQ is not
> working correctly yet, leading to bad memory cycles.