From: Robert Finch <robfi680@gmail.com>
Newsgroups: comp.arch
Subject: Re: Tonights Tradeoff
Date: Thu, 12 Sep 2024 15:28:19 -0400
Organization: A noiseless patient Spider
Message-ID: <vbvfcm$d2he$1@dont-email.me>
References: <vbgdms$152jq$1@dont-email.me> <17537125c53e616e22f772e5bcd61943@www.novabbs.org> <vbj5af$1puhu$1@dont-email.me> <a37e9bd652d7674493750ccc04674759@www.novabbs.org> <vbog6d$2p2rc$1@dont-email.me> <vboqpp$2r5v4$1@dont-email.me> <vbpmqr$30vto$1@dont-email.me> <vbqcds$35l1q$2@dont-email.me> <vbs7ff$3koub$1@dont-email.me> <vbse3j$f01n$2@dont-email.me> <vbtnlj$22nu$1@dont-email.me> <718895dfd5c344865453f710367501ba@www.novabbs.org>
In-Reply-To: <718895dfd5c344865453f710367501ba@www.novabbs.org>

On 2024-09-12 12:46 p.m., MitchAlsup1 wrote:
> On Thu, 12 Sep 2024 3:37:22 +0000, Robert Finch wrote:
>
>> On 2024-09-11 11:48 a.m., Stephen Fuld wrote:
>>> On 9/11/2024 6:54 AM, Robert Finch wrote:
>>>
>>> snip
>>>
>>>> I have found that there can be a lot of registers available if they
>>>> are implemented in BRAMs. BRAMs have lots of depth compared to LUT
>>>> RAMs. BRAMs have a one-cycle latency, but that is just part of the
>>>> pipeline. In Q+ about 40k LUTs are being used just to keep track of
>>>> registers (rename mappings and checkpoints).
>>>>
>>>> Given a lot of available registers I keep considering trying a VLIW
>>>> design similar to the Itanium, rotating registers and all. But I
>>>> have a lot invested in OoO.
>>>>
>>>> Q+ has seven in-order pipeline stages before things get to the
>>>> reorder buffer.
>>>
>>> Does each of these take a clock cycle? If so, that seems excessive.
>>> What is your cost for a mis-predicted branch?
>>>
>> Each stage takes one clock cycle. Unconditional branches are detected
>> at the second stage and taken then, so they do not consume as many
>> clocks. There are two extra stages to handle vector instructions.
>> Those two stages could be removed if vectors are not needed.
>>
>> Mis-predicted branches are really expensive. They take about six
>> clocks, plus the seven clocks to refill the pipeline, so it is about
>> 13 clocks. Seems like it should be possible to reduce the number of
>> clocks of processing during the miss, but I have not got around to it
>> yet. There is a branch miss state machine that restores the
>> checkpoint. Branches need a lot of work yet.
>
> In a machine I did in 1990-2 we would fetch down the alternate path
> and put the recovery instructions in a buffer, so when a branch was
> mispredicted, the instructions were already present.
>
> So, you can't help the 6 cycles of branch verification latency,
> but you can fix the pipeline refill latency.
>
> We got 2.05 i/c on XLISP SPECint 89 mostly because of the low backup
> overhead.
>
That sounds like a good idea. The fetch unit typically idles for a few
cycles, as it can fetch more instructions than can be consumed in a
single cycle.
So, while it is idling it could be fetching down an alternate path.
Part of the pipeline would need to be replicated, doubling up on its
size. Then an A/B switch selects the right pipeline. Would not want to
queue to the reorder buffer from the alternate path, as there is a bit
of a bottleneck at the queue.

Now wondering what to do about multiple branches. Multiple pipelines
and more switches? The front end would look like a pipeline tree to
handle multiple outstanding branches.

Was also wondering what to do with the extra fetch bandwidth. Fetching
two cache lines at once means up to 21 instructions may have been
fetched, but it is only a four-wide machine. I was going to try to
feed multiple cores from the same cache: core A is performance, core B
is average, and core C is economy, using left-over bandwidth from A
and B.

I can code the alternate-path fetch up and try it in SIM, but it is
too large for my FPGA right now. Another config option. Might put the
switch before the rename stage.

Nothing like squeezing a mega-LUT design into 100k LUTs. Getting a
feel for the size of things: a two-wide in-order core would easily
fit. Even a simple two-wide out-of-order core would likely fit, if one
stuck to 32 bits and a RISC instruction set. A four-wide OoO core with
lots of features is pushing it.
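Back-of-the-envelope for the alternate-path idea, just to see what the
idle fetch slots buy. Only the 6-clock verification and 7-clock refill
latencies come from the discussion above; the fetch and issue widths
(8 and 4) and the assumption that every spare fetch slot goes to the
alternate path are made up for illustration:

#include <stdio.h>

enum { FETCH_W = 8, ISSUE_W = 4, VERIFY_LAT = 6, REFILL_LAT = 7 };

int main(void)
{
    int alt_staged = 0;                  /* instructions staged in the B path */
    for (int cyc = 0; cyc < VERIFY_LAT; cyc++)
        alt_staged += FETCH_W - ISSUE_W; /* idle slots fetch the alternate path */

    /* Each ISSUE_W staged instructions hide one cycle of refill. */
    int hidden = alt_staged / ISSUE_W;
    int refill = REFILL_LAT > hidden ? REFILL_LAT - hidden : 0;

    printf("staged %d alternate-path instructions\n", alt_staged);
    printf("miss penalty: %d clocks (vs %d with a cold refill)\n",
           VERIFY_LAT + refill, VERIFY_LAT + REFILL_LAT);
    return 0;
}

With those made-up widths it stages 24 instructions during the six
verification clocks and hides six of the seven refill clocks, so the
miss penalty drops from about 13 clocks to about 7. The verification
latency itself stays, as Mitch said.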
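And a toy sketch of the A/B/C core idea, assuming a simple
fixed-priority grant of the fetch bandwidth; the two-lines-per-cycle
budget matches the dual-line fetch above, but the demand vector and
the arbitration policy are illustrative only:

#include <stdio.h>

enum { NCORES = 3, LINES_PER_CYCLE = 2 };  /* two cache lines per cycle */

/* Fixed priority: core A claims first, B next, C scavenges the rest. */
static void arbitrate(const int demand[NCORES], int grant[NCORES])
{
    int left = LINES_PER_CYCLE;
    for (int c = 0; c < NCORES; c++) {
        grant[c] = demand[c] < left ? demand[c] : left;
        left -= grant[c];
    }
}

int main(void)
{
    int demand[NCORES] = { 1, 1, 1 };      /* each core wants one line */
    int grant[NCORES];
    arbitrate(demand, grant);
    printf("A=%d B=%d C=%d lines granted\n", grant[0], grant[1], grant[2]);
    return 0;
}

When all three cores demand a line, C gets nothing; the economy core
only makes progress on cycles where A or B is stalled, which is the
intended left-over-bandwidth behaviour.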