
Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Tue, 18 Feb 2025 01:50:33 -0600
Organization: A noiseless patient Spider
Lines: 316
Message-ID: <vp1e4m$1jv4i$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me>
 <2025Feb3.075550@mips.complang.tuwien.ac.at> <volg1m$31ca1$1@dont-email.me>
 <voobnc$3l2dl$1@dont-email.me>
 <0fc4cc997441e25330ff5c8735247b54@www.novabbs.org>
 <vp0m3f$1cth6$1@dont-email.me>
 <74142fbdc017bc560d75541f3f3c5118@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 18 Feb 2025 08:50:47 +0100 (CET)
Injection-Info: dont-email.me; posting-host="1fe6835acbe1e7d2aa43c1dadd73de15";
	logging-data="1703058"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/14+KqSDo3JoL4scxanR132ulCdjw7lEk="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:7N5uy4yzuZuxps3H+KVo37Eh4F0=
In-Reply-To: <74142fbdc017bc560d75541f3f3c5118@www.novabbs.org>
Content-Language: en-US
Bytes: 12860

On 2/17/2025 8:55 PM, MitchAlsup1 wrote:
> On Tue, 18 Feb 2025 1:00:18 +0000, BGB wrote:
> 
>> On 2/14/2025 3:52 PM, MitchAlsup1 wrote:
> ------------
>>> It would take LESS total man-power world-wide and over-time to
>>> simply make HW perform misaligned accesses.
>>
>>
>> I think the usual issue is that on low-end hardware, it is seen as
>> "better" to skip out on misaligned access in order to save some cost in
>> the L1 cache.
>>
>> Though, not sure how this mixes with 16/32 ISAs, given if one allows
>> misaligned 32-bit instructions, and a misaligned 32-bit instruction to
>> cross a cache-line boundary, one still has to deal with essentially the
>> same issues.
> 
> Strategy for low end processors::
> a) detect misalignment in AGEN
> b) when misaligned, AGEN takes 2 cycles for the two addresses
> c) when misaligned, DC is accessed twice
> d) When misaligned, LD align is performed twice to merge data
> 

Possibly.

I had done it at basically full speed with sets of even- and odd-addressed 
cache lines, but some mechanism to crack the Load/Store into two parts 
internally could be a different strategy.

Cracking might only be needed, though, if the misaligned access also 
crosses a cache-line boundary.
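The two-access strategy quoted above can be sketched as a behavioral model in C (this is a software illustration of the idea, not the actual hardware; the line size and function names are assumptions):

```c
#include <stdint.h>
#include <string.h>

#define LINE_SIZE 64  /* assumed cache-line size */

/* Behavioral sketch: a misaligned 32-bit load that crosses a line
   boundary is cracked into two line-relative accesses whose bytes
   are merged; an access contained in one line needs only one. */
static uint32_t load32(const uint8_t *mem, uint32_t addr)
{
    uint32_t off = addr & (LINE_SIZE - 1);
    uint8_t buf[4];
    if (off <= LINE_SIZE - 4) {
        /* fits within one line: single access */
        memcpy(buf, mem + addr, 4);
    } else {
        /* crosses a line: two accesses, merged byte-wise */
        uint32_t n1 = LINE_SIZE - off;             /* bytes in first line */
        memcpy(buf, mem + addr, n1);               /* tail of first line  */
        memcpy(buf + n1, mem + addr + n1, 4 - n1); /* head of second line */
    }
    uint32_t v;
    memcpy(&v, buf, 4);  /* little-endian host assumed */
    return v;
}
```

In hardware the two accesses would be the two DC cycles from the quoted strategy, with the byte merge done by the load-align network.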


>> Another related thing I can note is internal store-forwarding within the
>> L1 D$ to avoid RAW and WAW penalties for multiple accesses to the same
>> cache line.
> 
> IMHO:: Low end processors should not be doing ST->LD forwarding.
> 

Possibly true.

This feature adds a bit of cost, and is one of the things I ended up 
needing to turn off in attempts to boost the clock speed to 75MHz.

But, my existing core is currently a little too bulky to try pushing to 
75MHz.

Using staggered stores in prologs and memcpy significantly reduces the 
performance cost of disabling this forwarding (though disabling it still 
puts some hurt on the speed of LZ4 and RP2 decoding).
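The ST->LD forwarding being discussed can be modeled as a check against pending stores before the load reads the cache. A minimal sketch, assuming a single-entry store buffer for illustration (the struct and names are made up here):

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of ST->LD forwarding: a pending store whose address matches
   a following load supplies the data directly, avoiding a RAW stall.
   A single-entry store buffer is assumed purely for illustration. */
typedef struct {
    bool     valid;
    uint32_t addr;   /* word index of the pending store */
    uint32_t data;
} store_buf;

static uint32_t load_word(const uint32_t *ram, const store_buf *sb,
                          uint32_t word_addr, bool *forwarded)
{
    if (sb->valid && sb->addr == word_addr) {
        *forwarded = true;        /* hit in store buffer: forward */
        return sb->data;
    }
    *forwarded = false;           /* miss: read the cache/RAM array */
    return ram[word_addr];
}
```

In real hardware the comparator and mux on this path are part of what makes the feature costly at higher clock speeds, which matches the trade-off described above.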



I am left half-thinking it might make sense to try doing something 
lighter. But, would need to decide on specifics.

A full soft-reboot is unlikely. But, might make sense to design a core 
for a subset of my current design.


One possibility could be to design a 2-wide core around a subset of XG3.
   And, possibly try aiming for a 75MHz target.
   May drop to 32/64 bit instructions and 64-bit fetch.

May not try for RV64G, as some things in RV64G add too much complexity 
and would likely make a 75MHz target harder.

Some things would be TBD, like whether to stay with full 
compare-and-branch, or drop back to cheaper 
compare-with-zero-and-branch. Would likely (once again) axe some things 
that needed to be added for RV64G support (but which remain debatable in 
terms of hardware cost [1]).

1: Say, for example, 64-bit integer multiply and divide.
It being cheaper to do a 64-bit CPU but only provide a 32-bit multiplier 
(falling back to software for 64-bit multiply).
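The software fallback for 64-bit multiply on a 32-bit multiplier is the usual partial-product construction; a sketch of the kind of routine a compiler runtime might provide (the function name is illustrative):

```c
#include <stdint.h>

/* Low 64 bits of a 64x64 product built from 32x32->64 partial
   products, as a software fallback on a core whose multiplier is
   only 32 bits wide. */
static uint64_t mul64_lo(uint64_t a, uint64_t b)
{
    uint32_t al = (uint32_t)a, ah = (uint32_t)(a >> 32);
    uint32_t bl = (uint32_t)b, bh = (uint32_t)(b >> 32);

    uint64_t lo  = (uint64_t)al * bl;                    /* low partial  */
    uint64_t mid = (uint64_t)al * bh + (uint64_t)ah * bl; /* cross terms */
    return lo + (mid << 32);          /* ah*bh falls off the top 64 bits */
}
```

Three hardware 32x32 multiplies plus adds, versus a full 64-bit multiplier array, is the cost being traded here.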


XG2 is also possible, but arguably, XG3 does have a cleaner encoding 
scheme. Currently, either can be decoded in terms of the other, but 
there are some amount of special cases (and it might be cleaner to 
switch to XG3 as the native encoding scheme).



I guess another open question is if there is a way to make my Binary64 
FPU cheaper and with less timing impact. Not sure, it was already a bit 
of an exercise in corner cutting.


There is also an idle thought of trying to lengthen the pipeline enough 
to allow fully pipelined FPU ops. But, the issue is doing so cheaply 
(and without negatively affecting the cost of branch-predictor misses).

Say: PF IF ID RF E1 E2 E3 E4 E5 E6 WB

Would have steeper cost and increased branch latency.

Though, one could possibly only allow forwarding from certain stages, 
say: E2, E3, and E5

Whereas, if the result is in E1 or E4, it generates an interlock stall, 
and E6 stalls until WB completes (may or may not allow forwarding from 
WB). Though, possibly, there could be "pseudo-forwarding" from E4/E5/E6, 
where if an instruction completed in a prior stage, these stages can 
still forward the result, but no new results may "arrive" at these 
stages (dunno how much difference this would make for forwarding cost, 
could still be expensive to have this many EX stages).
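The stage-restricted forwarding rule described above can be written down as a small predicate (the enum and numbering are just a restatement of the hypothetical E1..E6 pipeline, not an actual design):

```c
#include <stdbool.h>

/* Forwarding rule sketch for the hypothetical PF..E6..WB pipeline:
   results may be forwarded from E2, E3, and E5; a result sitting in
   E1 or E4 forces an interlock stall, and E6 waits for WB. */
typedef enum { E1 = 1, E2, E3, E4, E5, E6 } ex_stage;

static bool can_forward(ex_stage result_stage)
{
    switch (result_stage) {
    case E2: case E3: case E5:
        return true;     /* forwarding paths exist from these stages */
    default:
        return false;    /* E1/E4: interlock stall; E6: wait for WB */
    }
}
```

Limiting the forwarding sources like this is what keeps the mux fan-in (and thus timing cost) down relative to forwarding from every stage.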


Dropping EX1, as-is, mostly affects the performance of Reg-Reg and 
Imm-Reg MOV (pretty much everything else of note already has a 2-cycle 
latency), but these instructions are more sensitive to latency (so, a 
2-cycle MOV is not ideal).


With 6 pipeline stages, this could be enough to allow pipelining a 
Binary64 FMUL or FADD, or a Binary32 FMAC.

But, it would mean a 13-cycle branch miss, ... And possibly also turn 
the CPU into a turd.


Another option could be keep 3 primary EX stages, but have mechanism for 
registers to be marked as "not yet available" and then to allow longer 
latency operations to finish at some later stage.

Some cores I had looked at had done this (for things like memory 
accesses, which were put into a FIFO), but this leaves the issue of how 
to best get results back into the register file (don't want to be 
handing out register-file write ports to function units, and there is a 
high probability of multiple FUs wanting to submit results at the same 
time, which would need to be dealt with).

Best option I can think of is that these FUs have a mechanism to hold 1 
or 2 values, and a mechanism exists to MUX these over a shared write 
port, generating pipeline stalls if the port gets backlogged. But, this 
seems like it would suck.
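The "not yet available" marking amounts to a scoreboard: a busy bit per register, set at issue of a long-latency op and cleared when its result reaches the register file. A minimal sketch (structure and names are illustrative, not from any particular core):

```c
#include <stdbool.h>

#define NREGS 32

/* Scoreboard sketch: each register has a busy bit set when a
   long-latency op issues with it as destination, and cleared at
   writeback.  An instruction stalls (cannot issue) while any of its
   sources, or its own destination, is still marked busy. */
typedef struct { bool busy[NREGS]; } scoreboard;

static bool can_issue(const scoreboard *sb, int rs1, int rs2, int rd)
{
    return !sb->busy[rs1] && !sb->busy[rs2] && !sb->busy[rd];
}

static void issue(scoreboard *sb, int rd)     { sb->busy[rd] = true;  }
static void writeback(scoreboard *sb, int rd) { sb->busy[rd] = false; }
```

The write-port arbitration problem described above sits on top of this: `writeback` can only fire for one FU per cycle over the shared port, with the others holding their results and back-pressuring the pipeline.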

Moving instructions along one stage at a time, and then having the final 
value appear on the pipeline (to be forwarded back to RF, or eventually 
reach WB), is cleaner and simpler.

Never mind the issue of needing to stall the pipeline whenever the L1 
cache misses or similar.

....


But, I guess the more immediate question would be more of coming up with 
something that has a decent/fast ISA, can run at 75MHz, and fits more 
easily onto an XC7S50 or similar.

Though, the most conservative option is to keep a design similar to my 
existing core, just try to strip it down a fair bit.



> ---------------------
>>
>> Say, it less convoluted to do, say:
>>    MOV.X  R24, (SP, 0)
>>    MOV.X  R26, (SP, 16)
>>    MOV.X  R28, (SP, 32)
>>    MOV.X  R30, (SP, 48)
> 
> These still look like LDs to me.
> 

My ASM notation is "OP Src, Dst".
   Which is, granted, backwards from Intel and RV notation.
========== REMAINDER OF ARTICLE TRUNCATED ==========