From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Arguments for a sane ISA 6-years later
Date: Mon, 29 Jul 2024 11:43:39 -0500
Organization: A noiseless patient Spider
Message-ID: <v88gru$ij11$1@dont-email.me>
References: <b5d4a172469485e9799de44f5f120c73@www.novabbs.org> <v7ubd4$2e8dr$1@dont-email.me> <v7uc71$2ec3f$1@dont-email.me> <2024Jul26.190007@mips.complang.tuwien.ac.at> <v811ub$309dk$1@dont-email.me> <2024Jul29.145933@mips.complang.tuwien.ac.at>
In-Reply-To: <2024Jul29.145933@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
User-Agent: Mozilla Thunderbird

On 7/29/2024 7:59 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> On 7/26/2024 12:00 PM, Anton Ertl wrote:
>>> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>>>> and it's more efficient
>>>
>>> That depends on the hardware.
>>>
>>> Yes, the Alpha 21164 with its imprecise exceptions was "more
>>> efficient" than other hardware for a while, then the Pentium Pro came
>>> along and gave us precise exceptions and more efficiency. And
>>> eventually the Alpha people learned the trick, too, and 21264 provided
>>> precise exceptions (although they did not admit this) and more
>>> efficiency.
>>>
>>> Similarly, I expect that hardware that is designed for good TSO or
>>> sequential consistency performance will run faster on code written for
>>> this model than code written for weakly consistent hardware will run
>>> on that hardware. That's because software written for weakly
>>> consistent hardware often has to insert barriers or atomic operations
>>> just in case, and these operations are slow on hardware optimized for
>>> weak consistency.
>>>
>>
>> TSO requires more significant hardware complexity though.
>
> An efficient implementation of TSO or sequential consistency requires
> more hardware, yes.
>
> Floating point requires more hardware than fixed point. Precise
> exceptions require more hardware than imprecise exceptions. Caches
> require more hardware than the local memory of Cell's SPEs. OoO
> requires more hardware than in-order; in this case the IA-64
> implementations demonstrated that you could then spend the area budget
> on more in-order resources (and big caches) and still fail to keep up
> on SPECint with the smaller OoO competition. In all these cases we
> decided that the benefit is worth the additional hardware. I think
> that's the case for strong memory ordering, too.
>

As noted, I needed to cut corners in a lot of areas:
  Caches are direct-mapped;
  In-order;
  Floating point is not exact;
  ...

Otherwise, stuff isn't going to fit into the FPGAs.

Something like TSO is a lot of complexity for not much gain (a rough
sketch of the software-side tradeoff is included below).

In contrast, floating point and precise exceptions are a lot more
relevant to software.

Floating point: "float" and "double" are not exactly rare, and
performing like crap isn't ideal.
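
(A quick aside to make the memory-ordering point above a bit more
concrete: below is a minimal C11 sketch of a producer/consumer handoff.
The names and structure are mine, purely for illustration. On a weakly
ordered machine, the release/acquire operations have to become barrier
or specially ordered memory instructions; under TSO the same source can
compile down to plain stores and loads, which is roughly the
software-side cost being weighed here.)

  /* Sketch only (not from the original post): single-producer /
     single-consumer handoff using C11 atomics. */
  #include <stdatomic.h>

  static int        payload;      /* data being handed off       */
  static atomic_int ready = 0;    /* flag guarding the payload   */

  void producer(int value)
  {
      payload = value;                             /* plain store */
      atomic_store_explicit(&ready, 1,
          memory_order_release);   /* needs a barrier on weak HW,
                                      a plain store under TSO     */
  }

  int consumer(void)
  {
      while (!atomic_load_explicit(&ready, memory_order_acquire))
          ;                        /* spin until the flag is set  */
      return payload;              /* guaranteed to see 'value'   */
  }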
Precise exceptions: Otherwise one can't do instruction emulation traps
or a software-managed TLB (granted, a software-managed TLB is itself a
form of corner cutting).


As noted, I had found associative caches mostly not worthwhile, so the
L1 caches ended up direct-mapped; the L2 is also direct-mapped.

Did end up adding a smaller 4-way cache (the VCA cache) between the L1
and L2 caches, which mostly keeps track of lines stored from the L1 and
fetched from the L2, and absorbs a lot of the L1 conflict misses. The
main effect it has is (seemingly) a notable reduction in the number of
L2 misses. Had experimented with 8-way, but 8-way was too expensive.
Also, it is write-through rather than write-back. This cache is 64
sets x 4-way, or ~4K.

So, say:
  L1 D$,  32K  DM  WB  VIVT
  L1 I$,  16K  DM  WB  VIVT
  VCA  ,   4K  4W  WT  PIPT
  L2   , 256K  DM  WB  PIPT

With the VCA, associativity mattered more than total size, but 32 or 64
rows still did notably better than merely having 4 or 8 cache lines
(fully associative), without that much difference in cost (the
associativity logic costs a lot more than the LUTRAM in this case).

But, unlike with purely DM caches, "cache knocking" is no longer
particularly effective (doing trickery with addresses to knock things
out of the cache only works reliably with direct-mapped caches), so
there are pros and cons here (see the small sketch at the end of this
post).

But, can note that FPGAs have relatively expensive logic and relatively
cheap SRAM (apparently the inverse of ASICs).


Though, in contrast to my initial estimates, I did manage to figure out
a way to add bank-switching for the GPRs without blowing out the
resource budget or timing. But, as for whether it would also be viable
for an ASIC core, dunno. It would require around 2kB of SRAM for the
mechanism as it exists; granted, this is smaller than the typical L1
caches.

Still TBD whether it "actually makes sense"...


>> Seems like it would be harder to debug the hardware since:
>> There is more that has to go on in the hardware for TSO to work;
>> Software will have higher expectations that it actually work.
>
> Possible. Delivering working hardware is the job of hardware
> engineers. Intel and AMD apparently have no problems getting the TSO
> parts of their architectures right. However, it seems that they don't
> go for "really efficient" TSO, or they would just upgrade the parts of
> their architecture with weaker consistency to have TSO.
>

Yeah, but for a hobbyist this will be more of an issue...

Similar likely applies to microcontrollers (if relevant), embedded
CPUs, manycore systems, or systems with high-latency links (such as
over Ethernet and TCP/IP). The cost of TSO likely isn't worth it.

Seemingly (looking at charts), ARM and POWER didn't find it worthwhile.
For RISC-V, it is an optional extension (weak ordering is the assumed
default).

To some extent, it is mostly an x86 and x86-64 thing...


> - anton
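
(The sketch referenced above, regarding "cache knocking": a minimal C
illustration of evicting a line from a direct-mapped cache by touching
an aliasing address. The buffer, sizes, and function are mine and only
illustrative, assuming the 32K direct-mapped L1 D$ described earlier.)

  #include <stdint.h>
  #include <stddef.h>

  #define L1D_SIZE  (32u * 1024u)   /* assumed 32K direct-mapped L1 D$ */

  /* Buffer twice the cache size, so buf[i] and buf[i + L1D_SIZE]
     always index the same direct-mapped cache line. */
  static volatile uint8_t buf[2u * L1D_SIZE];

  void knock(size_t i)
  {
      (void)buf[i];              /* bring the line into the L1        */
      (void)buf[i + L1D_SIZE];   /* same index, different tag: on a
                                    direct-mapped cache this evicts
                                    the first line; with 2-way or
                                    higher associativity (or the VCA
                                    cache) both lines can coexist,
                                    so the trick stops being reliable */
  }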