From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Mon, 25 Mar 2024 22:32:13 -0500
Organization: A noiseless patient Spider
Message-ID: <uttfk3$1j3o3$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
 <utsrft$1b76a$1@dont-email.me>
 <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
In-Reply-To: <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>

On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
>
>> On 3/21/2024 2:34 PM, George Musk wrote:
>>> Thought this may be interesting:
>>> https://microarch.club/
>>> https://www.youtube.com/@MicroarchClub/videos
>
>> At least sort of interesting...
>
>> I guess one of the guys on there did a manycore VLIW architecture
>> with the memory local to each of the cores. Seems like an interesting
>> approach, though not sure how well it would work on a general-purpose
>> workload. This is also closer to what I had imagined when I first
>> started working on this stuff, but it had drifted more towards a
>> slightly more conventional design.
>
>> But, admittedly, this is for small-N cores; 16/32K of L1 with a
>> shared L2 seemed like a better option than cores with a very large
>> shared L1 cache.
>
> You appear to be "starting to get it"; congratulations.
>

I had experimented with stuff before, and "big L1 caches" seemed to be
in most regards worse: hit rate goes into diminishing-return territory,
and timing isn't too happy either. At least for my workloads, 32K
seemed like the local optimum.

Say, checking hit rates (in Doom):
    8K: 67%
   16K: 78%
   32K: 85%
   64K: 87%
  128K: 88%

This being for a direct-mapped cache configuration with even/odd paired
16-byte cache lines. Other programs seem similar.

For a direct-mapped L1 cache, there is an issue with conflict misses. I
was able to add a small cache to absorb the ~1-2% of misses that were
due to conflicts, which also had the (seemingly more obvious) effect of
reducing L2 misses (from a direct-mapped L2 cache). Though, it is
likely that a set-associative L2 cache could also have addressed this
issue (but likely with a higher cost impact).

>> I am not sure that abandoning a global address space is such a great
>> idea, as a lot of the "merits" can be gained instead by using weak
>> coherence models (possibly with a shared 256K or 512K or so for each
>> group of 4 cores, at which point it goes out to a higher-latency
>> global bus). In this case, the division into independent memory
>> regions could be done in software.
>
> Most of the last 50 years has been towards a single global address
> space.
>

Yeah.

From what I can gather, the guy in the video had an architecture which
gives each CPU its own 128K and needs explicit message passing to
access anything outside of this (and fakes a global address space in
software, at a significant performance penalty).

As I see it, this does not seem like such a great idea...

Something like weak coherence can get most of the same savings, with
much less impact on how one writes code (albeit, it does mean that
mutex locking may still be painfully slow). But, this does mean it is
better to approach software in a way that neither requires TSO
semantics nor frequent mutex locking.

>> It is unclear if my approach is "sufficiently minimal".
>> There is more complexity than I would like in my ISA (and
>> effectively turning it into the common superset of both my original
>> design and RV64G doesn't really help matters here).
>
>> If going for a more minimal core optimized for perf/area, some stuff
>> might be dropped. Would likely drop integer and floating-point divide
>
> I think this is pound foolish even if penny wise.
>

The "shift and add" unit isn't free, and the relative gains are small.

For integer divide, granted, it is faster than the pure software
version in the general case. For FPU divide, Newton-Raphson is faster,
but shift-and-add can give an exact result. Most other / faster
hardware divide strategies seem to be more expensive than a
shift-and-add unit.

My confidence in hardware divide isn't too high, noting for example
that the AMD K10 and Bulldozer/15h had painfully slow divide operations
(to such a degree that doing it in software was often faster). This
implies that divide cost/performance is still not really a "solved"
issue, even if one has the resources to throw at it.

One can avoid the cost of the shift-and-add unit via "trap and
emulate", but then the performance is worse. Say, "we have an
instruction, but it is a boat anchor" isn't an ideal situation (unless
it is a placeholder for if/when it is not a boat anchor).

>> again. Might also make sense to add an architectural zero register,
>> and eliminate some number of encodings which exist merely because of
>> the lack of a zero register (though, encodings are comparably cheap,
>> as the
>
> I got an effective zero register without having to waste a register
> name to "get it". My 66000 gives you 32 registers of 64 bits each and
> you can put any bit pattern in any register and treat it as you like.
> Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
> available.
>

I guess offloading this to the compiler can also make sense.
Least common denominator would be, say, not providing things like NEG
instructions and similar (pretending as if one had a zero register);
if a program needs to do a NEG or similar, it can load 0 into a
register itself.

In the extreme case (say, if one also lacks a designated "load
immediate" instruction or similar), there is still the "XOR Rn, Rn, Rn"
strategy to zero a register... Say:

  XOR R14, R14, R14  //Designate R14 as pseudo-zero...
  ...
  ADD R14, 0x123, R8 //Load 0x123 into R8

Though, it likely still makes sense in this case to provide some
"convenience" instructions.

>> internal uArch has a zero register, and effectively treats immediate
>> values as a special register as well, ...). Some of the debate is
>> more related to the logic cost of dealing with some things in the
>> decoder.
>
> The problem is universal constants. RISCs being notably poor in their
> support--however this is better than addressing modes which require
> µCode.
>

Yeah. I ended up with jumbo prefixes. Still not perfect, and not
perfectly orthogonal, but mostly works.

Allows, say:
  ADD R4, 0x12345678, R6

to be performed in potentially one clock cycle and with a 64-bit
encoding, which is better than, say:
  LUI X8, 0x12345
  ADD X8, X8, 0x678
  ADD X12, X10, X8

Though, for jumbo prefixes, I did end up adding a special case in the
compiler where it will try to figure out whether a constant will be
used multiple times in a basic block and, if so, will load it into a
register rather than use a jumbo-prefix form.

It could maybe make sense to have function-scope statically-assigned
constants, but I have not done so yet. Though, it appears as if one of
the "top contenders" here would be 0, mostly because things like:

========== REMAINDER OF ARTICLE TRUNCATED ==========