Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB-Alt <bohannonindustriesllc@gmail.com>
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Mon, 25 Mar 2024 16:48:43 -0500
Organization: A noiseless patient Spider
Lines: 61
Message-ID: <utsrft$1b76a$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 25 Mar 2024 22:48:45 +0100 (CET)
Injection-Info: dont-email.me; posting-host="45f9dcc3d866b8e585773ea070097f8e";
	logging-data="1416394"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18bz0HGcxFm+WzFOMew2Ad6wyHg+AT3KWo="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:WvR0yJHqPMenHiLJKKSUlikYxt0=
In-Reply-To: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
Content-Language: en-US
Bytes: 3915

On 3/21/2024 2:34 PM, George Musk wrote:
> Thought this may be interesting:
> https://microarch.club/
> https://www.youtube.com/@MicroarchClub/videos

At least sort of interesting...

I guess one of the guys on there did a manycore VLIW architecture with 
the memory local to each of the cores. Seems like an interesting 
approach, though not sure how well it would work on a general purpose 
workload. This is also closer to what I had imagined when I first 
started working on this stuff, but my own design has since drifted 
towards something slightly more conventional.


But, admittedly, this is for a small number of cores, where 16K or 32K 
of L1 each, with a shared L2, seemed like a better option than cores 
with a very large shared L1 cache.

I am not sure that abandoning a global address space is such a great 
idea, as a lot of the "merits" can be gained instead by using weak 
coherence models (possibly with a shared 256K or 512K or so for each 
group of 4 cores, at which point it goes out to a higher latency global 
bus). In this case, the division into independent memory regions could 
be done in software.
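
As a rough sketch of what "division in software" could look like: keep 
one global address space, but have software assign each group of 4 cores 
a home region by the high address bits. The region size and all names 
here are made up for illustration and are not from any actual design.

```python
# Hypothetical software partitioning of a single global address space.
CORES_PER_GROUP = 4   # from the post: groups of 4 cores share a cache
REGION_BITS = 28      # assumption: 256 MB home region per group

def group_of(core_id: int) -> int:
    """Group that a core belongs to."""
    return core_id // CORES_PER_GROUP

def region_base(group: int) -> int:
    """First address of a group's home region."""
    return group << REGION_BITS

def home_group(addr: int) -> int:
    """Group whose home region contains addr (high bits select it)."""
    return addr >> REGION_BITS

def is_local(core_id: int, addr: int) -> bool:
    """True if the access stays inside the core's own group;
    anything else would go out over the higher-latency global bus."""
    return home_group(addr) == group_of(core_id)
```

Any core can still reach any address; the split only decides which 
accesses stay local, so it is a policy choice rather than a hard wall.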

It is unclear if my approach is "sufficiently minimal". There is more 
complexity than I would like in my ISA (and effectively turning it into 
the common superset of both my original design and RV64G doesn't really 
help matters here).

If going for a more minimal core optimized for perf/area, some stuff 
might be dropped. Would likely drop integer and floating-point divide 
again. Might also make sense to add an architectural zero register, and 
eliminate some number of encodings which exist merely because of the 
lack of a zero register (though, encodings are comparatively cheap, as the 
internal uArch has a zero register, and effectively treats immediate 
values as a special register as well, ...). Some of the debate is more 
related to the logic cost of dealing with some things in the decoder.
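
For reference, this is how RISC-V uses its zero register to avoid 
dedicated encodings: several common operations are just assembler 
pseudo-ops over base instructions with x0 as an operand. The sketch 
below only models that rewriting (the expand() helper is hypothetical) 
and says nothing about the encodings in my own ISA.

```python
ZERO = "x0"  # RISC-V's hardwired zero register

def expand(op, *args):
    """Rewrite a RISC-V pseudo-op as the base instruction it stands for."""
    if op == "li":      # li rd, imm   -> addi rd, x0, imm (12-bit signed imm only)
        rd, imm = args
        return ("addi", rd, ZERO, imm)
    if op == "neg":     # neg rd, rs   -> sub rd, x0, rs
        rd, rs = args
        return ("sub", rd, ZERO, rs)
    if op == "snez":    # snez rd, rs  -> sltu rd, x0, rs
        rd, rs = args
        return ("sltu", rd, ZERO, rs)
    if op == "beqz":    # beqz rs, lbl -> beq rs, x0, lbl
        rs, label = args
        return ("beq", rs, ZERO, label)
    raise ValueError(f"unknown pseudo-op: {op}")
```

Without an architectural zero register, each of these needs either its 
own encoding or an extra instruction to materialize the zero.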

Though, would likely still make a few decisions differently from those 
in RISC-V. Things like indexed load/store, predicated ops (with a 
designated flag bit), and large-immediate encodings, help enough with 
performance (relative to cost) to be worth keeping (though, mostly 
because the alternatives are not so good in terms of performance).
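
To make the indexed load/store case concrete, here is a small sketch 
comparing the instruction sequences for loading a 64-bit array element 
a[i]. The RV64G sequence is the usual one without the Zba extension; the 
"ld.x" mnemonic is a made-up placeholder for a scaled-index load, not a 
real instruction from either ISA.

```python
def load_elem_rv64g(rd, base, idx):
    """Base RV64G (no Zba): scale the index, add the base, then load."""
    return [
        f"slli t0, {idx}, 3",     # t0 = idx * 8 (64-bit elements)
        f"add t0, {base}, t0",    # t0 = base + idx * 8
        f"ld {rd}, 0(t0)",        # rd = mem[t0]
    ]

def load_elem_indexed(rd, base, idx):
    """Hypothetical indexed load that scales idx by the element size."""
    return [f"ld.x {rd}, ({base}, {idx})"]
```

That is three instructions (and a scratch register) versus one, for an 
access pattern that shows up in nearly every array loop.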

Staying with 3-wide also makes sense, as going to 1-wide or 2-wide will 
not save much if one still wants things like FP-SIMD and unaligned 
memory access (as-is, all the 3rd lane can really do is basic ALU 
ops and similar; and the savings are negligible if one already has a 
register file with 6 ports, which in turn is needed to fully provision 
stuff for the 2-wide cases, ...).
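
One possible reading of the port arithmetic, stated as an assumption 
rather than a description of the actual core: if "6 ports" means read 
ports, and a fully provisioned 2-wide bundle allows up to 3 source 
registers per op (say, base, index and value for an indexed store), then 
the same ports already cover a 3-wide bundle of 2-source ALU ops.

```python
def read_ports(lanes: int, sources_per_lane: int) -> int:
    """Read ports needed if every lane can read all its sources at once."""
    return lanes * sources_per_lane

# Assumed operand counts, for illustration only.
two_wide_full = read_ports(lanes=2, sources_per_lane=3)   # 3-source ops
three_wide_alu = read_ports(lanes=3, sources_per_lane=2)  # 2-source ALU ops
```

Under those assumptions both cases come out to 6, which is why dropping 
the 3rd lane would not shrink the register file.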

Practically, the manycore idea doesn't go very far on FPGAs: if the 
cores are to be "useful", one can't fit more than a few of them on any 
of the "affordable" FPGAs...

But, maybe other people will have different ideas.

....