From: BGB-Alt
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Mon, 25 Mar 2024 16:48:43 -0500

On 3/21/2024 2:34 PM, George Musk wrote:
> Thought this may be interesting:
> https://microarch.club/
> https://www.youtube.com/@MicroarchClub/videos

At least sort of interesting...

I guess one of the guys on there did a manycore VLIW architecture with the memory local to each of the cores. Seems like an interesting approach, though not sure how well it would work on a general purpose workload. This is also closer to what I had imagined when I first started working on this stuff, but my design has since drifted towards something slightly more conventional.

But, admittedly, this is for small-N cores: 16/32K of L1 with a shared L2 seemed like a better option than cores with a very large shared L1 cache.

I am not sure that abandoning a global address space is such a great idea, as a lot of the "merits" can be gained instead by using weak coherence models (possibly with a shared 256K or 512K or so for each group of 4 cores, at which point it goes out to a higher latency global bus). In this case, the division into independent memory regions could be done in software.

It is unclear if my approach is "sufficiently minimal".
There is more complexity than I would like in my ISA (and effectively turning it into the common superset of both my original design and RV64G doesn't really help matters here).

If going for a more minimal core optimized for perf/area, some stuff might be dropped. Would likely drop integer and floating-point divide again. Might also make sense to add an architectural zero register, and eliminate some number of encodings which exist merely because of the lack of a zero register (though, encodings are comparatively cheap, as the internal uArch has a zero register, and effectively treats immediate values as a special register as well, ...). Some of the debate is more related to the logic cost of dealing with some things in the decoder.

Though, would likely still make a few decisions differently from those in RISC-V. Things like indexed load/store, predicated ops (with a designated flag bit), and large-immediate encodings help enough with performance (relative to cost) to be worth keeping (though, mostly because the alternatives are not so good in terms of performance).

Staying with 3-wide also makes sense, as going to 1-wide or 2-wide will not save much if one still wants things like FP-SIMD and unaligned memory access (as-is, all the 3rd lane can really do is basic ALU ops and similar; and the savings are negligible if one has a register file with 6 ports, which in turn is needed to fully provision the 2-wide cases, ...).

Practically, the manycore idea doesn't go very far on FPGAs, as to have "useful" cores, one can't fit more than a few cores on any of the "affordable" FPGAs...

But, maybe other people will have different ideas.

...
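
As a footnote to the zero-register point above, here is a rough sketch (in Python, purely illustrative) of what "the internal uArch has a zero register, and treats immediates as a special register" could look like. Everything here is an assumption made for the example: the register numbering, the names REG_ZERO and REG_IMM, and the choice of MOV/ADDI/NEG as sample ops are hypothetical and not taken from the actual core or ISA. The idea is only that separate ISA-level encodings can collapse onto the same internal ADD/SUB path once the decoder can name a zero source and an immediate source.

```python
# Hypothetical model of internal-only "special" registers.
# Numbering and names are illustrative assumptions, not the real design.
MASK64 = (1 << 64) - 1
NUM_GPRS = 32
REG_ZERO = 32  # internal only: always reads as 0, writes are discarded
REG_IMM = 33   # internal only: reads as the immediate decoded for this op


class RegFile:
    def __init__(self):
        self.gprs = [0] * NUM_GPRS

    def read(self, idx, imm=0):
        # Special indices are resolved at the read port, so the ALU
        # never needs to know where an operand came from.
        if idx == REG_ZERO:
            return 0
        if idx == REG_IMM:
            return imm & MASK64
        return self.gprs[idx]

    def write(self, idx, value):
        # Writes to the special indices are silently dropped.
        if idx < NUM_GPRS:
            self.gprs[idx] = value & MASK64


# Decoder side: three distinct (hypothetical) ISA encodings lowered to
# the same internal op format (op, rd, ra, rb, imm).
def decode_mov(rd, rs):
    return ("add", rd, rs, REG_ZERO, 0)   # rd = rs + 0


def decode_addi(rd, rs, imm):
    return ("add", rd, rs, REG_IMM, imm)  # rd = rs + imm


def decode_neg(rd, rs):
    return ("sub", rd, REG_ZERO, rs, 0)   # rd = 0 - rs


def execute(rf, uop):
    op, rd, ra, rb, imm = uop
    a = rf.read(ra, imm)
    b = rf.read(rb, imm)
    if op == "add":
        result = a + b
    elif op == "sub":
        result = a - b
    else:
        raise ValueError("unknown op: " + op)
    rf.write(rd, result)
```

With an architectural zero register, something like the MOV and NEG encodings in this sketch would not need to exist at the ISA level; without one, they cost encoding space and decoder logic but nothing extra in the execute path, which is the trade-off being weighed above.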