From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Microarch Club
Date: Mon, 25 Mar 2024 22:32:13 -0500
Organization: A noiseless patient Spider
Message-ID: <uttfk3$1j3o3$1@dont-email.me>
References: <uti24p$28fg$1@nnrp.usenet.blueworldhosting.com>
 <utsrft$1b76a$1@dont-email.me>
 <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>
In-Reply-To: <80b47109a4c8c658ca495b97b9b10a54@www.novabbs.org>

On 3/25/2024 5:17 PM, MitchAlsup1 wrote:
> BGB-Alt wrote:
>
>> On 3/21/2024 2:34 PM, George Musk wrote:
>>> Thought this may be interesting:
>>> https://microarch.club/
>>> https://www.youtube.com/@MicroarchClub/videos
>
>> At least sort of interesting...
>
>> I guess one of the guys on there did a manycore VLIW architecture
>> with the memory local to each of the cores. Seems like an interesting
>> approach, though not sure how well it would work on a general-purpose
>> workload. This is also closer to what I had imagined when I first
>> started working on this stuff, but it had drifted more towards a
>> slightly more conventional design.
>
>> But, admittedly, this is for small-N cores; 16/32K of L1 with a
>> shared L2 seemed like a better option than cores with a very large
>> shared L1 cache.
>
> You appear to be "starting to get it"; congratulations.
>

I had experimented with stuff before, and "big L1 caches" seemed to be
in most regards worse: hit rate goes into diminishing-return territory,
and timing isn't too happy either. At least for my workloads, 32K
seemed like the local optimum.

Say, checking hit rates (in Doom):
    8K: 67%
   16K: 78%
   32K: 85%
   64K: 87%
  128K: 88%

This being for a direct-mapped cache configuration with even/odd paired
16-byte cache lines. Other programs seem similar.

For a direct-mapped L1 cache, there is an issue with conflict misses. I
was able to add a small cache to absorb the ~1-2% of misses that were
due to conflicts, which also had the (seemingly more obvious) effect of
reducing L2 misses (from a direct-mapped L2 cache). Though, it is
likely that a set-associative L2 cache could also have addressed this
issue (but likely with a higher cost impact).

>> I am not sure that abandoning a global address space is such a great
>> idea, as a lot of the "merits" can be gained instead by using weak
>> coherence models (possibly with a shared 256K or 512K or so for each
>> group of 4 cores, at which point it goes out to a higher-latency
>> global bus). In this case, the division into independent memory
>> regions could be done in software.
>
> Most of the last 50 years has been towards a single global address
> space.
>

Yeah.

From what I can gather, the guy in the video had an architecture which
gives each CPU its own 128K and needs explicit message passing to
access anything outside of this (and fakes a global address space in
software, at a significant performance penalty).

As I see it, this does not seem like such a great idea...

Something like weak coherence can get most of the same savings, with
much less impact on how one writes code (albeit, it does mean that
mutex locking may still be painfully slow). But, this does mean it is
better to approach software in a way that neither requires TSO
semantics nor frequent mutex locking.

>> It is unclear if my approach is "sufficiently minimal".
>> There is more complexity than I would like in my ISA (and
>> effectively turning it into the common superset of both my original
>> design and RV64G doesn't really help matters here).
>
>> If going for a more minimal core optimized for perf/area, some stuff
>> might be dropped. Would likely drop integer and floating-point divide
>
> I think this is pound foolish even if penny wise.
>

The "shift and add" unit isn't free, and the relative gains are small.

For integer divide, granted, it is faster than the pure software
version in the general case. For FPU divide, Newton-Raphson is faster,
but shift-and-add can give an exact result. Most other / faster
hardware divide strategies seem to be more expensive than a
shift-and-add unit.

My confidence in hardware divide isn't too high, noting for example
that the AMD K10 and Bulldozer/15h had painfully slow divide operations
(to such a degree that doing it in software was often faster). This
implies that divide cost/performance is still not really a "solved"
issue, even if one has the resources to throw at it.

One can avoid the cost of the shift-and-add unit via "trap and
emulate", but then the performance is worse. Say, "we have an
instruction, but it is a boat anchor" isn't an ideal situation (unless
it is a placeholder for if/when it is not a boat anchor).

>> again. Might also make sense to add an architectural zero register,
>> and eliminate some number of encodings which exist merely because of
>> the lack of a zero register (though, encodings are comparably cheap,
>> as the
>
> I got an effective zero register without having to waste a register
> name to "get it". My 66000 gives you 32 registers of 64 bits each and
> you can put any bit pattern in any register and treat it as you like.
> Accessing #0 takes 1/16 of a 5-bit encoding space, and is universally
> available.
>

I guess offloading this to the compiler can also make sense.
Least common denominator would be, say, not providing things like NEG
instructions and similar (pretending as if one had a zero register);
if a program needs to do a NEG or similar, it can load 0 into a
register itself.

In the extreme case (say, if one also lacks a designated "load
immediate" instruction or similar), there is still the "XOR Rn, Rn, Rn"
strategy to zero a register... Say:

  XOR R14, R14, R14  //Designate R14 as pseudo-zero...
  ...
  ADD R14, 0x123, R8 //Load 0x123 into R8

Though, it likely still makes sense in this case to provide some
"convenience" instructions.

>> internal uArch has a zero register, and effectively treats immediate
>> values as a special register as well, ...). Some of the debate is
>> more related to the logic cost of dealing with some things in the
>> decoder.
>
> The problem is universal constants. RISCs being notably poor in their
> support--however this is better than addressing modes which require
> µCode.
>

Yeah. I ended up with jumbo prefixes. Still not perfect, and not
perfectly orthogonal, but mostly works.

Allows, say:
  ADD R4, 0x12345678, R6

to be performed in potentially one clock cycle and with a 64-bit
encoding, which is better than, say:
  LUI X8, 0x12345
  ADD X8, X8, 0x678
  ADD X12, X10, X8

Though, for jumbo prefixes, I did end up adding a special case in the
compiler where it will try to figure out whether a constant will be
used multiple times in a basic block and, if so, will load it into a
register rather than use a jumbo-prefix form.

It could maybe make sense to have function-scope statically-assigned
constants, but I have not done so yet. Though, it appears as if one of
the "top contenders" here would be 0, mostly because things like:

========== REMAINDER OF ARTICLE TRUNCATED ==========