From: BGB
Newsgroups: comp.arch
Subject: Re: Misc: BGBCC targeting RV64G, initial results...
Date: Sun, 29 Sep 2024 21:19:41 -0500
References: <1b8c005f36fd5a86532103a8fb6a9ad6@www.novabbs.org>
 <58bd95eee31b53933be111d0d941203a@www.novabbs.org>
In-Reply-To: <58bd95eee31b53933be111d0d941203a@www.novabbs.org>

On 9/29/2024 2:11 PM, MitchAlsup1 wrote:
> On Sat, 28 Sep 2024 4:30:12 +0000, BGB wrote:
>
>> On 9/27/2024 7:43 PM, MitchAlsup1 wrote:
>>> On Fri, 27 Sep 2024 23:53:22 +0000, BGB wrote:
>>>
>>> One of the reasons reservation stations became in vogue.
>>>
>>
>> Possibly, but this is a CPU feature rather than a compiler feature...
>
> A good compiler should be able to make use of 98% of the instruction
> set.

Yes, but a reservation station is not part of the ISA proper...

>>
> ------------
>>
>> Saw a video not too long ago where he was making code faster by
>> undoing a lot of loop unrolling, as the code was apparently spending
>> more time on I$ misses than it was gaining by being unrolled.
>
> I noticed this in 1991 when we got the Mc88120 simulator up and
> running. GBOoO chips are best served when there is the smallest number
> of instructions.

Looking it up, it seems the CPU in question (the MIPS R4300) had:
   16K L1 I$;
   8K L1 D$;
   No L2 cache (though one could be supported off-die);
   1-wide scalar, 32 or 64 bit;
   Non-pipelined FPU and multiplier;
   ...

Oddly, a fair number of these older CPUs seem to have a larger I$ than
D$, whereas IME the D$ tends to have the higher miss rate (making it
easier to justify the D$ being the bigger of the two).

>> ------------
>>
>> In contrast, a jumbo prefix by itself does not make sense; its
>> meaning depends on the thing that is being prefixed. Also, the
>> decoder will decode a jumbo prefix and its suffix instruction at the
>> same time.
>
> How many bits does one of these jumbo prefixes consume ?

The prefix itself is 32 bits. In the context of XG3, it supplies 23 or
27 bits; for RISC-V ops, it can supply 21 or 26 bits:
   23+10    = 33 (XG3)
   21+12    = 33 (RV op)
   27+27+10 = 64 (XG3)
   26+26+12 = 64 (RV op)

J27 can also synthesize an immediate for non-immediate ops:
   27+6 = 33 (XG3)
   27+5 = 32 (RV)

For BJX2, the prefixes supply 24 bits (which can be stretched to 27
bits in XG2):
   24+ 9/10 = 33 (Base)
   24+24+16 = 64 (Base)
   27+27+10 = 64 (XG2)

But, yeah, perhaps unsurprisingly, the RISC-V people are not so
optimistic about the idea of jumbo prefixes...

Also, apparently, "here is a prefix whose primary purpose is just to
make the immediate field bigger for the following instruction" is not
as obvious or intuitive an idea as I had thought.

Well, and people obsess over what happens if an interrupt somehow
occurs "between" the prefix and the prefixed instruction. Which, as I
have tended to implement them, is simply not possible, since both are
fetched and decoded at the same time.
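To make the "fetched and decoded at the same time" point a bit more
concrete, the combined decode can be sketched roughly as follows (a
minimal C sketch; the JUMBO_OP value, the field positions, and the
21+12 bit split are placeholders for illustration, not the actual
XG3/BJX2 or RISC-V encodings):

  /* Decode one 64-bit fetch block whose first 32-bit word may be a
   * jumbo prefix. Prefix and suffix are consumed in one step, so there
   * is no architectural point "between" them for an interrupt to hit.
   */
  #include <stdint.h>

  #define JUMBO_OP 0x1Fu  /* assumed major opcode marking a jumbo prefix */

  typedef struct {
      uint32_t op;        /* instruction word actually executed */
      int64_t  imm;       /* composed, sign-extended immediate  */
  } DecodedOp;

  DecodedOp decode_block(uint64_t fetch64)
  {
      DecodedOp d;
      uint32_t w0 = (uint32_t)(fetch64      );  /* first word  */
      uint32_t w1 = (uint32_t)(fetch64 >> 32);  /* second word */

      if ((w0 & 0x7Fu) == JUMBO_OP) {
          /* Glue the prefix's payload bits above the suffix's normal
           * immediate field: 21+12 = 33 bits, as in the RV-op case.
           */
          uint64_t pfx21 = (w0 >> 7)  & 0x1FFFFFu;  /* 21 payload bits  */
          uint64_t imm12 = (w1 >> 20) & 0xFFFu;     /* suffix's 12 bits */
          uint64_t raw   = (pfx21 << 12) | imm12;

          d.op  = w1;
          d.imm = ((int64_t)(raw << 31)) >> 31;     /* sign-extend bit 32 */
      } else {
          d.op  = w0;                               /* plain 32-bit op    */
          d.imm = (int32_t)w0 >> 20;                /* I-type style field */
      }
      return d;
  }

The prefix never exists as free-standing architectural state here;
either the whole pair decodes, or none of it does.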
Granted, yes, it does add the drawback of needing tag bits to remember
the mode, and maybe having the CPU hide mode bits in the high-order
bits of the link register and similar is not such an elegant idea.

But, as I see it, it is still preferable to: "Hey, why not just define
a bunch of 48-bit encodings for ALU operations with 32-bit immediate
fields?"...

But, like, blarg, this is what I did originally. And I dropped all of
this in favor of jumbo prefixes, because jumbo prefixes did the job
better.

Might still experiment with an "Extended RISC-V" and see if, in fact,
adding things like jumbo prefixes makes as much of a difference as I
expect. Well, probably along with indexed load/store and Zba
instructions and similar.

I guess an open question is whether a modified RISC-V variant could be
made more performance-competitive with BJX2 without making too much of
a mess of things. I could maybe do so, but probably no one would be
interested.

Though, looking online, it seems I am really the only one calling them
"jumbo prefixes". Not sure if there is another, more common term for
these things.

> -----
>>
>>
>> For the jumbo prefix:
>>    Recognize that it is a jumbo prefix;
>>    Inform the decoder for the following instruction of this fact
>>      (via internal flag bits);
>>    Provide the prefix's data bits to the corresponding decoder.
>>
>> Unlike a "real" instruction, a jumbo prefix does not need to provide
>> any behavior of its own; it merely needs to be identifiable as such
>> and to provide payload data bits.
>>
>>
>> For now, there are no encodings larger than 96 bits.
>> Partly this is because 128-bit fetch would likely add more cost and
>> complexity than it is worth at the moment.
>
> For your implementation, yes. For all others:: maybe.

Maybe. I could consider widening fetch/decode to 128 bits if there were
a compelling use case.

>>
>>
>>>>
>>>>>>
>>>>>> Likewise, no one seems to be bothering with 64-bit ELF FDPIC for
>>>>>> RV64 (there does seem to be some interest in ELF FDPIC, but
>>>>>> limited to 32-bit RISC-V ...). Ironically, ideas for doing FDPIC
>>>>>> in RV aren't too far off from PBO (namely, using GP for a global
>>>>>> section and then chaining the sections for each binary).
>>>>>
>>>>> How are you going to do dense PIC switch() {...} in RISC-V ??
>>>>
>>>> Already implemented...
>>>>
>>>> With pseudo-instructions:
>>>>     SUB Rs, $(MIN), R10
>>>>     MOV $(MAX-MIN), R11
>>>>     BGTU R11, R10, Lbl_Dfl
>>>>
>>>>     MOV   .L0, R6      //AUIPC+ADD

========== REMAINDER OF ARTICLE TRUNCATED ==========
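The quoted lowering opens with the usual range-check idiom: bias the
selector by the minimum case value, then a single unsigned compare
against (MAX-MIN) sends both below-range and above-range values to the
default label, since an out-of-range subtraction wraps to a large
unsigned value. In C terms this amounts to something like the following
(a sketch with placeholder bounds; the PC-relative table load and
indirect jump that would follow are not shown):

  #include <stdint.h>

  #define MIN 10   /* placeholder: smallest case value */
  #define MAX 42   /* placeholder: largest case value  */

  /* Returns 1 if 'sel' should branch to the default label, 0 if it
   * falls within [MIN, MAX] and can be used to index the jump table.
   * Mirrors the SUB / MOV / BGTU sequence quoted above.
   */
  static int switch_out_of_range(int64_t sel)
  {
      uint64_t idx  = (uint64_t)(sel - MIN);   /* SUB  Rs, $(MIN), R10   */
      uint64_t span = (uint64_t)(MAX - MIN);   /* MOV  $(MAX-MIN), R11   */
      return idx > span;                       /* BGTU R11, R10, Lbl_Dfl */
  }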