From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Misc: BGBCC targeting RV64G, initial results...
Date: Fri, 27 Sep 2024 19:40:32 +0000
Organization: Rocksolid Light
Message-ID: <1b8c005f36fd5a86532103a8fb6a9ad6@www.novabbs.org>
References: <vd5uvd$mdgn$1@dont-email.me> <vd69n0$o0aj$1@dont-email.me> <vd6tf8$r27h$1@dont-email.me>

On Fri, 27 Sep 2024 18:26:28 +0000, BGB wrote:

> On 9/27/2024 7:50 AM, Robert Finch wrote:
>> On 2024-09-27 5:46 a.m., BGB wrote:
>---------
>
> But, BJX2 does not spam the ADD instruction quite so hard, so is more
> forgiving of latency. In this case, an optimization that reduces
> common-case ADD to 1 cycle was being used (it only works though in the
> CPU core if the operands are both in signed 32-bit range and no overflow
> occurs; IIRC optionally using a sign-extended AGU output as a stopgap
> ALU output before the output arrives from the main ALU the next cycle).
>

RISC-V group opinion is that "we have done nothing to damage pipeline
operating frequency".
{{Except the moving of register specifier fields between 32-bit and
16-bit instructions; except for: AGEN-RAM-CMP-ALIGN in 2 cycles, and
several others...}}

>
>>
>>> Comparably, it appears BGBCC leans more heavily into ADD and SLLI than
>>> GCC does, with a fair chunk of the total instructions executed being
>>> these two (more cycles are spent adding and shifting than doing memory
>>> load or store...).
>>
>> That seems to be a bit off. Mem ops are usually around 1/4 of

Most agree it is closer to 30% than 25% {{Unless you clutter up the ISA
such that your typical memref needs a support instruction.}}

>> instructions. Spending more than 25% on adds and shifts seems like a
>> lot. Is it address calcs? Register loads of immediates?
>>
>
> It is both...
>
>
> In BJX2, the dominant instruction tends to be memory Load.
> Typical output from BGBCC for Doom is (at runtime):
>   ~ 70% fixed-displacement;
>   ~ 30% register-indexed.
> Static output differs slightly:
>   ~ 84% fixed-displacement;
>   ~ 16% register-indexed.
>
> RV64G lacks register-indexed addressing, only having fixed displacement.
>
> If you need to do a register-indexed load in RV64:
>   SLLI X5, Xo, 2   //shift by size of index
>   ADD  X5, Xm, X5  //add base and index
>   LW   Xn, X5, 0   //do the load
>
> This case is bad...

Which makes that 16% (above) into 48%, renormalizing to:
  ~ 63% fixed-displacement;
  ~ 36% register-indexed and support instructions.

>
>
> Also global variables outside the 2kB window:
>   LUI  X5, DispHi
>   ADDI X5, X5, DispLo
>   ADD  X5, GP, X5
>   LW   Xn, X5, 0
>
> Where, sorting global variables by usage priority gives:
>   ~ 35%: in range
>   ~ 65%: not in range

Illustrating the fallacy of 12 bits of displacement.

> Comparably, XG2 has a 16K or 32K reach here (depending on immediate
> size), which hits most of the global variables. The fallback Jumbo
> encoding hits the rest.
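
The renormalization quoted above (16% register-indexed becoming 48%, then roughly 63% / 36%) can be checked with a few lines of arithmetic. This is only a sketch: it assumes every register-indexed access from the static BJX2 mix expands to exactly three RV64G instructions (SLLI + ADD + load) and every fixed-displacement access stays at one. The function and variable names are made up for illustration.

```python
# Static addressing-mode mix quoted above for BGBCC output (percent).
fixed_disp = 84
reg_indexed = 16

# Assumption: each register-indexed access costs three RV64G instructions
# (SLLI + ADD + load); fixed-displacement accesses still cost one.
RV64_INSNS_PER_INDEXED = 3


def renormalize(fixed, indexed, expansion):
    """Return (fixed %, indexed-plus-support %) after the expansion."""
    expanded = indexed * expansion
    total = fixed + expanded
    return 100.0 * fixed / total, 100.0 * expanded / total


fixed_pct, indexed_pct = renormalize(fixed_disp, reg_indexed,
                                     RV64_INSNS_PER_INDEXED)
print(f"fixed-displacement:         {fixed_pct:.1f}%")
print(f"register-indexed + support: {indexed_pct:.1f}%")
```

Truncating the two results to whole percentages gives the ~63% and ~36% figures in the post.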

I get ±32K with 16-bit displacements

>
> Theoretically, could save 1 instruction here, but would need to add two
> more reloc types to allow for:
>   LUI, ADD, Lx
>   LUI, ADD, Sx
> Because annoyingly Load and Store have different displacement encodings;
> and I still need the base form for other cases.
>
>
> More compact way to load/store global variables would be to use absolute
> 32-bit or PC relative:
>   LUI + Lx/Sx   : Abs32
>   AUIPC + Lx/Sx : PC-Rel32

    MEM  Rd,[IP,,DISP32/64]   // IP-rel

-----
>
> Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
> (there does seem to be some interest for ELF FDPIC but limited to 32-bit
> RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
> from PBO (namely, using GP for a global section and then chaining the
> sections for each binary).

How are you going to do dense PIC switch() {...} in RISC-V ??

> Main difference being that FDPIC uses fat
> function pointers and does the GP reload on the caller, vs PBO where I
> use narrow function pointers and do the reload on the callee (with
> load-time fixups for the PBO Offset).
>
>
> The result of all this is a whole lot of unnecessary
> Shifts and ADDs.
> Seemingly, even more for BGBCC than for GCC, which already had a lot of
> shifts and adds.
>
> BGBCC basically entirely dethrones the Load and Store ops ...
>
>
> Possibly more so than GCC, which tended to turn most constant loads into
> memory loads. It would load a table of constants into a register and
> then pull constants from the table, rather than compose them inline.
>
> Say, something like:
>   AUIPC X18, X18, DispHi
>   ADD   X18, X18, DispLo
>   (X18 now holds a table of constants, pointing into .rodata)
>
> And, when it needs a constant:
>   LW Xn, X18, Disp  //offset of the constant it wants.
> Or:
>   LD Xn, X18, Disp  //64-bit constant
>
>
> Currently, BGBCC does not use this strategy.
> Though, for 64-bit constants it could be more compact and faster.
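
The DispHi / DispLo pairs in the quoted sequences depend on how an offset is split across a 20-bit upper immediate (LUI / AUIPC) and a sign-extended 12-bit lower immediate (ADDI or a load/store displacement). The sketch below shows that split. The helper names `split_hi_lo` and `fits_imm12` are hypothetical and are not taken from BGBCC or any toolchain. Because the low part is signed, the high part has to be rounded up whenever bit 11 of the offset is set.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def fits_imm12(offset):
    """True if offset is reachable with a single 12-bit signed displacement."""
    return -2048 <= offset <= 2047


def split_hi_lo(offset):
    """Split offset into (hi20, lo12) so that offset == (hi20 << 12) + lo12.

    lo12 is the signed 12-bit part; hi20 absorbs the rounding that the
    signed low part requires. Raises ValueError when hi20 does not fit a
    signed 20-bit field (this sketch does not model wrap-around).
    """
    lo12 = sign_extend(offset, 12)
    hi20 = (offset - lo12) >> 12
    if not -(1 << 19) <= hi20 < (1 << 19):
        raise ValueError("offset not reachable with a 20-bit upper immediate")
    return hi20, lo12


for off in (2047, 2048, 0x12345FFF, -5000):
    hi, lo = split_hi_lo(off)
    print(f"offset={off:#x} in_12bit_range={fits_imm12(off)} hi20={hi:#x} lo12={lo}")
```

An offset of 2048 already falls outside the single-instruction window, which is the "2kB window" limit the quoted text refers to.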
>
> But, better still would be having Jumbo prefixes or similar, or even a
> SHORI instruction.

Better Still Still is having 32-bit and 64-bit constants available
from the instruction stream and positioned in either operand position.

> Say, 64-bit constant-load in SH-5 or similar:
>   xxxxyyyyzzzzwwww
>   MOV   ImmX, Rn
>   SHORI ImmY, Rn
>   SHORI ImmZ, Rn
>   SHORI ImmW, Rn
> Where, one loads the constant in 16-bit chunks.

Yech

>>
>>

Don't you ever snip anything ??
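
For reference, the 16-bit-chunk constant load quoted above can be modeled as below. This is a sketch only. It assumes SHORI shifts the destination left by 16 and ORs in an unsigned 16-bit immediate, and it treats the initial MOV as loading the top chunk. After three 16-bit shifts in a 64-bit register, any sign extension from that MOV is shifted out. The function names are made up for illustration.

```python
MASK64 = (1 << 64) - 1


def shori(reg, imm16):
    """Assumed SHORI semantics: Rn = (Rn << 16) | imm16, kept to 64 bits."""
    return ((reg << 16) | (imm16 & 0xFFFF)) & MASK64


def load_const64(value):
    """Build a 64-bit constant from four 16-bit chunks, highest chunk first."""
    chunks = [(value >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]
    reg = chunks[0]            # stands in for "MOV ImmX, Rn"
    for chunk in chunks[1:]:   # three SHORI steps: ImmY, ImmZ, ImmW
        reg = shori(reg, chunk)
    return reg


value = 0x0123456789ABCDEF
assert load_const64(value) == value
print(hex(load_const64(value)))
```

That is four instructions for one 64-bit constant, which is the cost both posters are comparing against inline 32-bit or 64-bit immediates.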