Article <1b8c005f36fd5a86532103a8fb6a9ad6@www.novabbs.org>

Path: ...!weretis.net!feeder9.news.weretis.net!i2pn.org!i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Misc: BGBCC targeting RV64G, initial results...
Date: Fri, 27 Sep 2024 19:40:32 +0000
Organization: Rocksolid Light
Message-ID: <1b8c005f36fd5a86532103a8fb6a9ad6@www.novabbs.org>
References: <vd5uvd$mdgn$1@dont-email.me> <vd69n0$o0aj$1@dont-email.me> <vd6tf8$r27h$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
	logging-data="3761310"; mail-complaints-to="usenet@i2pn2.org";
	posting-account="65wTazMNTleAJDh/pRqmKE7ADni/0wesT78+pyiDW8A";
User-Agent: Rocksolid Light
X-Spam-Checker-Version: SpamAssassin 4.0.0
X-Rslight-Site: $2y$10$Xte/ORLphpZ32VcQojVv3eISG0rKazTfXYGn9GuLaovhf5in8onJi
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
Bytes: 6360
Lines: 154

On Fri, 27 Sep 2024 18:26:28 +0000, BGB wrote:

> On 9/27/2024 7:50 AM, Robert Finch wrote:
>> On 2024-09-27 5:46 a.m., BGB wrote:
>---------
>
> But, BJX2 does not spam the ADD instruction quite so hard, so is more
> forgiving of latency. In this case, an optimization that reduces
> common-case ADD to 1 cycle was being used (it only works though in the
> CPU core if the operands are both in signed 32-bit range and no overflow
> occurs; IIRC optionally using a sign-extended AGU output as a stopgap
> ALU output before the output arrives from the main ALU the next cycle).
>
The RISC-V group's opinion is that "we have done nothing to damage pipeline
operating frequency". {{Except the moving of register specifier fields
between 32-bit and 16-bit instructions; except for: AGEN-RAM-CMP-ALIGN
in 2 cycles, and several others...}}
>
>>
>>> Comparably, it appears BGBCC leans more heavily into ADD and SLLI than
>>> GCC does, with a fair chunk of the total instructions executed being
>>> these two (more cycles are spent adding and shifting than doing memory
>>> load or store...).
>>
>> That seems to be a bit off. Mem ops are usually around 1/4 of

Most agree it is closer to 30% than 25% {{unless you clutter up the ISA
such that your typical memref needs a support instruction}}.

>> instructions. Spending more than 25% on adds and shifts seems like a
>> lot. Is it address calcs? Register loads of immediates?
>>
>
> It is both...
>
>
> In BJX2, the dominant instruction tends to be memory Load.
>    Typical output from BGBCC for Doom is (at runtime):
>      ~ 70% fixed-displacement;
>      ~ 30% register-indexed.
>    Static output differs slightly:
>      ~ 84% fixed-displacement;
>      ~ 16% register-indexed.
>
> RV64G lacks register-indexed addressing, only having fixed displacement.
>
> If you need to do a register-indexed load in RV64:
>    SLLI  X5, Xo, 2  //shift by size of index
>    ADD X5, Xm, X5  //add base and index
>    LW  Xn, X5, 0   //do the load
>
> This case is bad...

Which turns that 16% (above) into 48% (3 instructions per indexed access),
renormalizing to::
       ~ 64% fixed-displacement;
       ~ 36% register-indexed and support instructions.
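
The renormalization can be checked with a few lines of arithmetic. This is only a sketch: it assumes every fixed-displacement access costs one instruction and every register-indexed access costs three (SLLI + ADD + LW, as in the quoted sequence), and the function name `renormalize` is made up for illustration.

```python
def renormalize(fixed_pct, indexed_pct, insns_per_indexed=3):
    """Return (fixed, indexed) shares of instructions, in percent.

    fixed_pct / indexed_pct: share of memory accesses of each kind.
    insns_per_indexed: assumed instruction count per indexed access
    (3 = SLLI + ADD + LW when there is no indexed addressing mode).
    """
    fixed_insns = fixed_pct * 1
    indexed_insns = indexed_pct * insns_per_indexed
    total = fixed_insns + indexed_insns
    return 100.0 * fixed_insns / total, 100.0 * indexed_insns / total

# Static split quoted for BGBCC output: 84% fixed, 16% indexed.
fixed_share, indexed_share = renormalize(84, 16)
print(f"fixed-displacement: {fixed_share:.1f}%")
print(f"indexed + support:  {indexed_share:.1f}%")
```

With the static 84/16 split, 16 indexed accesses become 48 instructions out of 132, i.e. a bit over a third of the load-related instruction count.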
>
>
> Also global variables outside the 2kB window:
>    LUI   X5, DispHi
>    ADDI  X5, X5, DispLo
>    ADD   X5, GP, X5
>    LW    Xn, X5, 0
>
> Where, sorting global variables by usage priority gives:
>    ~ 35%: in range
>    ~ 65%: not in range

Illustrating the fallacy of 12 bits of displacement.
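
To make the range limit concrete, here is a small sketch of why a global past the ±2 KiB reach needs the LUI + ADDI pair, and how the displacement is usually split. It assumes the standard RISC-V convention that the low 12 bits are sign-extended, so the high part is rounded rather than truncated. The helper names are hypothetical, and the sketch ignores the RV64 corner case where a displacement within 2 KiB of 2^31 makes LUI's result sign-extend negative.

```python
def fits_simm12(disp):
    """True if disp fits the signed 12-bit load/store displacement."""
    return -2048 <= disp <= 2047

def split_hi_lo(disp):
    """Split disp into (hi20, lo12) for a LUI + ADDI pair.

    ADDI sign-extends its 12-bit immediate, so hi20 is rounded to the
    nearest multiple of 4 KiB and lo12 lands in [-2048, 2047].
    """
    hi20 = (disp + 0x800) >> 12
    lo12 = disp - (hi20 << 12)
    return hi20, lo12

def rebuild(hi20, lo12):
    """What LUI hi20 followed by ADDI lo12 computes."""
    return (hi20 << 12) + lo12

for disp in (100, 2047, 2048, 0x12345, 0x12FFF):
    if fits_simm12(disp):
        print(f"{disp:#x}: single LW off GP")
    else:
        hi, lo = split_hi_lo(disp)
        print(f"{disp:#x}: LUI {hi:#x}; ADDI {lo}; ADD GP; LW")
```

Note that `0x12FFF` splits into `hi20 = 0x13`, `lo12 = -1`, not `0x12` and `0xFFF`; that rounding is what the `+ 0x800` is for.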

> Comparably, XG2 has a 16K or 32K reach here (depending on immediate
> size), which hits most of the global variables. The fallback Jumbo
> encoding hits the rest.

I get ±32K with 16-bit displacements

>
> Theoretically, could save 1 instruction here, but would need to add two
> more reloc types to allow for:
>    LUI, ADD, Lx
>    LUI, ADD, Sx
> Because annoyingly Load and Store have different displacement encodings;
> and I still need the base form for other cases.
>
>
> More compact way to load/store global variables would be to use absolute
> 32-bit or PC relative:
>    LUI + Lx/Sx : Abs32
>    AUIPC + Lx/Sx : PC-Rel32

     MEM    Rd,[IP,,DISP32/64]     // IP-rel

-----
>
> Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
> (there does seem to be some interest for ELF FDPIC but limited to 32-bit
> RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
> from PBO (namely, using GP for a global section and then chaining the
> sections for each binary).

How are you going to do dense PIC switch() {...} in RISC-V ??

>                            Main difference being that FDPIC uses fat
> function pointers and does the GP reload on the caller, vs PBO where I
> use narrow function pointers and do the reload on the callee (with
> load-time fixups for the PBO Offset).
>
>
> The result of all this is a whole lot of
                                           unnecessary
>                                                      Shifts and ADDs.

> Seemingly, even more for BGBCC than for GCC, which already had a lot of
> shifts and adds.
>
> BGBCC basically entirely dethrones the Load and Store ops ...
>
>
> Possibly more so than GCC, which tended to turn most constant loads into
> memory loads. It would load a table of constants into a register and
> then pull constants from the table, rather than compose them inline.
>
> Say, something like:
>    AUIPC  X18, DispHi
>    ADDI   X18, X18, DispLo
>    (X18 now holds the address of a table of constants in .rodata)
>
> And, when it needs a constant:
>    LW  Xn, X18, Disp  //offset of the constant it wants.
> Or:
>    LD  Xn, X18, Disp  //64-bit constant
>
>
> Currently, BGBCC does not use this strategy.
> Though, for 64-bit constants it could be more compact and faster.
>
> But, better still would be having Jumbo prefixes or similar, or even a
> SHORI instruction.

Better Still Still is having 32-bit and 64-bit constants available
from the instruction stream and positioned in either operand position.

> Say, 64-bit constant-load in SH-5 or similar:
>    xxxxyyyyzzzzwwww
>    MOV   ImmX, Rn
>    SHORI ImmY, Rn
>    SHORI ImmZ, Rn
>    SHORI ImmW, Rn
> Where, one loads the constant in 16-bit chunks.

Yech
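
For reference, a minimal model of the quoted MOV + 3x SHORI sequence. It assumes SHORI computes Rn = (Rn << 16) | imm16 on a 64-bit register, which is my reading of the description above. The function names are illustrative only.

```python
MASK64 = (1 << 64) - 1

def shori(rn, imm16):
    """Assumed SHORI semantics: shift left 16, OR in a 16-bit immediate."""
    return ((rn << 16) | (imm16 & 0xFFFF)) & MASK64

def load_const64(value):
    """Build a 64-bit constant from four 16-bit chunks, high chunk first."""
    chunks = [(value >> shift) & 0xFFFF for shift in (48, 32, 16, 0)]
    rn = chunks[0]              # MOV ImmX, Rn
    for imm in chunks[1:]:      # SHORI ImmY / ImmZ / ImmW, Rn
        rn = shori(rn, imm)
    return rn

print(hex(load_const64(0x0123456789ABCDEF)))
```

That is four dependent instructions for one constant. However MOV extends its immediate, those upper bits are shifted out by the three SHORIs, so only the low 16 bits of the first chunk matter.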

>>
>>
Don't you ever snip anything ??