Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Misc: BGBCC targeting RV64G, initial results...
Date: Tue, 8 Oct 2024 14:06:34 -0500
Organization: A noiseless patient Spider
Lines: 134
Message-ID: <ve3vru$29pnk$1@dont-email.me>
References: <vd5uvd$mdgn$1@dont-email.me> <vd69n0$o0aj$1@dont-email.me>
 <vd6tf8$r27h$1@dont-email.me>
 <1b8c005f36fd5a86532103a8fb6a9ad6@www.novabbs.org>
 <vd7gk6$tquh$1@dont-email.me>
 <abf735f7cab1885028cc85bf34130fe9@www.novabbs.org>
 <vd80r8$148fc$1@dont-email.me>
 <58bd95eee31b53933be111d0d941203a@www.novabbs.org>
 <vdd1s0$22tpk$1@dont-email.me> <vdgh8i$2m458$1@dont-email.me>
 <vdlk51$3lm0a$1@dont-email.me>
 <dd19cb13c16cec5913df46da8083c867@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 08 Oct 2024 21:06:39 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="a2e4a2be0156a2cdc1eabb2ae5cfde0b";
	logging-data="2418420"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/TLPf2rUqw//p1quf7Y4gD57FNuyZo/Ek="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Qdg1X7TbSVnki8heA26hEnJhBJk=
Content-Language: en-US
In-Reply-To: <dd19cb13c16cec5913df46da8083c867@www.novabbs.org>
Bytes: 6886

On 10/5/2024 6:10 PM, MitchAlsup1 wrote:
> On Thu, 3 Oct 2024 8:20:46 +0000, BGB wrote:
> 
>> On 10/1/2024 5:00 AM, Robert Finch wrote:
>>> On 2024-09-29 10:19 p.m., BGB wrote:
>>>>>>> The ADD is not necessary if min == 0
>>>>>>>
>>>>>>> The JTT instruction compared Rt with 0 on the low side and max
>>>>>>> on the high side. If Ri is out of bounds, default is selected.
>>>>>>>
>>>>>>> The table displacements come in {B,H,W,D} selected in the JTT
>>>>>>> (jump through table) instruction. Rt indexes the table, its
>>>>>>> signed value is <<2 and added to address which happens to be
>>>>>>> address of JTT instruction + #(max+1)<<entry. {{The table is
>>>>>>> fetched through the ICache with execute permission}}
>>>>>>>
>>>>>>> Thus, the table is PIC; and generally 1/4 the size of typical
>>>>>>> switch tables.
> 
>>> How well does JTT work with large tables? What if there are several
>>> hundred table entries?
> 
> Tables can have 2^16-1 (65534) case entries.
> 

There is no hard limit in my case, but BGBCC has generally broken up 
tables larger than 256 entries.

IIRC, this was because larger tables were more likely to have "voids" 
(runs of entries that just branch to default) while still being over a 
75% density threshold. Splitting the table apart was more likely to 
expose these voids, and made the binary smaller.

I guess a possible tweak could be to allow a table-jump regardless of 
size if the density is over, say, 93% or so.

Though, for the most part, the programs I am testing with tend not to 
have many large switch blocks, and for a "switch" on a byte (the most 
common case here), a limit of 256 works.


Meanwhile, for some other cases, like a switch full of TWOCC and FOURCC 
values, one really does not want a jump table (but this is avoided 
naturally, as these cases tend to have a very low density).



>>> For Q+ indirect jump the values loaded from the table replace the low
>>> order bits of the PC instead of being a displacement. Only {W,T,O} are
>>> supported. (W=wyde,T=tetra,O=octa). Should add an option for
>>> displacements. Borrowed the memory indirect jump from the 68k.
> 
> My 66000 Loads the table entry directly into IP in 1 cycle less
> than LD latency.
> 

I guess a specialized Load+Branch could potentially have less latency 
than a separate load+branch pair, or than the current strategy of 
double-branching.

Though, double-branch doesn't require anything special from the HW, and 
both BJX2 and RV can do it, though it is a little more wonky on RV.


Using a table of displacements (16 or 32 bit) could work, but would 
require a special type of reloc, basically: *DestWord = (TgtLbl-RefLbl).


IOW:
A sort of "what now?" scenario has come up.

Gluing jumbo prefixes and similar onto RV64 does seem to help with 
performance, but still fails to match BJX2 here (though BJX2 also has a 
more complicated instruction decoder than RISC-V). It is an open 
question at the moment whether the cost of the BJX2 instruction decoder 
is significant.

May need to get around to trying to implement the word-transposed 
decoder thing and see if this can reduce LUT cost.


I still have doubts that anyone in RV land is likely to take the 
jumbo-prefix strategy seriously. Though (apparently), one of the 
higher-up people did mention having considered a similar-sounding 
approach.

Albeit seemingly with the prefixes handled as separately decoded 
instructions, with hidden internal state kept in a register. This is 
less desirable as I see it: the cost of the EX-side logic to deal with 
this state being exposed as a CSR is likely to be higher than the cost 
of always decoding the pair as a single larger conjoined instruction 
(which requires that the prefix always directly precede the instruction 
it modifies).


There is also a temptation to add "sign-extend 16-bit short" and 
"sign-extend byte" ops, which as-is still require a pair of shift 
operations (though at least the jumbo prefixes allow encoding the 
zero-extension of an unsigned short as "AND Rn, Rm, 0xFFFF"...).

Likewise, "ADD.UW Rd, Rs, Rt" exists as part of Zba, and I had defined 
jumbo rules to allow faking an "ADD.UW Rn, Rm, Imm" case...


TBD whether I come up with some encoding scheme to map my existing SIMD 
system over to RV64, or instead add converter helper ops.

What I would want would be a bit different from the V extension though.

Current thinking:
   The scalar Single-precision ops will work on 2x Binary32
     (as in my current implementation);
   Add scalar half-precision ops within RV64's existing scheme:
     But, these will work on 4x Binary16;
   Will likely use jumbo-prefixed encodings for most other stuff.
     Likely, most of my SIMD ops would exist in the F registers.

The actual 'V' extension would require a whole new chunk of register 
space, which I don't want to deal with.

So, say:
   FADD.S Fd, Fs, Ft  //Secretly a SIMD op
   FADD.H Fd, Fs, Ft  //Also secretly a SIMD op
   Can use FLD/FSD etc as before.

Would likely use mostly ops encoded via prefixes, specifics TBD; may 
define short-hands in the User2 / User3 blocks (7'h5B / 7'h7B).
Most likely this would be more relevant for common-ish packed 4x Int16 
ops (like PADD.W, though this may be renamed to PIADD.H or similar for 
more consistency with RV naming conventions).


....


>>