Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Misc: BGBCC targeting RV64G, initial results...
Date: Wed, 9 Oct 2024 00:40:02 -0500
Organization: A noiseless patient Spider
Lines: 105
Message-ID: <ve54vi$2i0bh$1@dont-email.me>
References: <vd5uvd$mdgn$1@dont-email.me> <vd69n0$o0aj$1@dont-email.me>
 <vd6tf8$r27h$1@dont-email.me>
 <1b8c005f36fd5a86532103a8fb6a9ad6@www.novabbs.org>
 <vd7gk6$tquh$1@dont-email.me>
 <abf735f7cab1885028cc85bf34130fe9@www.novabbs.org>
 <vd80r8$148fc$1@dont-email.me>
 <58bd95eee31b53933be111d0d941203a@www.novabbs.org>
 <vdd1s0$22tpk$1@dont-email.me> <vdgh8i$2m458$1@dont-email.me>
 <vdlk51$3lm0a$1@dont-email.me>
 <dd19cb13c16cec5913df46da8083c867@www.novabbs.org>
 <ve3vru$29pnk$1@dont-email.me>
 <b9e523bdf11b3422c719ce2d8ad4c9d4@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 09 Oct 2024 07:40:02 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="5a0c6d63c9b45fdf5260c4b6d0071564";
	logging-data="2687345"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+OvyhEAnV9gTQ+8vrkD9+57LHT5UzAJDs="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:4ODit6Pml+nkCa974KMjLDnBGFU=
In-Reply-To: <b9e523bdf11b3422c719ce2d8ad4c9d4@www.novabbs.org>
Content-Language: en-US
Bytes: 5226

On 10/8/2024 3:48 PM, MitchAlsup1 wrote:
> On Tue, 8 Oct 2024 19:06:34 +0000, BGB wrote:
> 
>> On 10/5/2024 6:10 PM, MitchAlsup1 wrote:
>>> On Thu, 3 Oct 2024 8:20:46 +0000, BGB wrote:
>>>
>>>>> How well does JTT work with large tables? What if there are several
>>>>> hundred table entries?
>>>
>>> Tables can have 2^16-1 (65534) case entries.
>>>
>>
>> There is no hard-limit in my case, but BGBCC had generally broken up
>> tables larger than 256.
>>
>> IIRC, this was because larger tables were more likely to have "voids"
>> which could lead to a large number of branches to default, while still
>> being over a 75% density threshold, splitting the table apart was more
>> likely to expose these voids (and make the binary smaller).
> 
> Yes, one would expect voids in tables would cause the compiler to
> break the switch table into more dense chunks.
> 

At present, there isn't anything to detect voids directly, but, density 
percentage is easy to detect:
   (100*count)/(max-min)
....

But, as the count increases, the probability of large voids escaping 
notice also increases.

Say, at 256, one could potentially have a hidden void of 64 labels 
(separating two regions at 100% density) without triggering a split.


At present, it always splits the table in half (in terms of case-label 
count). Possibly, it could try to probe each point of the table and 
detect if there is another point where the density of each sub-table is 
somewhat higher than if the table were split evenly in half.


>> I guess a possible tweak could be, say, if the density is over 93% or
>> so, it will allow a table-jump regardless of size.
> 
> If the void is greater than the overhead to break the table, then
> the table should be broken.
> 

To some amount, everything is heuristics.

I guess also possible is to make it such that a the required density (to 
avoid split) is correlated to table size.

Say (table size, minimum density):
   128, 75%
   256, 80%
   512, 85%
   1024, 90%
   2048, 95%

Would likely need to fiddle a bit with this though.


>> Though, for the most part the programs I am testing with tend not to
>> really have too many large switch blocks, and in the case of "switch"
>> with a byte (most common case in these cases), a limit of 256 works.
>>
>>
>> Meanwhile, for some other cases, like a switch full of TWOCC and FOURCC
>> values, one really does not want to use a jump table (but, this is
>> avoided as these cases tend to have a very low density).
>>
>>
>>
>>>>> For Q+ indirect jump the values loaded from the table replace the low
>>>>> order bits of the PC instead of being a displacement. Only {W,T,O} are
>>>>> supported. (W=wyde,T=tetra,O=octa). Should add an option for
>>>>> displacements. Borrowed the memory indirect jump from the 68k.
>>>
>>> My 66000 Loads the table entry directly into IP in 1 cycle less
>>> than LD latency.
>>>
>>
>> I guess, specialized Load+Branch could potentially have less latency
>> than separate load+branch, or the current strategy of double-branching.
> 
> Think of it as a LD IP,[address]


Possibly. Early on, I had a RET instruction that essentially did:
   MOV @SP+, PC
But, at the time Load/Store operations were not pipelined.
   EX:
     Generated address, submit address and access request to L1 cache;
     Wait for response to become OK;
     Do whatever with result.

Now, Load/Store is pipelined, but also branches are initiated from EX1, 
but no Load result until EX2 (special case, aligned-only DWORD/QWORD) or 
EX3 (generic).

Would need a mechanism to allow initiating a branch from EX3.