Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail From: BGB Newsgroups: comp.arch Subject: Re: Misc: BGBCC targeting RV64G, initial results... Date: Wed, 9 Oct 2024 00:40:02 -0500 Organization: A noiseless patient Spider Lines: 105 Message-ID: References: <1b8c005f36fd5a86532103a8fb6a9ad6@www.novabbs.org> <58bd95eee31b53933be111d0d941203a@www.novabbs.org> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Injection-Date: Wed, 09 Oct 2024 07:40:02 +0200 (CEST) Injection-Info: dont-email.me; posting-host="5a0c6d63c9b45fdf5260c4b6d0071564"; logging-data="2687345"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+OvyhEAnV9gTQ+8vrkD9+57LHT5UzAJDs=" User-Agent: Mozilla Thunderbird Cancel-Lock: sha1:4ODit6Pml+nkCa974KMjLDnBGFU= In-Reply-To: Content-Language: en-US Bytes: 5226 On 10/8/2024 3:48 PM, MitchAlsup1 wrote: > On Tue, 8 Oct 2024 19:06:34 +0000, BGB wrote: > >> On 10/5/2024 6:10 PM, MitchAlsup1 wrote: >>> On Thu, 3 Oct 2024 8:20:46 +0000, BGB wrote: >>> >>>>> How well does JTT work with large tables? What if there are several >>>>> hundred table entries? >>> >>> Tables can have 2^16-1 (65534) case entries. >>> >> >> There is no hard-limit in my case, but BGBCC had generally broken up >> tables larger than 256. >> >> IIRC, this was because larger tables were more likely to have "voids" >> which could lead to a large number of branches to default, while still >> being over a 75% density threshold, splitting the table apart was more >> likely to expose these voids (and make the binary smaller). > > Yes, one would expect voids in tables would cause the compiler to > break the switch table into more dense chunks. > At present, there isn't anything to detect voids directly, but, density percentage is easy to detect: (100*count)/(max-min) .... But, as the count increases, the probability of large voids escaping notice also increases. Say, at 256, one could potentially have a hidden void of 64 labels (separating two regions at 100% density) without triggering a split. At present, it always splits the table in half (in terms of case-label count). Possibly, it could try to probe each point of the table and detect if there is another point where the density of each sub-table is somewhat higher than if the table were split evenly in half. >> I guess a possible tweak could be, say, if the density is over 93% or >> so, it will allow a table-jump regardless of size. > > If the void is greater than the overhead to break the table, then > the table should be broken. > To some amount, everything is heuristics. I guess also possible is to make it such that a the required density (to avoid split) is correlated to table size. Say (table size, minimum density): 128, 75% 256, 80% 512, 85% 1024, 90% 2048, 95% Would likely need to fiddle a bit with this though. >> Though, for the most part the programs I am testing with tend not to >> really have too many large switch blocks, and in the case of "switch" >> with a byte (most common case in these cases), a limit of 256 works. >> >> >> Meanwhile, for some other cases, like a switch full of TWOCC and FOURCC >> values, one really does not want to use a jump table (but, this is >> avoided as these cases tend to have a very low density). >> >> >> >>>>> For Q+ indirect jump the values loaded from the table replace the low >>>>> order bits of the PC instead of being a displacement. Only {W,T,O} are >>>>> supported. (W=wyde,T=tetra,O=octa). Should add an option for >>>>> displacements. Borrowed the memory indirect jump from the 68k. >>> >>> My 66000 Loads the table entry directly into IP in 1 cycle less >>> than LD latency. >>> >> >> I guess, specialized Load+Branch could potentially have less latency >> than separate load+branch, or the current strategy of double-branching. > > Think of it as a LD IP,[address] Possibly. Early on, I had a RET instruction that essentially did: MOV @SP+, PC But, at the time Load/Store operations were not pipelined. EX: Generated address, submit address and access request to L1 cache; Wait for response to become OK; Do whatever with result. Now, Load/Store is pipelined, but also branches are initiated from EX1, but no Load result until EX2 (special case, aligned-only DWORD/QWORD) or EX3 (generic). Would need a mechanism to allow initiating a branch from EX3.