
Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Misc: Ongoing status...
Date: Thu, 30 Jan 2025 14:00:22 -0600
Organization: A noiseless patient Spider
Lines: 203
Message-ID: <vnglop$33lk0$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 30 Jan 2025 21:00:25 +0100 (CET)
Injection-Info: dont-email.me; posting-host="406ccd3512157dc29eda74131d5502b6";
	logging-data="3266176"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18i+X/SFH0mPbgtHnu+xVCnH3nyVqkPhIE="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:dxaYjwMhxvo5JKbm4DkcECVzZsE=
Content-Language: en-US
Bytes: 8802

So, recent features added to my core ISA: None.
Reason: Not a whole lot that brings much benefit.


Have recently ended up working more on the RISC-V side of things, 
because there are still gains to be made there (it is still more 
buggy, less complete, and slower than XG2).


On the RISC-V side, I did experiment with Branch-compare-Immediate 
instructions, but it is unclear whether I will carry them over:
   They add a non-zero cost to the decoder;
     Cost primarily associated with dealing with a second immediate.
   The effect on performance is very small (< 1%).

In my case, I added them as jumbo-prefixed forms, so:
   BEQI Imm17s, Rs, Disp12s

Also added Store-with-Immediate, with a similar mechanism:
   MOV.L  Imm17s, (Rm, Disp12s*1)
As it basically dropped out for free.

It is also unclear whether this one will be carried over; it also 
gains little, as in most of the store-with-immediate scenarios the 
immediate is 0.


Instructions with less than a 1% gain and no compelling edge case are 
essentially clutter.

I can note that some of the niche ops I did add, like special-case 
RGB555 to Index8 or RGBI, were because they at least had a significant 
effect in one use-case (such as speeding up how quickly the GUI can do 
redraw operations).

My usual preference in these cases is to assign 64-bit encodings, as the 
instructions might only be used in a few edge cases, so it becomes a 
waste to assign them spots in the more valuable 32-bit encoding space.


The more popular option was seemingly another person's proposal to 
define them as 32-bit encodings.
   Their proposal was effectively:
     Bcc Imm5, Rs1', Disp12
       (IOW: a 3-bit register field, in a 32-bit instruction)
   I don't like this; it is very off-balance.
     Better IMO: Bcc Imm6s, Rs1, Disp9s (+/- 512B)
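For what it's worth, both layouts fit the same bit budget; assuming a 
RISC-V-style 7-bit major opcode and 3-bit funct3 (an assumption about 
placement, only the widths matter here), the totals work out the same:

```python
# Quick bit-budget check for the two 32-bit BccI layouts discussed
# above. Assumes a RISC-V-style 7-bit major opcode and 3-bit funct3;
# only the field widths are used, not their positions.

def budget(fields):
    """Total bits consumed by a set of instruction fields."""
    return sum(fields.values())

# Their proposal: 5-bit immediate, 3-bit register, 12-bit displacement.
proposal = {"opcode": 7, "funct3": 3, "imm": 5, "rs1": 3, "disp": 12}
# My preference: 6-bit immediate, full 5-bit register, 9-bit displacement.
mine     = {"opcode": 7, "funct3": 3, "imm": 6, "rs1": 5, "disp": 9}

print(budget(proposal))  # -> 30 (leaves 2 bits in a 32-bit word)
print(budget(mine))      # -> 30 (same total, but a full register field)
```

Same 30 bits of payload either way; the question is just whether you 
spend them on displacement reach or on a usable register field.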

The 3-bit register field also makes it nearly useless with my compiler, 
as my compiler (in its RV mode) primarily uses X18..X27 for variables 
(IOW: the callee save registers). But, maybe moot, as either way it 
would still save less than 1%.

Also, as for any ops with 3-bit registers:
   Would make superscalar harder and more expensive;
   Would add ugly edge cases and cost to the instruction decoder;
   ...

I would prefer it if people did not go that route (and tried to keep 
things at least mostly consistent, avoiding making a dog-chewed mess 
of the 32-bit ISA).

If you really feel the need for 3-bit register fields... Maybe, go to a 
larger encoding?...


When I defined my own version of BccI (with a 64-bit encoding), how many 
new instructions did I need to define in the 32-bit base ISA? Zero.

And, I can also have things like:
   SEQI  Rs, Imm17s, Rd  // Rd = (Rs==Imm17s)
   Etc.
Also without adding anything new to the 32-bit encoding space.



I am personally feeling kinda useless right now, as there is nothing 
new or compelling in this; mostly I am trying to hunt down bugs in my 
Verilog code.

The RISC-V mode is still not entirely stable if jumbo prefixes are used, 
but I can't seem to find the issue.

Comparably, XG3 is even less stable, but I have put it at a lower 
priority than stabilizing RV+Jx (as XG3 depends on RV+Jx being stable).


Seemingly, people are not very accepting of either jumbo prefixes or 
register-indexed load/store, but in my testing, these are the two 
biggest performance improvements.

Did recently change the encoding space for RV jumbo prefixes from 
0000401B to 0000403F, namely, putting them in the 64-bit opcode space, 
because this is (more or less) what they effectively are (this takes 
up 1/8 of the 64-bit encoding space).
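For reference, the standard RISC-V length encoding is what makes 003F 
a 64-bit-space opcode: per the unprivileged ISA spec, the low opcode 
bits determine the instruction length. A minimal sketch of that rule, 
applied to the two opcode words above:

```python
# Sketch of the RISC-V standard instruction-length encoding (from the
# unprivileged ISA spec). 0x...1B has low bits 0011011 (a 32-bit-format
# opcode), while 0x...3F has low bits 0111111, the 64-bit format space.

def insn_length(word):
    if (word & 0b11) != 0b11:
        return 16                        # compressed (RVC) space
    if (word & 0b11111) != 0b11111:
        return 32                        # standard 32-bit space
    if (word & 0b111111) == 0b011111:
        return 48                        # 48-bit space
    if (word & 0b1111111) == 0b0111111:
        return 64                        # 64-bit space
    return None                          # longer/reserved formats

print(insn_length(0x0000401B))  # -> 32
print(insn_length(0x0000403F))  # -> 64
```

So a 32-bit jumbo prefix claiming a 64-bit-format opcode means a naive 
length decoder already sees prefix+suffix as one 64-bit instruction.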



Did recently experiment with a compacted 48-bit encoding (Jumbo-Mini-48, 
or JM48). It more effectively hacks apart the 32-bit instruction and 
some bits from the prefix, and crams them into a 48 bit encoding (using 
1/4 of the 48-bit encoding space).

These basically give Imm22/Disp22 encodings for the existing Imm12 
encodings. A JAL and AUIPC Disp30 case was also considered, but these 
will need to be special-cased (they would have a range of +/- 1GB).

They essentially mirror the existing 32-bit encoding space.
   Just with some special-case decoding tweaks depending on the block.

The mechanism was basically that, alongside the XG3 decoder, I had 
added some logic to dynamically repack the 48-bit encodings into the 
64-bit jumbo-prefixed forms (which are then what the decoder proper 
sees).


Where, the existing/known 48-bit ops had taken the form (partly a 
guess; the existing table was not super clear):
   iii* - yyy0-nnnnn-0011111
     000:   L.LI.48	Imm32, Rn
     001: ? L.ADD.48	Imm32, Rn
     010: ? L.JAL.48	Disp32, X0
     011: ? L.JAL.48	Disp32, X1
     100:   L.SHORI.48	Imm32, Rn
     101: ? L.AND.48	Imm32, Rn
     110: ? L.OR.48	Imm32, Rn
     111: ? L.XOR.48	Imm32, Rn
I had then made a claim of:
   zzz* - yyy1-nnnnn-0011111
For the JM48 encodings.

Seemingly, they had used 1011111 mostly for Disp26 branch-ops.
   IMHO, kind of a waste...


My JM48 encodings, while not giving as big of an immediate, do at least 
also give things like Load/Store instructions. Which, you can't add, if 
you have already burned the entire 48-bit encoding space on 2RI ops and 
Bcc variants...

Had they used Disp24, then one could have both Bcc and basic Load/Store 
ops. Though, realistically, one usually only needs to Bcc within a 
single function, so even 24 bits is overkill (I would be more inclined 
towards, say, 16 or 17 bits).


Or, in my JM48 scheme, for the cost of (only) 22 bit displacements, one 
gets: All of the existing Disp12 ops... (Just with 10 bits glued on).
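The reach numbers here are just the usual power-of-two arithmetic, 
assuming byte-scaled, sign-extended displacements (scaling by element 
size would multiply these accordingly):

```python
# Reach of the sign-extended displacements mentioned above, assuming
# byte-scaled addressing (element-size scaling would multiply these).

def disp_range(bits):
    """Reach of an N-bit signed displacement: +/- 2^(N-1) bytes."""
    return 1 << (bits - 1)

print(disp_range(12))   # -> 2048     (+/- 2KB, the base Disp12)
print(disp_range(22))   # -> 2097152  (+/- 2MB, with 10 bits glued on)
```

So the 10 glued-on bits take the Disp12 ops from +/- 2KB to +/- 2MB of 
reach, which covers most data-section and in-function cases.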

And, more so, in its present form, in under 200 lines of Verilog...



Currently, JM48 encodings can't encode references to R32..R63 (XGPRs), 
but this may not be a huge loss:
   I currently intend it for probable use with RVC;
   Code in RV64GC mode or similar will be very unlikely to also be 
using XGPRs;
   The relative cost increase from a 48-bit to a 64-bit encoding to 
use XGPRs will probably be smaller.

I may consider allowing JM48 with 3R ops to potentially access XGPRs, 
say (with a JV bit):
   JV=0:
     Gives XGPRs and some extended opcode (need ~ 3b + 4b here);
     Still leaves 2 bits remaining, probably MBZ for now.
   JV=1:
     Gives a 14-bit synthesized immediate (with no XGPRs).
       Or, maybe 12-bit (with XGPRs).
       Still, TBD.

It is possible I could have used a combined JV/JT bit for the Imm12 
ops, allowing a choice between Disp21 or Disp17 with XGPRs, but I 
decided to just go with the simpler option of Disp22 with no option of 
XGPRs (if XGPRs are needed, the next option up is a 64-bit encoding, 
where it is a choice between Disp33 or Disp17 with XGPRs).

Or, if the 3R blocks are given XGPRs with synthesized immediate values, 
could use synthesized 12-bit immediate encodings for cases where XGPR is 
needed. Need to decide between these options.


For maximizing performance, I would just assume sticking mostly to 32 
and 64 bit encodings. But, many people are judging the "goodness" more 
========== REMAINDER OF ARTICLE TRUNCATED ==========