Article <v9b57p$2rkrq$1@dont-email.me>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: My 66000 and High word facility
Date: Sun, 11 Aug 2024 14:59:51 -0500
Organization: A noiseless patient Spider
Lines: 172
Message-ID: <v9b57p$2rkrq$1@dont-email.me>
References: <v98asi$rulo$1@dont-email.me>
 <38055f09c5d32ab77b9e3f1c7b979fb4@www.novabbs.org>
 <v991kh$vu8g$1@dont-email.me> <2024Aug11.163333@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 11 Aug 2024 21:59:54 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="2a951f56a044c306bc06db235766d957";
	logging-data="3003258"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18DwAAYjt7CSrWIfzpWBvsljSXGbb9o2U8="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:gjRtx2L95zv0ZIvhtsuXZ3eKa64=
In-Reply-To: <2024Aug11.163333@mips.complang.tuwien.ac.at>
Content-Language: en-US
Bytes: 7929

On 8/11/2024 9:33 AM, Anton Ertl wrote:
> Brett <ggtgp@yahoo.com> writes:
>> The lack of CPU’s with 64 registers is what makes for a market, that 4%
>> that could benefit have no options to pick from.
> 
> They had:
> 
> SPARC: Ok, only 32 GPRs available at a time, but more in hardware
> through the Window mechanism.
> 
> AMD29K: IIRC a 128-register stack and 64 additional registers
> 
> IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
> files to make good use of them.
> 
> The additional registers obviously did not give these architectures a
> decisive advantage.
> 
> When ARM designed A64, when the RISC-V people designed RISC-V, and
> when Intel designed APX, each of them had the opportunity to go for 64
> GPRs, but they decided not to.  Apparently the benefits do not
> outweigh the disadvantages.
> 

In my experience:
   For most normal code, the advantage of 64 GPRs is minimal;
   But, there is some code, where it does have an advantage.
     Mostly involving big loops with lots of variables.


Sometimes, it is preferable to be able to map functions entirely to 
registers, and 64 does increase the probability of being able to do so 
(though, neither achieves 100% of functions; and functions which map 
entirely to GPRs with 32 will not see an advantage with 64).

Well, and to some extent the compiler needs to be selective about which 
functions it allows to use all of the registers, since in some cases 
saving/restoring more registers in the prolog/epilog can cost more than 
the register spills it avoids.
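
The trade-off above can be sketched as a toy cost model. This is only an illustration: the function names, and the assumption that every store or load costs one unit, are hypothetical and not taken from any actual compiler.

```python
# Hypothetical cost model: every name and unit cost here is an assumption
# made for illustration, not something taken from the BJX2 compiler.

def save_restore_cost(extra_callee_saved: int, mem_op_cost: int = 1) -> int:
    """One store in the prolog plus one load in the epilog per extra register."""
    return 2 * extra_callee_saved * mem_op_cost


def spill_cost(dynamic_spill_ops: int, mem_op_cost: int = 1) -> int:
    """Spill loads/stores executed per call if the extra registers are NOT used."""
    return dynamic_spill_ops * mem_op_cost


def worth_using_extra_regs(extra_callee_saved: int, dynamic_spill_ops: int) -> bool:
    """True when avoiding the spills pays for the extra prolog/epilog traffic."""
    return save_restore_cost(extra_callee_saved) < spill_cost(dynamic_spill_ops)


# A short function with a handful of spills: saving 16 more registers loses.
print(worth_using_extra_regs(16, 6))
# A big loop that would execute hundreds of spill accesses per call: it wins.
print(worth_using_extra_regs(16, 400))
```

The point is only that the comparison is per call and dynamic: a few spills outside any loop rarely justify a larger save/restore set, whereas spills inside a hot loop usually do.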


But, have noted that 32 GPRs can get clogged up pretty quickly when 
using them for FP-SIMD and similar (if working with 128-bit vectors as 
register pairs); or otherwise when working with 128-bit data as pairs.

Similarly, one can't fit a 4x4 matrix multiply entirely in 32 GPRs, but 
can in 64 GPRs. Where it takes 8 registers to hold a 4x4 Binary32 
matrix, and 16 registers to perform a matrix-transpose, ...

Granted, arguably, doing a matrix-multiply directly in registers using 
SIMD ops is a bit niche (the traditional option being to use scalar 
operations and fetch numbers from memory using "for()" loops, but this 
is slower). Most programs don't need fast MatMult, though.
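
The register counts above follow from simple arithmetic, sketched below. The 8 and 16 figures are the ones given in the text; counting the multiply as two inputs plus one result (24 registers) is my own assumption about how the operands are held, and it ignores any temporaries.

```python
# Register-budget arithmetic for a 4x4 Binary32 matrix held in 64-bit GPRs.
# Assumption: counting the multiply as A + B + result is an illustration;
# the text only gives the 8-register and 16-register figures.

GPR_BITS = 64    # 128-bit vectors are described as register pairs
ELEM_BITS = 32   # Binary32


def regs_for_matrix(rows: int = 4, cols: int = 4) -> int:
    """Number of GPRs needed to hold a rows x cols Binary32 matrix."""
    total_bits = rows * cols * ELEM_BITS
    return -(-total_bits // GPR_BITS)  # ceiling division


per_matrix = regs_for_matrix()     # one 4x4 matrix
transpose_regs = 2 * per_matrix    # source plus destination
matmul_regs = 3 * per_matrix       # two inputs plus the result (assumed)

for gprs in (32, 64):
    left_over = gprs - matmul_regs
    print(f"{gprs} GPRs: {matmul_regs} for matrices, {left_over} left over")
```

Under this accounting, a 32 GPR machine has only a handful of registers left for the stack pointer, addresses, and intermediate products, which is consistent with the multiply not fitting; a 64 GPR machine has plenty of slack.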



Annoyingly, it has led to my ISA fragmenting into two variants:
   Baseline: Primarily 32 GPR, 16/32/64/96 encoding;
     Supports R32..R63 for only a subset of the ISA for 32-bit ops.
     For ops outside this subset, needs 64-bit encodings in these cases.
   XG2: Supports R32..R63 everywhere, but loses 16-bit ops.
     By itself, would be easier to decode than Baseline,
       as it drops a bunch of wonky edge cases.
     Though, some cases were dropped from Baseline when XG2 was added.
       "Op40x2" was dropped as it was hairy and became mostly moot.

Then, a common subset exists known as Fix32, which can be decoded in 
both Baseline and XG2 Mode, but only has access to R0..R31.


Well, and a 3rd sub-variant:
   XG2RV: Uses XG2's encodings but RISC-V's register space.
     R0..R31 are X0..X31;
     R32..R63 are F0..F31.

Arguably, the main use-case for XG2RV mode is ASM blobs intended to be 
called natively from RISC-V mode; but...

It is debatable whether such an operating mode actually makes sense, and 
it might have made more sense to simply fake it in the ASM parser:
   ADD R24, R25, R26  //Uses BJX2 register numbering.
   ADD X14, X15, X16  //Uses RISC-V register remapping.

Likely, as a sub-mode of either Baseline or XG2 Mode.
Since, the register remapping scheme is known as part of the ISA spec, 
it could be done in the assembler.

It is possible that XG2RV mode may eventually be dropped due to "lack of 
relevance".


Well, and similarly any ABI thunks would need to be done in Baseline or 
XG2 mode, since neither RV mode nor XG2RV Mode has access to all the 
registers used for argument passing in BJX2.
In this case, RISC-V mode only has ~ 26 GPRs (the remaining 6, X0..X5, 
being SPRs or CRs). In the RV modes R0/R4/R5/R14 are inaccessible.


Well, and likewise one wants to limit the number of inter-ISA branches, 
as the branch-predictor can't predict these, and they need a full 
pipeline flush (a few extra cycles are needed to make sure the L1 I$ is 
fetching in the correct mode). Technically also the L1 I$ needs to flush 
any cache-lines which were fetched in a different mode (the I$ uses 
internal tag-bits to figure out things like instruction length and 
bundling and to try to help with Superscalar in RV mode, *; mostly for 
timing/latency reasons, ...).


*: The way the BJX2 core deals with superscalar is to essentially 
pretend as-if RV64 had WEX flag bits, which can be synthesized partly 
when fetching cache lines (putting some of the latency in the I$ Miss 
handling, rather than during instruction-fetch). In the ID stage, it 
sees the longer PC step and infers that two instructions are being 
decoded as superscalar.
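
As a rough software model of the idea in the footnote: pairing bits get computed once when a cache line is filled, and fetch then just steps the PC by one or two instructions based on those bits. Everything in this sketch is an assumption for illustration: the pre-decoded `Insn` tuples, the pairing rule (no read-after-write or write-after-write hazard between neighbours), and the greedy left-to-right pairing. The actual rules used by the core are not described here.

```python
from typing import List, NamedTuple, Optional, Tuple


class Insn(NamedTuple):
    """Hypothetical pre-decoded 32-bit instruction."""
    rd: Optional[int]        # destination register, None if there is none
    srcs: Tuple[int, ...]    # source registers


def can_pair(first: Insn, second: Insn) -> bool:
    """Assumed rule: pair unless the second insn depends on the first."""
    # x0 is hardwired to zero in RISC-V, so writes to it create no hazard.
    if first.rd is not None and first.rd != 0:
        if first.rd in second.srcs:   # read-after-write
            return False
        if second.rd == first.rd:     # write-after-write
            return False
    return True


def synthesize_pair_bits(line: List[Insn]) -> List[bool]:
    """Done at line-fill time: bits[i] means insn i issues with insn i+1."""
    bits = [False] * len(line)
    i = 0
    while i + 1 < len(line):
        if can_pair(line[i], line[i + 1]):
            bits[i] = True
            i += 2
        else:
            i += 1
    return bits


def pc_steps(bits: List[bool], insn_bytes: int = 4) -> List[int]:
    """Done at fetch time: a longer PC step signals a two-wide bundle."""
    steps = []
    i = 0
    while i < len(bits):
        if bits[i]:
            steps.append(2 * insn_bytes)
            i += 2
        else:
            steps.append(insn_bytes)
            i += 1
    return steps


line = [
    Insn(5, (6, 7)),     # independent of the next insn
    Insn(8, (9, 10)),
    Insn(11, (8,)),      # the next insn reads x11, so no pairing
    Insn(12, (11,)),
]
bits = synthesize_pair_bits(line)
print(bits, pc_steps(bits))
```

The dependency check is paid once per I$ miss rather than on every fetch, which is the latency trade described in the footnote.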

....


> Where is your 4% number coming from?
> 


I guess it could make sense, arguably, to try to come up with test cases 
to try to get a quantitative measurement of the effect of 64 GPRs for 
programs which can make effective use of them...

Would be kind of a pain to test as 64 GPR programs couldn't run on a 
kernel built in 32 GPR mode, but TKRA-GL runs most of its backend in 
kernel-space (and is the main thing in my case that seems to benefit 
from 64 GPRs).

But, technically, a 32 GPR kernel couldn't run RISC-V programs either.


So, would likely need to switch GLQuake and similar over to baseline 
mode (and probably mess with "timedemo").




Checking, as-is, timedemo results for "demo1" are "969 frames 150.5 
seconds 6.4 fps", but this is with my experimental FP8U HDR mode (would 
be faster with RGB555 LDR), at 50 MHz.

GLQuake, LDR RGB555 mode: "969 frames 119.0 seconds 8.1 fps".

But, yeah, both are with builds that use 64 GPRs.


Software Quake: "969 frames 147.4 seconds 6.6 fps"
Software Quake (RV64G): "969 frames 157.3 seconds 6.2 fps"

Not going to bother with GLQuake in RISC-V mode, would likely take a 
painfully long time.

Well, decided to run this test anyways:
   "969 frames 687.3 seconds 1.4 fps"


IOW: TKRA-GL runs horribly in RV64G mode (and not much can be done 
to make it fast within the limits of RV64G). Though, this is with it 
running GL entirely in RV64 mode (it might fare better as a userland 
application where the GL backend is running in kernel space in BJX2 mode).

Though, much of this is likely due more to RV64G's lack of SIMD and 
similar, rather than due to having fewer GPRs.
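
For reference, the fps figures follow directly from frames / seconds, and the same numbers can be turned into a rough cycles-per-frame estimate. The sketch below just redoes that arithmetic; the run labels are mine, and it assumes the 50 MHz clock applies to every run, which is only stated explicitly for the first one.

```python
# Recompute fps and estimate cycles per frame from the timedemo results.
# Assumption: all runs were at 50 MHz (stated only for the first result).

CLOCK_HZ = 50_000_000

# label: (frames, seconds, fps as reported by timedemo)
RUNS = {
    "GLQuake, FP8U HDR": (969, 150.5, 6.4),
    "GLQuake, LDR RGB555": (969, 119.0, 8.1),
    "Software Quake": (969, 147.4, 6.6),
    "Software Quake (RV64G)": (969, 157.3, 6.2),
    "GLQuake (RV64G)": (969, 687.3, 1.4),
}


def fps(frames: int, seconds: float) -> float:
    return frames / seconds


def cycles_per_frame(frames: int, seconds: float, clock_hz: int = CLOCK_HZ) -> float:
    return seconds * clock_hz / frames


for label, (frames, seconds, reported) in RUNS.items():
    print(f"{label:24s} {fps(frames, seconds):5.2f} fps "
          f"(reported {reported}), "
          f"{cycles_per_frame(frames, seconds) / 1e6:6.2f} Mcycles/frame")
```

Since the frame count is the same in every run, the ratio between any two runs is just the ratio of their elapsed seconds.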

....


> - anton