
Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Brett <ggtgp@yahoo.com>
Newsgroups: comp.arch
Subject: Re: My 66000 and High word facility
Date: Mon, 12 Aug 2024 02:23:00 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 181
Message-ID: <v9brm4$33kmd$1@dont-email.me>
References: <v98asi$rulo$1@dont-email.me>
 <38055f09c5d32ab77b9e3f1c7b979fb4@www.novabbs.org>
 <v991kh$vu8g$1@dont-email.me>
 <2024Aug11.163333@mips.complang.tuwien.ac.at>
 <v9b57p$2rkrq$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 12 Aug 2024 04:23:00 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="2bd2793f0d26e17c5160cf6119c20726";
	logging-data="3265229"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+1oykO5rula0KhYkytxuZ9"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:NOsxz4HzPnu2schYCp9vMC6KJRU=
	sha1:rIUU2k8IqKXpDSpqtYX7YaRlC1A=
Bytes: 8705

BGB <cr88192@gmail.com> wrote:
> On 8/11/2024 9:33 AM, Anton Ertl wrote:
>> Brett <ggtgp@yahoo.com> writes:
>>> The lack of CPUs with 64 registers is what makes for a market; that 4%
>>> that could benefit have no options to pick from.
>> 
>> They had:
>> 
>> SPARC: Ok, only 32 GPRs available at a time, but more in hardware
>> through the Window mechanism.
>> 
>> AMD29K: IIRC a 128-register stack and 64 additional registers
>> 
>> IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
>> files to make good use of them.
>> 
>> The additional registers obviously did not give these architectures a
>> decisive advantage.
>> 
>> When ARM designed A64, when the RISC-V people designed RISC-V, and
>> when Intel designed APX, each of them had the opportunity to go for 64
>> GPRs, but they decided not to.  Apparently the benefits do not
>> outweigh the disadvantages.
>> 
> 
> In my experience:
>   For most normal code, the advantage of 64 GPRs is minimal;
>   But, there is some code, where it does have an advantage.
>     Mostly involving big loops with lots of variables.
> 
> 
> Sometimes, it is preferable to be able to map functions entirely to 
> registers, and 64 does increase the probability of being able to do so 
> (though, neither achieves 100% of functions; and functions which map 
> entirely to GPRs with 32 will not see an advantage with 64).
> 
> Well, and to some extent the compiler needs to be selective about which 
> functions it allows to use all of the registers, since in some cases a 
> situation can come up where the saving/restoring more registers in the 
> prolog/epilog can cost more than the associated register spills.

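A toy cost model makes that prolog/epilog trade-off concrete (this is my
own illustration, not BGB's actual compiler heuristic): saving and
restoring an extra callee-saved register is a fixed per-call cost, while
a spill/fill pair is paid every time the spill site executes.

```c
#include <assert.h>

/* Toy model (an assumption for illustration, not BGB's compiler
 * heuristic): each extra callee-saved register costs one save in the
 * prolog plus one restore in the epilog; each avoided spill saves a
 * store/load pair at every dynamic execution of the spill site. */
static long extra_save_cost(int extra_regs, long memop_cycles)
{
    return 2L * extra_regs * memop_cycles;      /* save + restore, once */
}

static long spill_cost(long dynamic_spill_count, long memop_cycles)
{
    return 2L * dynamic_spill_count * memop_cycles; /* store+load, each time */
}
```

So grabbing 8 extra registers to kill a spill inside a hot loop pays off
quickly, while doing the same for a spill that executes once per call is
a net loss, which is why the compiler has to be selective.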

Another benefit of 64 registers is that it allows more inlining, removing
calls.

A call can cause a significant amount of garbage code all around the call
site, as it splits your function and burns registers that would otherwise
get used.

I can understand the reluctance to go to 6-bit register specifiers: it
burns up your opcode space and makes encoding everything more difficult.
But today that is an unserviced market, which would get customers to give
you a look. Put out some vaporware and see what customers say.
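To put a number on that opcode-space cost (simple arithmetic, not any
particular ISA's actual layout): in a fixed 32-bit instruction word,
three 6-bit register fields consume 3 more bits than three 5-bit ones.

```c
#include <assert.h>

/* Bits left over for the opcode in a fixed-width instruction word,
 * after carving out 'nregs' register fields of 'regbits' bits each.
 * Pure arithmetic for illustration; real encodings are messier. */
static int opcode_bits(int word_bits, int nregs, int regbits)
{
    return word_bits - nregs * regbits;
}
```

With three operands, 5-bit specifiers leave 17 opcode bits and 6-bit
specifiers leave 14, i.e. 2^3 = 8 times fewer distinct opcodes.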


> But, have noted that 32 GPRs can get clogged up pretty quickly when 
> using them for FP-SIMD and similar (if working with 128-bit vectors as 
> register pairs); or otherwise when working with 128-bit data as pairs.
> 
> Similarly, one can't fit a 4x4 matrix multiply entirely in 32 GPRs, but 
> can in 64 GPRs. Where it takes 8 registers to hold a 4x4 Binary32 
> matrix, and 16 registers to perform a matrix-transpose, ...
> 
> Granted, arguably, doing a matrix-multiply directly in registers using 
> SIMD ops is a bit niche (traditional option being to use scalar 
> operations and fetch numbers from memory using "for()" loops, but this 
> is slower). Most of the programs don't need fast MatMult though.
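The register math above checks out: a 4x4 Binary32 matrix is 16 floats =
64 bytes, i.e. 8 registers at two lanes per 64-bit register, so two
sources, a destination, and 16 registers of transpose scratch overflow
32 GPRs but fit in 64. For reference, a plain scalar C sketch of the
operation being discussed (illustrative only, not BGB's SIMD code):

```c
#include <assert.h>

/* Plain-C sketch of a 4x4 single-precision matrix multiply, row-major.
 * Illustrative only; the post is about doing this with SIMD register
 * pairs, which this scalar version does not attempt. */
static void matmul4x4(const float a[16], const float b[16], float c[16])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;                     /* accumulator for c[i][j] */
            for (int k = 0; k < 4; k++)
                s += a[i * 4 + k] * b[k * 4 + j];
            c[i * 4 + j] = s;
        }
    }
}
```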
> 
> 
> 
> Annoyingly, it has led to my ISA fragmenting into two variants:
>   Baseline: Primarily 32 GPR, 16/32/64/96 encoding;
>     Supports R32..R63 for only a subset of the ISA for 32-bit ops.
>     For ops outside this subset, needs 64-bit encodings in these cases.
>   XG2: Supports R32..R63 everywhere, but loses 16-bit ops.
>     By itself, would be easier to decode than Baseline,
>       as it drops a bunch of wonky edge cases.
>     Though, some cases were dropped from Baseline when XG2 was added.
>       "Op40x2" was dropped as it was hairy and became mostly moot.
> 
> Then, a common subset exists known as Fix32, which can be decoded in 
> both Baseline and XG2 Mode, but only has access to R0..R31.
> 
> 
> Well, and a 3rd sub-variant:
>   XG2RV: Uses XG2's encodings but RISC-V's register space.
>     R0..R31 are X0..X31;
>     R32..R63 are F0..F31.
> 
> Arguably the main use-case for XG2RV mode is for ASM blobs intended to be 
> called natively from RISC-V mode; but...
> 
> It is debatable whether such an operating mode actually makes sense, and 
> it might have made more sense to simply fake it in the ASM parser:
>   ADD R24, R25, R26  //Uses BJX2 register numbering.
>   ADD X14, X15, X16  //Uses RISC-V register remapping.
> 
> Likely, as a sub-mode of either Baseline or XG2 Mode.
> Since, the register remapping scheme is known as part of the ISA spec, 
> it could be done in the assembler.
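The remapping described is simple enough that doing it in the assembler,
as suggested, is a few lines. A hypothetical helper (remap_reg is my
name, not anything from the BJX2 toolchain), assuming the mapping quoted
above: X0..X31 onto R0..R31 and F0..F31 onto R32..R63:

```c
#include <assert.h>

/* Hypothetical assembler-side register remap per the scheme above:
 * BJX2 'R' names pass through (R0..R63); RISC-V 'X' integer registers
 * map to R0..R31 and 'F' FP registers to R32..R63.  Returns -1 for
 * anything out of range. */
static int remap_reg(char cls, int n)
{
    switch (cls) {
    case 'R': return (n >= 0 && n <= 63) ? n : -1;
    case 'X': return (n >= 0 && n <= 31) ? n : -1;
    case 'F': return (n >= 0 && n <= 31) ? 32 + n : -1;
    default:  return -1;
    }
}
```

Under this table the quoted "ADD X14, X15, X16" assembles to the same
register numbers as "ADD R14, R15, R16", with no mode bit needed.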
> 
> It is possible that XG2RV mode may eventually be dropped due to "lack of 
> relevance".
> 
> 
> Well, and similarly any ABI thunks would need to be done in Baseline or 
> XG2 mode, since neither RV mode nor XG2RV Mode has access to all the 
> registers used for argument passing in BJX2.
> In this case, RISC-V mode only has ~ 26 GPRs (the remaining 6, X0..X5, 
> being SPRs or CRs). In the RV modes R0/R4/R5/R14 are inaccessible.
> 
> 
> Well, and likewise one wants to limit the number of inter-ISA branches, 
> as the branch-predictor can't predict these, and they need a full 
> pipeline flush (a few extra cycles are needed to make sure the L1 I$ is 
> fetching in the correct mode). Technically also the L1 I$ needs to flush 
> any cache-lines which were fetched in a different mode (the I$ uses 
> internal tag-bits to figure out things like instruction length and 
> bundling and to try to help with Superscalar in RV mode, *; mostly for 
> timing/latency reasons, ...).
> 
> 
> *: The way the BJX2 core deals with superscalar being to essentially 
> pretend as-if RV64 had WEX flag bits, which can be synthesized partly 
> when fetching cache lines (putting some of the latency in the I$ Miss 
> handling, rather than during instruction-fetch). In the ID stage, it 
> sees the longer PC step and infers that two instructions are being 
> decoded as superscalar.
> 
> ...
> 
> 
>> Where is your 4% number coming from?
>> 
> 
> 
> I guess it could, arguably, make sense to come up with test cases to 
> get a quantitative measurement of the effect of 64 GPRs for programs 
> which can make effective use of them...
> 
> Would be kind of a pain to test as 64 GPR programs couldn't run on a 
> kernel built in 32 GPR mode, but TKRA-GL runs most of its backend in 
> kernel-space (and is the main thing in my case that seems to benefit 
> from 64 GPRs).
> 
> But, technically, a 32 GPR kernel couldn't run RISC-V programs either.
> 
> 
> So, would likely need to switch GLQuake and similar over to baseline 
> mode (and probably messing with "timedemo").
> 
> 
> 
> 
> Checking, as-is, timedemo results for "demo1" are "969 frames 150.5 
> seconds 6.4 fps", but this is with my experimental FP8U HDR mode (would 
> be faster with RGB555 LDR), at 50 MHz.
> 
> GLQuake, LDR RGB555 mode: "969 frames 119.0 seconds 8.1 fps".
> 
> But, yeah, both are with builds that use 64 GPRs.
> 
> 
> Software Quake: "969 frames 147.4 seconds 6.6 fps"
> Software Quake (RV64G): "969 frames 157.3 seconds 6.2 fps"
> 
> Not going to bother with GLQuake in RISC-V mode, would likely take a 
> painfully long time.
> 
> Well, decided to run this test anyways:
>   "969 frames 687.3 seconds 1.4 fps"
> 
> 
> IOW: TKRA-GL runs horribly bad in RV64G mode (and not much can be done 
> to make it fast within the limits of RV64G). Though, this is with it 
> running GL entirely in RV64 mode (it might fare better as a userland 
> application where the GL backend is running in kernel space in BJX2 mode).
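The quoted timedemo figures are at least internally consistent: fps is
just frames divided by seconds, and each reported value matches to the
printed one-decimal precision (plain arithmetic, nothing assumed beyond
the numbers above).

```c
#include <assert.h>
#include <math.h>

/* timedemo reports fps = frames / seconds; the quoted runs all use the
 * same 969-frame demo, so the seconds column alone gives the ranking. */
static double fps(double frames, double seconds)
{
    return frames / seconds;
}
```

E.g. 969 / 687.3 = 1.4 fps for GL in RV64G mode, over 4x slower than
either of the BJX2-mode GLQuake runs quoted above.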
========== REMAINDER OF ARTICLE TRUNCATED ==========