Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Brett <ggtgp@yahoo.com>
Newsgroups: comp.arch
Subject: Re: My 66000 and High word facility
Date: Mon, 12 Aug 2024 02:23:00 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 181
Message-ID: <v9brm4$33kmd$1@dont-email.me>
References: <v98asi$rulo$1@dont-email.me> <38055f09c5d32ab77b9e3f1c7b979fb4@www.novabbs.org> <v991kh$vu8g$1@dont-email.me> <2024Aug11.163333@mips.complang.tuwien.ac.at> <v9b57p$2rkrq$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 12 Aug 2024 04:23:00 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="2bd2793f0d26e17c5160cf6119c20726"; logging-data="3265229"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+1oykO5rula0KhYkytxuZ9"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:NOsxz4HzPnu2schYCp9vMC6KJRU= sha1:rIUU2k8IqKXpDSpqtYX7YaRlC1A=
Bytes: 8705

BGB <cr88192@gmail.com> wrote:
> On 8/11/2024 9:33 AM, Anton Ertl wrote:
>> Brett <ggtgp@yahoo.com> writes:
>>> The lack of CPUs with 64 registers is what makes for a market; that 4%
>>> that could benefit have no options to pick from.
>>
>> They had:
>>
>> SPARC: Ok, only 32 GPRs available at a time, but more in hardware
>> through the Window mechanism.
>>
>> AMD29K: IIRC a 128-register stack and 64 additional registers.
>>
>> IA-64: 128 GPRs and 128 FPRs, with a register stack and rotating
>> register files to make good use of them.
>>
>> The additional registers obviously did not give these architectures a
>> decisive advantage.
>>
>> When ARM designed A64, when the RISC-V people designed RISC-V, and
>> when Intel designed APX, each of them had the opportunity to go for 64
>> GPRs, but they decided not to. Apparently the benefits do not
>> outweigh the disadvantages.
>
> In my experience:
> For most normal code, the advantage of 64 GPRs is minimal;
> But, there is some code where it does have an advantage,
> mostly involving big loops with lots of variables.
>
> Sometimes it is preferable to be able to map functions entirely to
> registers, and 64 does increase the probability of being able to do so
> (though neither achieves 100% of functions, and functions which map
> entirely to GPRs with 32 will not see an advantage with 64).
>
> Well, and to some extent the compiler needs to be selective about which
> functions it allows to use all of the registers, since in some cases
> saving/restoring more registers in the prolog/epilog can cost more than
> the associated register spills.

Another benefit of 64 registers is more inlining, which removes calls.
A call can cause a significant amount of garbage code all around that
call, as it splits your function and burns registers that would
otherwise get used.

I can understand the reluctance to go to 6-bit register specifiers: they
burn up your opcode space and make encoding everything more difficult.

But today that is an unserviced market, which will get customers to give
you a look. Put out some vaporware and see what customers say.

> But, I have noted that 32 GPRs can get clogged up pretty quickly when
> using them for FP-SIMD and similar (if working with 128-bit vectors as
> register pairs), or otherwise when working with 128-bit data as pairs.
>
> Similarly, one can't fit a 4x4 matrix multiply entirely in 32 GPRs, but
> can in 64 GPRs. Where it takes 8 registers to hold a 4x4 Binary32
> matrix, and 16 registers to perform a matrix-transpose, ...
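
To put rough numbers on that, here is a quick C sketch (mine, not BGB's
code; mat4_mul is just an illustrative name) of the scalar "for()"-loop
version he mentions below, with my reading of the register arithmetic
in the comment:

/* Rough sketch, not BGB's code: the "traditional" scalar for()-loop
 * 4x4 multiply, plus the register budget for the register-resident
 * SIMD version, as I read the figures above.
 *
 * One 4x4 Binary32 matrix is 16 floats = 64 bytes = 8 x 64-bit GPRs
 * (two floats packed per register).  Keeping A, B, and C resident is
 * 24 registers; transposing B in registers adds another 8 (16 live
 * across the transpose), so roughly 32 registers of matrix data
 * before counting pointers and temporaries.  That spills with 32
 * GPRs but fits comfortably with 64.
 */
void mat4_mul(float c[16], const float a[16], const float b[16])
{
    int i, j, k;
    for (i = 0; i < 4; i++) {
        for (j = 0; j < 4; j++) {
            float s = 0.0f;
            for (k = 0; k < 4; k++)
                s += a[i*4 + k] * b[k*4 + j];
            c[i*4 + j] = s;
        }
    }
}

A fully unrolled register-pair version would replace the inner-loop
memory fetches with packed SIMD ops, which is the case being described.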
> Granted, arguably, doing a matrix-multiply directly in registers using
> SIMD ops is a bit niche (the traditional option being to use scalar
> operations and fetch numbers from memory using "for()" loops, but this
> is slower). Most programs don't need fast MatMult though.
>
>
> Annoyingly, it has led to my ISA fragmenting into two variants:
> Baseline: Primarily 32 GPR, 16/32/64/96 encoding;
>   Supports R32..R63 for only a subset of the ISA for 32-bit ops.
>   For ops outside this subset, 64-bit encodings are needed.
> XG2: Supports R32..R63 everywhere, but loses 16-bit ops.
>   By itself, would be easier to decode than Baseline,
>   as it drops a bunch of wonky edge cases.
>   Though, some cases were dropped from Baseline when XG2 was added;
>   "Op40x2" was dropped as it was hairy and became mostly moot.
>
> Then, a common subset exists known as Fix32, which can be decoded in
> both Baseline and XG2 Mode, but only has access to R0..R31.
>
> Well, and a 3rd sub-variant:
> XG2RV: Uses XG2's encodings but RISC-V's register space.
>   R0..R31 are X0..X31;
>   R32..R63 are F0..F31.
>
> Arguably the main use-case for XG2RV mode is for ASM blobs intended to
> be called natively from RISC-V mode; but...
>
> It is debatable whether such an operating mode actually makes sense,
> and it might have made more sense to simply fake it in the ASM parser:
>   ADD R24, R25, R26  //Uses BJX2 register numbering.
>   ADD X14, X15, X16  //Uses RISC-V register remapping.
>
> Likely as a sub-mode of either Baseline or XG2 Mode. Since the
> register remapping scheme is known as part of the ISA spec, it could
> be done in the assembler.
>
> It is possible that XG2RV mode may eventually be dropped due to "lack
> of relevance".
>
> Well, and similarly any ABI thunks would need to be done in Baseline or
> XG2 mode, since neither RV mode nor XG2RV Mode has access to all the
> registers used for argument passing in BJX2.
> In this case, RISC-V mode only has ~26 GPRs (the remaining 6, X0..X5,
> being SPRs or CRs). In the RV modes R0/R4/R5/R14 are inaccessible.
>
> Well, and likewise one wants to limit the number of inter-ISA branches,
> as the branch-predictor can't predict these, and they need a full
> pipeline flush (a few extra cycles are needed to make sure the L1 I$ is
> fetching in the correct mode). Technically the L1 I$ also needs to
> flush any cache-lines which were fetched in a different mode (the I$
> uses internal tag-bits to figure out things like instruction length and
> bundling, and to try to help with superscalar in RV mode *; mostly for
> timing/latency reasons, ...).
>
> *: The way the BJX2 core deals with superscalar is essentially to
> pretend as if RV64 had WEX flag bits, which can be synthesized partly
> when fetching cache lines (putting some of the latency in the I$ miss
> handling, rather than during instruction-fetch). In the ID stage, it
> sees the longer PC step and infers that two instructions are being
> decoded as superscalar.
>
> ...
>
>> Where is your 4% number coming from?
>
> I guess it could make sense, arguably, to try to come up with test
> cases to get a quantitative measurement of the effect of 64 GPRs for
> programs which can make effective use of them...
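
If you want a strawman test case for that, something like the kernel
below might do (hypothetical, not from anyone's test suite; the name
many_live and the accumulator count are arbitrary). It keeps roughly
39 values live across the loop, so a 32-GPR target has to spill while
a 64-GPR target does not:

/* Hypothetical microbenchmark sketch: a reduction with 36 independent
 * accumulators, all live across the loop along with p, i, and n.
 * With 32 GPRs part of the working set must live in memory; with 64
 * GPRs it all stays in registers. */
#include <stdint.h>

uint64_t many_live(const uint64_t *p, long n)
{
    uint64_t a0 = 0,  a1 = 0,  a2 = 0,  a3 = 0,  a4 = 0,  a5 = 0;
    uint64_t a6 = 0,  a7 = 0,  a8 = 0,  a9 = 0,  a10 = 0, a11 = 0;
    uint64_t a12 = 0, a13 = 0, a14 = 0, a15 = 0, a16 = 0, a17 = 0;
    uint64_t a18 = 0, a19 = 0, a20 = 0, a21 = 0, a22 = 0, a23 = 0;
    uint64_t a24 = 0, a25 = 0, a26 = 0, a27 = 0, a28 = 0, a29 = 0;
    uint64_t a30 = 0, a31 = 0, a32 = 0, a33 = 0, a34 = 0, a35 = 0;
    long i;

    for (i = 0; i + 36 <= n; i += 36) {
        a0  += p[i+0]  * 3;   a1  += p[i+1]  * 5;   a2  += p[i+2]  * 7;
        a3  += p[i+3]  * 9;   a4  += p[i+4]  * 11;  a5  += p[i+5]  * 13;
        a6  += p[i+6]  * 15;  a7  += p[i+7]  * 17;  a8  += p[i+8]  * 19;
        a9  += p[i+9]  * 21;  a10 += p[i+10] * 23;  a11 += p[i+11] * 25;
        a12 += p[i+12] * 27;  a13 += p[i+13] * 29;  a14 += p[i+14] * 31;
        a15 += p[i+15] * 33;  a16 += p[i+16] * 35;  a17 += p[i+17] * 37;
        a18 += p[i+18] * 39;  a19 += p[i+19] * 41;  a20 += p[i+20] * 43;
        a21 += p[i+21] * 45;  a22 += p[i+22] * 47;  a23 += p[i+23] * 49;
        a24 += p[i+24] * 51;  a25 += p[i+25] * 53;  a26 += p[i+26] * 55;
        a27 += p[i+27] * 57;  a28 += p[i+28] * 59;  a29 += p[i+29] * 61;
        a30 += p[i+30] * 63;  a31 += p[i+31] * 65;  a32 += p[i+32] * 67;
        a33 += p[i+33] * 69;  a34 += p[i+34] * 71;  a35 += p[i+35] * 73;
    }

    return a0  ^ a1  ^ a2  ^ a3  ^ a4  ^ a5  ^ a6  ^ a7  ^ a8  ^
           a9  ^ a10 ^ a11 ^ a12 ^ a13 ^ a14 ^ a15 ^ a16 ^ a17 ^
           a18 ^ a19 ^ a20 ^ a21 ^ a22 ^ a23 ^ a24 ^ a25 ^ a26 ^
           a27 ^ a28 ^ a29 ^ a30 ^ a31 ^ a32 ^ a33 ^ a34 ^ a35;
}

Building the same source for the 32-register and 64-register variants
and comparing cycle counts would be one way to put a number on it.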
> Would be kind of a pain to test as 64 GPR programs couldn't run on a
> kernel built in 32 GPR mode, but TKRA-GL runs most of its backend in
> kernel-space (and is the main thing in my case that seems to benefit
> from 64 GPRs).
>
> But, technically, a 32 GPR kernel couldn't run RISC-V programs either.
>
> So, would likely need to switch GLQuake and similar over to baseline
> mode (and probably messing with "timedemo").
>
> Checking, as-is, timedemo results for "demo1" are "969 frames 150.5
> seconds 6.4 fps", but this is with my experimental FP8U HDR mode (would
> be faster with RGB555 LDR), at 50 MHz.
>
> GLQuake, LDR RGB555 mode: "969 frames 119.0 seconds 8.1 fps".
>
> But, yeah, both are with builds that use 64 GPRs.
>
> Software Quake:         "969 frames 147.4 seconds 6.6 fps"
> Software Quake (RV64G): "969 frames 157.3 seconds 6.2 fps"
>
> Not going to bother with GLQuake in RISC-V mode, would likely take a
> painfully long time.
>
> Well, decided to run this test anyways:
>   "969 frames 687.3 seconds 1.4 fps"
>
> IOW: TKRA-GL runs horribly bad in RV64G mode (and not much can be done
> to make it fast within the limits of RV64G). Though, this is with it
> running GL entirely in RV64 mode (it might fare better as a userland
> application where the GL backend is running in kernel space in BJX2
> mode).

========== REMAINDER OF ARTICLE TRUNCATED ==========