On 8/11/2024 9:33 AM, Anton Ertl wrote:
> Brett <ggtgp@yahoo.com> writes:
>> The lack of CPUs with 64 registers is what makes for a market; that 4%
>> that could benefit have no options to pick from.
> They had:
> SPARC: Ok, only 32 GPRs available at a time, but more in hardware
> through the Window mechanism.
> AMD29K: IIRC a 128-register stack and 64 additional registers.
> IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
> files to make good use of them.
> The additional registers obviously did not give these architectures a
> decisive advantage.
> When ARM designed A64, when the RISC-V people designed RISC-V, and
> when Intel designed APX, each of them had the opportunity to go for 64
> GPRs, but they decided not to. Apparently the benefits do not
> outweigh the disadvantages.
In my experience:
  For most normal code, the advantage of 64 GPRs is minimal;
  But, there is some code where it does have an advantage,
  mostly involving big loops with lots of variables.
Sometimes, it is preferable to be able to map functions entirely to registers, and 64 GPRs do increase the probability of being able to do so. Though, neither count achieves 100% of functions; and functions which already map entirely to GPRs with 32 will see no advantage from 64.
Well, and to some extent the compiler needs to be selective about which functions it allows to use all of the registers, since in some cases saving/restoring more registers in the prolog/epilog can cost more than the register spills it avoids.
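Something like the following cost comparison, roughly (a minimal sketch in C; the structure fields, weights, and names are made up for illustration, and this is not BGBCC's actual heuristic):

/* Only let a function use the extended register set if the estimated
   spill savings outweigh the extra callee-saved save/restore traffic
   in the prolog/epilog. */
#include <stdbool.h>

struct func_info {
    int est_spills_with_32;  /* estimated spill/fill ops if limited to 32 GPRs  */
    int est_spills_with_64;  /* estimated spill/fill ops if allowed 64 GPRs     */
    int extra_callee_saved;  /* extra callee-saved regs the wider set dirties   */
    int est_call_count;      /* how often the function is expected to be called */
    int est_body_weight;     /* rough execution weight of the function body     */
};

static bool allow_full_register_set(const struct func_info *fi)
{
    /* Each extra callee-saved register costs a save and a restore per call;
       spills cost memory ops weighted by how hot the body is. */
    long save_restore_cost = 2L * fi->extra_callee_saved * fi->est_call_count;
    long spill_savings =
        (long)(fi->est_spills_with_32 - fi->est_spills_with_64) * fi->est_body_weight;

    return spill_savings > save_restore_cost;
}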
But, I have noted that 32 GPRs can get clogged up pretty quickly when using them for FP-SIMD and similar (if working with 128-bit vectors as register pairs), or otherwise when working with 128-bit data as pairs.
Similarly, one can't fit a 4x4 matrix multiply entirely in 32 GPRs, but can in 64 GPRs. It takes 8 registers to hold a 4x4 Binary32 matrix (16 values, packed two per 64-bit register), and 16 registers to perform a matrix transpose, ...
Granted, arguably, doing a matrix multiply directly in registers using SIMD ops is a bit niche (the traditional option being to use scalar operations and fetch numbers from memory using "for()" loops, but this is slower). Most programs don't need fast MatMult, though.
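For reference, the "traditional option" is just the plain scalar version, something like the following (generic C, nothing BJX2-specific; doing it entirely in registers instead, at 8 GPRs per matrix plus transpose scratch per the above, is what overflows 32 GPRs):

/* Scalar 4x4 Binary32 matrix multiply, operands fetched from memory
   inside the loops rather than held in registers. */
void mat4_mul(const float a[16], const float b[16], float out[16])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += a[i*4 + k] * b[k*4 + j];
            out[i*4 + j] = s;
        }
    }
}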
Annoyingly, it has led to my ISA fragmenting into two variants:
  Baseline: Primarily 32 GPRs, 16/32/64/96-bit encodings;
    Supports R32..R63 for only a subset of the ISA in 32-bit ops
    (see the sketch after this list);
    Ops outside this subset need 64-bit encodings to reach these registers.
  XG2: Supports R32..R63 everywhere, but loses the 16-bit ops.
    By itself, it would be easier to decode than Baseline,
    as it drops a bunch of wonky edge cases.
    Though, some cases were dropped from Baseline when XG2 was added;
    "Op40x2" was dropped as it was hairy and became mostly moot.
Then, a common subset exists, known as Fix32, which can be decoded in both Baseline and XG2 Mode, but only has access to R0..R31.
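The Baseline rule amounts to roughly the following when picking an encoding size (a minimal sketch; the function and predicate names are made up, and the actual subset is whatever the ISA spec defines, not the stub here):

#include <stdbool.h>

/* Placeholder predicate: whether this opcode has a 32-bit form that can
   encode R32..R63 (stubbed out; the real subset is defined by the ISA). */
static bool op_has_high_reg_form(int opcode)
{
    (void)opcode;
    return false;
}

/* Returns the encoding size (in bits) needed for a 3-register op. */
static int pick_encoding_size(int opcode, int rd, int rs, int rt)
{
    bool uses_high = (rd >= 32) || (rs >= 32) || (rt >= 32);
    if (!uses_high)
        return 32;   /* plain 32-bit encoding, R0..R31 only             */
    if (op_has_high_reg_form(opcode))
        return 32;   /* subset that can still reach R32..R63 in 32 bits */
    return 64;       /* otherwise, needs a 64-bit encoding              */
}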
Well, and a 3rd sub-variant:
  XG2RV: Uses XG2's encodings but RISC-V's register space:
    R0..R31 are X0..X31;
    R32..R63 are F0..F31.
The arguable main use-case for XG2RV mode is for ASM blobs intended to be called natively from RISC-V mode; but...
It is debatable whether such an operating mode actually makes sense, and it might have made more sense to simply fake it in the ASM parser:
  ADD R24, R25, R26  //Uses BJX2 register numbering.
  ADD X14, X15, X16  //Uses RISC-V register remapping.
Likely as a sub-mode of either Baseline or XG2 Mode.
Since the register remapping scheme is known as part of the ISA spec, it could be done in the assembler.
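Say, something like this in the assembler's register parser (a minimal sketch; the helper name and the table contents are placeholders, the real X->R mapping being whatever the ISA spec defines):

#include <stdio.h>

/* Placeholder table: maps an X-number to the internal R-number.
   Identity values shown only for illustration; the actual mapping
   is the one from the ISA spec. */
static const int x_to_r[32] = {
     0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
    16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31
};

/* Returns the internal R-number for a register name, or -1 if unknown. */
static int parse_reg(const char *name)
{
    int n;
    if (sscanf(name, "R%d", &n) == 1 && n >= 0 && n <= 63)
        return n;           /* native BJX2 numbering, used as-is          */
    if (sscanf(name, "X%d", &n) == 1 && n >= 0 && n <= 31)
        return x_to_r[n];   /* RISC-V integer name -> remapped R-number   */
    if (sscanf(name, "F%d", &n) == 1 && n >= 0 && n <= 31)
        return 32 + n;      /* F0..F31 -> R32..R63, per the XG2RV layout  */
    return -1;              /* unknown register name                      */
}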
It is possible that XG2RV mode may eventually be dropped due to "lack of relevance".
Well, and similarly any ABI thunks would need to be done in Baseline or XG2 mode, since neither RV mode nor XG2RV mode has access to all the registers used for argument passing in BJX2.
In this case, RISC-V mode only has ~26 usable GPRs (the remaining 6, X0..X5, being SPRs or CRs); in the RV modes, R0/R4/R5/R14 are inaccessible.
Well, and likewise one wants to limit the number of inter-ISA branches, as the branch predictor can't predict these, and they need a full pipeline flush (a few extra cycles are needed to make sure the L1 I$ is fetching in the correct mode). Technically, the L1 I$ also needs to flush any cache lines which were fetched in a different mode (the I$ uses internal tag bits to figure out things like instruction length and bundling, and to try to help with superscalar in RV mode, *; mostly for timing/latency reasons, ...).
*: The way the BJX2 core deals with superscalar in RV mode is essentially to pretend as-if RV64 had WEX flag bits, which can be synthesized partly when fetching cache lines (putting some of the latency in the I$ miss handling, rather than during instruction fetch). In the ID stage, it sees the longer PC step and infers that two instructions are being decoded as superscalar.
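Roughly, the mode-tag behavior amounts to something like this (a behavioral sketch in C, not the actual core logic; the type and field names are made up):

#include <stdint.h>
#include <stdbool.h>

enum isa_mode { MODE_BASELINE, MODE_XG2, MODE_RV64 };

struct icache_line {
    uint64_t      tag;    /* address tag                                     */
    bool          valid;
    enum isa_mode mode;   /* mode the line's predecode/WEX-style bits assume */
    /* ... predecoded length/bundle bits would live here ...                 */
};

/* Returns true on a usable hit; a mode mismatch is treated as a miss,
   so the line gets refetched and re-tagged in the requested mode. */
static bool icache_lookup(struct icache_line *line, uint64_t addr_tag,
                          enum isa_mode fetch_mode)
{
    if (!line->valid || line->tag != addr_tag)
        return false;          /* ordinary miss                             */
    if (line->mode != fetch_mode) {
        line->valid = false;   /* flush: predecode bits are for wrong mode  */
        return false;
    }
    return true;
}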
...
> Where is your 4% number coming from?
I guess it could make sense, arguably, to come up with test cases to get a quantitative measurement of the effect of 64 GPRs for programs which can make effective use of them...
Would be kind of a pain to test, as 64-GPR programs couldn't run on a kernel built in 32-GPR mode, but TKRA-GL runs most of its backend in kernel space (and is the main thing in my case that seems to benefit from 64 GPRs).
But, technically, a 32-GPR kernel couldn't run RISC-V programs either.
So, would likely need to switch GLQuake and similar over to Baseline mode (and probably mess with "timedemo").
Checking: as-is, timedemo results for "demo1" are "969 frames 150.5 seconds 6.4 fps", but this is with my experimental FP8U HDR mode (would be faster with RGB555 LDR), at 50 MHz.
GLQuake, LDR RGB555 mode: "969 frames 119.0 seconds 8.1 fps".
But, yeah, both are with builds that use 64 GPRs.
Software Quake: "969 frames 147.4 seconds 6.6 fps"
Software Quake (RV64G): "969 frames 157.3 seconds 6.2 fps"
Not going to bother with GLQuake in RISC-V mode; it would likely take a painfully long time.
Well, decided to run this test anyways:
"969 frames 687.3 seconds 1.4 fps"
IOW: TKRA-GL runs horribly in RV64G mode (and not much can be done to make it fast within the limits of RV64G). Though, this is with it running GL entirely in RV64 mode (it might fare better as a userland application where the GL backend is running in kernel space in BJX2 mode).
Though, much of this is likely due more to RV64G's lack of SIMD and similar than to having fewer GPRs.
...
> - anton