On 8/11/2024 9:33 AM, Anton Ertl wrote:
> Brett <ggtgp@yahoo.com> writes:
>> The lack of CPUs with 64 registers is what makes for a market; that 4%
>> that could benefit have no options to pick from.
>
> They had:
>
> SPARC: OK, only 32 GPRs available at a time, but more in hardware
> through the Window mechanism.
>
> AMD29K: IIRC a 128-register stack and 64 additional registers.
>
> IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
> files to make good use of them.
>
> The additional registers obviously did not give these architectures a
> decisive advantage.
>
> When ARM designed A64, when the RISC-V people designed RISC-V, and
> when Intel designed APX, each of them had the opportunity to go for 64
> GPRs, but they decided not to. Apparently the benefits do not
> outweigh the disadvantages.
In my experience:
For most normal code, the advantage of 64 GPRs is minimal;
But there is some code where it does have an advantage,
mostly involving big loops with lots of variables.
Sometimes, it is preferable to be able to map functions entirely to
registers, and 64 does increase the probability of being able to do so
(though neither 32 nor 64 GPRs achieves this for 100% of functions; and
functions which already map entirely to GPRs with 32 will see no
advantage from 64).
Well, and to some extent the compiler needs to be selective about which
functions it allows to use all of the registers, since in some cases
saving/restoring more registers in the prolog/epilog can cost more than
the register spills it avoids.
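The tradeoff above can be illustrated with a toy heuristic in C; the cost weights here are purely illustrative placeholders, not the actual compiler's cost model:

```c
#include <assert.h>

/* Toy version of the heuristic described above: letting a function use
   extra callee-saved registers only pays off if the spills it avoids
   outweigh the added prolog/epilog save/restore traffic.
   Cost weights are illustrative, not from any real compiler. */
static int should_use_extra_regs(int spills_avoided, int extra_saved_regs)
{
    int spill_cost = spills_avoided  * 2;  /* each spill ~ store + reload */
    int save_cost  = extra_saved_regs * 2; /* prolog save + epilog restore */
    return spill_cost > save_cost;
}
```

So a function that avoids many spills is allowed the extra registers, while one that barely uses them is not.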
But, I have noted that 32 GPRs can get clogged up pretty quickly when
using them for FP-SIMD and similar (if working with 128-bit vectors as
register pairs); or otherwise when working with 128-bit data as pairs.
Similarly, one can't fit a 4x4 matrix multiply entirely in 32 GPRs, but
can in 64 GPRs. It takes 8 registers to hold a 4x4 Binary32 matrix (two
floats per 64-bit GPR), and 16 registers to perform a matrix transpose, ...
Granted, arguably, doing a matrix multiply directly in registers using
SIMD ops is a bit niche (the traditional option being to use scalar
operations and fetch numbers from memory using "for()" loops, but this
is slower). Most programs don't need fast MatMul, though.
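The traditional scalar fallback mentioned above looks roughly like the following C; a minimal sketch with a hypothetical helper name, not TKRA-GL's actual code:

```c
#include <assert.h>

/* Plain scalar 4x4 * 4x4 multiply over float (Binary32), row-major.
   This is the slower memory-based "for()" loop approach the post
   contrasts with doing the whole multiply in registers via SIMD. */
static void mat4_mul(const float a[16], const float b[16], float c[16])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float s = 0.0f;
            for (int k = 0; k < 4; k++)
                s += a[i * 4 + k] * b[k * 4 + j];
            c[i * 4 + j] = s;
        }
    }
}
```

The register-budget point follows from the data size: three such matrices (two sources plus destination) are 24 GPRs as 64-bit pairs, which together with transpose scratch overflows 32 registers but fits in 64.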
Annoyingly, it has led to my ISA fragmenting into two variants:
Baseline: Primarily 32 GPR, 16/32/64/96 encoding;
Supports R32..R63 for only a subset of the ISA for 32-bit ops;
ops outside this subset need 64-bit encodings.
XG2: Supports R32..R63 everywhere, but loses 16-bit ops.
By itself, would be easier to decode than Baseline,
as it drops a bunch of wonky edge cases.
Though, some cases were dropped from Baseline when XG2 was added.
"Op40x2" was dropped as it was hair and became mostly moot.
Then, a common subset exists known as Fix32, which can be decoded in
both Baseline and XG2 Mode, but only has access to R0..R31.
Well, and a 3rd sub-variant:
XG2RV: Uses XG2's encodings but RISC-V's register space.
R0..R31 are X0..X31;
R32..R63 are F0..F31.
Arguably, the main use-case for XG2RV mode is for ASM blobs intended to
be called natively from RISC-V mode; but...
It is debatable whether such an operating mode actually makes sense, and
it might have made more sense to simply fake it in the ASM parser:
ADD R24, R25, R26 //Uses BJX2 register numbering.
ADD X14, X15, X16 //Uses RISC-V register remapping.
Likely, as a sub-mode of either Baseline or XG2 Mode.
Since, the register remapping scheme is known as part of the ISA spec,
it could be done in the assembler.
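Since the mapping is fixed by the spec, the remap could live entirely in the assembler's register parser. A minimal sketch in C, assuming only the stated mapping (R0..R31 = X0..X31, R32..R63 = F0..F31); the function name and interface are hypothetical:

```c
#include <assert.h>

/* Sketch of assembler-side remapping for the idea above: accept RISC-V
   register names in the parser and translate them to BJX2 numbering.
   Per the post: X0..X31 map to R0..R31, F0..F31 map to R32..R63.
   The helper name and error convention are illustrative assumptions. */
static int rv_to_bjx2(char cls, int n)   /* cls: 'X' or 'F', n: 0..31 */
{
    if (n < 0 || n > 31)
        return -1;                       /* not a valid RV register    */
    if (cls == 'X')
        return n;                        /* X0..X31 -> R0..R31         */
    if (cls == 'F')
        return 32 + n;                   /* F0..F31 -> R32..R63        */
    return -1;
}
```

With this, "ADD X14, X15, X16" assembles to the same encoding as "ADD R14, R15, R16", with no separate operating mode needed.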
It is possible that XG2RV mode may eventually be dropped due to "lack of
relevance".
Well, and similarly any ABI thunks would need to be done in Baseline or
XG2 mode, since neither RV mode nor XG2RV Mode has access to all the
registers used for argument passing in BJX2.
In this case, RISC-V mode only has ~ 26 GPRs (the remaining 6, X0..X5,
being SPRs or CRs). In the RV modes R0/R4/R5/R14 are inaccessible.
Well, and likewise one wants to limit the number of inter-ISA branches,
as the branch-predictor can't predict these, and they need a full
pipeline flush (a few extra cycles are needed to make sure the L1 I$ is
fetching in the correct mode). Technically, the L1 I$ also needs to flush
any cache lines which were fetched in a different mode (the I$ uses
internal tag bits to figure out things like instruction length and
bundling, and to try to help with superscalar in RV mode, *; mostly for
timing/latency reasons, ...).
*: The way the BJX2 core deals with superscalar here is essentially to
pretend as-if RV64 had WEX flag bits, which can be synthesized partly
when fetching cache lines (putting some of the latency in the I$ miss
handling, rather than during instruction fetch). In the ID stage, it
sees the longer PC step and infers that two instructions are being
decoded as superscalar.
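The ID-stage inference above can be sketched as follows; a hypothetical model assuming 32-bit RV instructions, where a paired fetch advances PC by 8 (names and widths are illustrative, not from the actual core):

```c
#include <assert.h>

/* Sketch of the ID-stage inference described above: if fetch advances
   PC by 8 in RV mode, two 32-bit instructions were bundled (the
   synthesized WEX-style pairing); a step of 4 means a lone instruction.
   This models only the decision, not the tag-bit synthesis itself. */
static int decode_bundle_width(unsigned pc_step)
{
    return (pc_step == 8) ? 2 : 1;  /* 2 = superscalar pair, 1 = scalar */
}
```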
...
Where is your 4% number coming from?
I guess it could make sense, arguably, to try to come up with test cases
to try to get a quantitative measurement of the effect of 64 GPRs for
programs which can make effective use of them...
Would be kind of a pain to test, as 64-GPR programs couldn't run on a
kernel built in 32-GPR mode; but TKRA-GL runs most of its backend in
kernel space (and is the main thing in my case that seems to benefit
from 64 GPRs).
But, technically, a 32 GPR kernel couldn't run RISC-V programs either.
So, would likely need to switch GLQuake and similar over to baseline
mode (and probably messing with "timedemo").
Checking, as-is, timedemo results for "demo1" are "969 frames 150.5
seconds 6.4 fps", but this is with my experimental FP8U HDR mode (would
be faster with RGB555 LDR), at 50 MHz.
GLQuake, LDR RGB555 mode: "969 frames 119.0 seconds 8.1 fps".
But, yeah, both are with builds that use 64 GPRs.
Software Quake: "969 frames 147.4 seconds 6.6 fps"
Software Quake (RV64G): "969 frames 157.3 seconds 6.2 fps"
Not going to bother with GLQuake in RISC-V mode, would likely take a
painfully long time.
Well, decided to run this test anyways:
"969 frames 687.3 seconds 1.4 fps"
IOW: TKRA-GL runs horribly bad in RV64G mode (and not much can be done
to make it fast within the limits of RV64G). Though, this is with it
running GL entirely in RV64 mode (it might fare better as a userland
application where the GL backend is running in kernel space in BJX2 mode).
Though, much of this is likely due more to RV64G's lack of SIMD and
similar, rather than due to having fewer GPRs.