On 10/9/2024 11:19 AM, MitchAlsup1 wrote:
> On Wed, 9 Oct 2024 10:44:08 +0000, Robert Finch wrote:
>>
>> Been thinking some about the carry and overflow and what to do about
>> register spills and reloads during expression processing. My thought was
>> that on the machine with 256 registers, simply allocate a ridiculous
>> number of registers for expression processing, for example 25 or even
>> 50. Then if the expression is too complex, have the compiler spit out an
>> error message to the programmer to simplify the expression. Remnants of
>> the ‘expression too complex’ error in BASIC.
> Both completely unacceptable, and in your case completely unnecessary.
> In 967 subroutines I read out of My 66000 LLVM compile, I only have
> 3 cases of spill-fill, and that is with only 32 registers with
> universal constants.
Tends to be a bit higher IME, but granted my compiler is a bit more naive:
Either it can static-assign everything;
Or, it needs to use spill-and-fill.
In RISC-V mode:
  Static-assign everything, Leaf: 13%
  Partial assign, Leaf: 7.1%
  Static-assign everything, Non-Leaf: 1.8%
  Partial assign, Non-Leaf: 85%
  Average: ~4.6 variables static-assigned, out of 16.6 variables per function.
In XG2 mode:
  Static-assign everything, Leaf: 16%
  Partial assign, Leaf: 0.7%
  Static-assign everything, Non-Leaf: 1.9%
  Partial assign, Non-Leaf: 82%
  Average: ~4.8 variables static-assigned, out of 16.8 variables per function.
Theoretically, the number of static-assigned variables and fully static-assigned functions could be higher, but it looks like the compiler is excluding a lot of them for some reason (may need to look into it).
> Of the RISC-V code I read alongside with 32+32 registers, I counted 8.
With 64 GPRs there can be less spill/fill, without any increase in the total number of hardware registers vs RV64G's 32+32 scheme.
Rarely is register pressure equally balanced in this way, and more often it is one of:
High integer register pressure, little or no FP pressure (most code);
Very high FP register pressure, low integer pressure (say, unrolled matrix multiply).
Where, an even-split X/F scheme serves neither, and a bigger unified register space serves both.
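As a rough illustration of the FP-heavy case (a made-up kernel, not output from either compiler): a 4x4-blocked matrix-multiply inner loop keeps 16 FP accumulators plus a row/column of loaded operands live at once, while the integer side only needs a few pointers and the loop counter.

  /* Rough illustration only: computes the top-left 4x4 block of C=A*B
     for n x n row-major matrices.  The 16 accumulators plus 8 loaded
     operands are all live FP values; integer pressure is just the
     three pointers, n, and k.  */
  void mm_block4x4(const double *a, const double *b, double *c, int n)
  {
      double c00=0, c01=0, c02=0, c03=0, c10=0, c11=0, c12=0, c13=0;
      double c20=0, c21=0, c22=0, c23=0, c30=0, c31=0, c32=0, c33=0;
      for (int k = 0; k < n; k++) {
          double a0=a[0*n+k], a1=a[1*n+k], a2=a[2*n+k], a3=a[3*n+k];
          double b0=b[k*n+0], b1=b[k*n+1], b2=b[k*n+2], b3=b[k*n+3];
          c00+=a0*b0; c01+=a0*b1; c02+=a0*b2; c03+=a0*b3;
          c10+=a1*b0; c11+=a1*b1; c12+=a1*b2; c13+=a1*b3;
          c20+=a2*b0; c21+=a2*b1; c22+=a2*b2; c23+=a2*b3;
          c30+=a3*b0; c31+=a3*b1; c32+=a3*b2; c33+=a3*b3;
      }
      c[0*n+0]=c00; c[0*n+1]=c01; c[0*n+2]=c02; c[0*n+3]=c03;
      c[1*n+0]=c10; c[1*n+1]=c11; c[1*n+2]=c12; c[1*n+3]=c13;
      c[2*n+0]=c20; c[2*n+1]=c21; c[2*n+2]=c22; c[2*n+3]=c23;
      c[3*n+0]=c30; c[3*n+1]=c31; c[3*n+2]=c32; c[3*n+3]=c33;
  }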
Though, I guess the usual argument for split GPR/FPR spaces is that with unified register spaces, both ALU and FPU need to use the same pipeline.
But, if it is a shared register pipeline, one can also leverage the ALU for a lot of edge cases, like FPU compare.
If one uses a longer pipeline for FPU ops vs ALU ops, it seems like one will still need to pay the cost of the longer FPU pipeline regardless of whether there is a single register file or separate ones.
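For example, ignoring NaNs, IEEE-754 values order the same way as their bit patterns once the sign is folded in, so an FP compare can be done entirely in the integer datapath. A minimal C sketch of the idea (function names made up):

  #include <stdint.h>
  #include <string.h>

  /* Map a double's bits to a uint64_t whose unsigned ordering matches
     the floating-point ordering (NaNs excluded; -0.0 keys just below
     +0.0).  Negative values: flip all bits; non-negative: set the
     sign bit.  */
  uint64_t fp64_order_key(double d)
  {
      uint64_t u;
      memcpy(&u, &d, sizeof u);
      return (u >> 63) ? ~u : (u | 0x8000000000000000ULL);
  }

  /* "FPU compare" done with nothing but integer compares. */
  int fp64_lt(double a, double b)
  {
      return fp64_order_key(a) < fp64_order_key(b);
  }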
Apparently, similar reasoning applies to the V extension using separate vector registers (vs just aliasing them with the F registers), but I don't really want to implement the V extension.
Almost more tempting to do a cut-down, non-conforming "V in F" style implementation (a rough register-mapping sketch follows the list):
* Aliases V to F register pairs;
** TBD if better to use V0..V15 or even-only numbering.
** Or, V0..V31 exist (if aliased) for 64b vectors,
** but only even for 128b.
* Will drop mask bits and other more advanced features.
* Trying to set up V properly would result in the instructions faulting.
** Could allow the possibility of adding proper V later.
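A toy decode helper sketching the even-pair variant of the aliasing being considered (names and rules here are an assumption on my part, not any spec):

  #include <stdbool.h>

  /* V0..V31 alias F0..F31 directly for 64-bit vectors; 128-bit
     vectors accept only even Vn and occupy the pair F(n):F(n+1),
     while odd Vn would fault.  */
  typedef struct { int f_lo, f_hi; bool valid; } vf_alias;

  vf_alias map_vreg(int vn, bool is_128b)
  {
      vf_alias a = { vn, vn, true };
      if (is_128b) {
          if (vn & 1)
              a.valid = false;   /* odd Vn has no pair partner: fault */
          else
              a.f_hi = vn + 1;   /* e.g. V2 -> F2:F3 */
      }
      return a;
  }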
> With those statistics and 256 registers, if you can't get to essentially
> 0 spill-fill, the problem is not with your architecture but with your
> compiler.
With 256 registers, probably 99% of functions could use a "statically assign every variable to a register" strategy (though this assumes registers can be reused for temporary values).
Where, most temporary values are created and used within a single basic block; if no references to a given temporary exist outside that block (and it is not marked with a phi operator), its value can simply be assumed to disappear at the end of the block. This also allows such temporaries to be allocated into scratch registers.
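A toy sketch of that rule (data structures invented for illustration, not from my actual compiler): temporaries that are neither live-out nor phi-merged take scratch registers, and the scratch pool is treated as free again at the start of the next block.

  #include <stdbool.h>

  #define NUM_SCRATCH 8   /* size of the scratch pool (made up) */

  typedef struct {
      int  id;
      bool live_out;   /* value referenced from a later block? */
      bool is_phi;     /* merged at a join point? */
      int  reg;        /* assigned register, -1 = needs static assign */
  } Temp;

  /* Assign registers for one basic block: block-local temporaries die
     at the end of the block, so they can share the scratch registers;
     everything else falls back to a statically assigned register.  */
  void alloc_block_temps(Temp *temps, int n)
  {
      int next_scratch = 0;   /* pool resets for each block */
      for (int i = 0; i < n; i++) {
          if (!temps[i].live_out && !temps[i].is_phi &&
              next_scratch < NUM_SCRATCH)
              temps[i].reg = next_scratch++;
          else
              temps[i].reg = -1;
      }
  }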
My own thought though is that going much bigger in terms of the main register file likely isn't worth it.
The only really compelling use for a bigger register file (much over 64) at the moment would be optimizing interrupts and context switches.