On 2/1/2025 7:22 PM, MitchAlsup1 wrote:
On Sat, 1 Feb 2025 22:42:39 +0000, BGB wrote:
On 1/31/2025 10:05 PM, MitchAlsup1 wrote:
--------------------------------
Whereas, if performance is dominated by a piece of code that looks like,
say:
v0=dytf_int2fixnum(123);
v1=dytf_int2fixnum(456);
v2=dytf_mul(v0, v1);
v3=dytf_int2fixnum(789);
v4=dytf_add(v2, v3);
v5=dytf_wrapsymbol("x");
dytf_storeindex(obj, v5, v4);
...
With, say, N levels of call-graph depth within each called function, but with this
sort of code still managing to dominate total CPU time (the "Self%" column in the profiler).
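Purely as a hypothetical illustration (the dytf_* internals aren't shown here, and this is not the actual implementation), such a wrapper may be little more than a tag operation with a fallback call, so nearly all of its cost is in the calling convention rather than in the work itself:

/* Hypothetical sketch only: assumes a tagged 64-bit value type and a
   separate (not shown) boxed-integer slow path; all names are made up. */
typedef unsigned long long dytf_value;

dytf_value dytf_wrapboxint(long long v);        /* assumed slow path */

dytf_value dytf_int2fixnum(long long v)
{
    if (v >= -(1LL << 59) && v < (1LL << 59))   /* fits in a fixnum? */
        return ((dytf_value)v << 4) | 1;        /* tag in low bits (assumed) */
    return dytf_wrapboxint(v);                  /* deeper call-graph level */
}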
This seems to be a situation where callee-save registers are a big win
for performance IME.
With callee save registers, the prologue and epilogue of subroutines
see all the save/restore memory traffic; sometimes saving a register
that is not "in use" and restoring it later.
The compiler keeps track of which callee-save registers it actually uses in a function, and only saves/restores those.
So, say:
void foo(VM *ctx)
{
    int i, n;
    n = ctx->numBar;
    for (i = 0; i < n; i++)
        bar(ctx, i);
}
The prologue/epilogue only really needs to save 4 registers or so (say: i, ctx, n, and LR).
If there are 31 callee-save registers, we save and restore 4 of them, NOT all 31 of them.
Saving these 4 registers in the prologue, and restoring them in the epilogue, is (most likely) going to be cheaper than a spill/reload pair for 'i' plus a reload of 'ctx' and 'n' on every trip around the loop (assuming in this case that 'ctx->numBar > 2').
Say, in an RV example, callee save:
  LI    X18, 0          # i = 0
  LW    X20, 0(X19)     # n = ctx->numBar   (X19 holds ctx)
.L0:
  MV    X10, X19        # arg0 = ctx
  MV    X11, X18        # arg1 = i
  JAL   X1, bar
  ADDI  X18, X18, 1     # i++
  BLT   X18, X20, .L0   # loop while i < n
Or, scratch:
  LI    X15, 0          # i = 0
  LW    X13, 0(X14)     # n = ctx->numBar   (X14 holds ctx)
  SW    X14, 16(SP)     # spill ctx across the calls
  SW    X13, 8(SP)      # spill n across the calls
.L0:
  SW    X15, 0(SP)      # spill i across the call
  LW    X14, 16(SP)     # reload ctx (clobbered by the previous call)
  MV    X10, X14        # arg0 = ctx
  MV    X11, X15        # arg1 = i
  JAL   X1, bar
  LW    X15, 0(SP)      # reload i
  LW    X13, 8(SP)      # reload n
  ADDI  X15, X15, 1     # i++
  BLT   X15, X13, .L0   # loop while i < n
One can place bets on which of these is going to be more efficient for semi-large 'n'...
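To put very rough numbers on that bet, counting only the save/restore memory traffic visible in (or implied by) the two snippets, and ignoring whatever bar() itself does:
  Callee-save: ~4 saves in the prologue and ~4 restores in the epilogue (i, ctx, n, LR), so ~8 memory operations total, independent of n.
  Scratch: 2 stores before the loop, plus 1 store and 3 reloads on every iteration, so roughly 4*n + 2 memory operations, on top of the LR save/restore that both versions need.
For semi-large n, the scratch version thus generates on the order of n times as much save/restore traffic.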
Granted, scratch registers may in general be better for temporaries, since temporaries usually have very short lifetimes and rarely need their values preserved across basic blocks.
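A small C illustration of the distinction (helper() here is just a hypothetical callee): 't' below is dead before the call, so a scratch register suits it, while 'i' and 'acc' are live across the call and would need to be spilled around it if kept in scratch registers:

extern int helper(int x);       /* hypothetical callee */

int sum_scaled(int *arr, int n)
{
    int i, acc = 0;             /* live across the call: callee-save candidates */
    for (i = 0; i < n; i++) {
        int t = arr[i] * 3;     /* temporary, dead before the call: scratch is fine */
        acc += helper(t);
    }
    return acc;
}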
With caller save registers, the caller saves exactly the registers
it needs preserved, while the callee saves/restores none. Moreover
it only saves registers currently "in use" and may defer restoring
since it does not need that value in that register for a while.
It needs to store any scratch register it uses that still holds a live variable, as soon as ANY function is called.
If one uses solely scratch registers, then function-call-heavy code may see an excessive amount of spill-and-reload activity (this is part of what I suspect hurts GCC's performance with this sort of code).
Performance-wise, this only really makes sense IMO if one assumes (as GCC seems to assume) that function calls are infrequent in hot-path code.
Though, I guess it is possible that a compiler heuristic could be put in place to try to classify functions based on their function-call density and switch between strategies based on this (say, using different register allocator behavior for call-light versus call-heavy functions).
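Purely as a sketch of the idea (hypothetical names and threshold; neither GCC's nor BGBCC's actual logic), such a heuristic might look something like:

/* Classify a function by call density and pick a register-allocation
   bias from that. Threshold is a tunable guess. */
typedef struct {
    int num_insns;      /* instructions in the function body */
    int num_calls;      /* call sites in the function body   */
} FuncStats;

typedef enum {
    REGALLOC_PREFER_SCRATCH,     /* call-light: lean on caller-save registers */
    REGALLOC_PREFER_CALLEE_SAVE  /* call-heavy: lean on callee-save registers */
} RegAllocStrategy;

RegAllocStrategy pick_strategy(const FuncStats *st)
{
    /* If more than ~1 in 16 instructions is a call, spilling scratch
       registers around every call site likely costs more than a
       one-time prologue/epilogue save. */
    if (st->num_insns > 0 && (st->num_calls * 16) > st->num_insns)
        return REGALLOC_PREFER_CALLEE_SAVE;
    return REGALLOC_PREFER_SCRATCH;
}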
So, the instruction path length has a better story in caller saves
than callee saves. Nothing that was "Not live" is ever saved or
restored.
Only relevant if there is a high probability that much of the code in a function won't be reached, and if one assumes that different variables will not be assigned to the same registers in different parts of the function.
But, relatedly, I had noted before that "always use as many registers as possible" is not the most efficient strategy.
In BGBCC, the callee-save registers are partly divided into 4 major groups (with XGPR):
  R8 ..R14  (A)
  R24..R31  (B)
  R40..R47  (C)
  R56..R63  (D)
Where:
  Group A is always enabled;
  Group B is enabled after a low threshold in XG1,
    always enabled in XG2;
  Group C is enabled after a high threshold in XG2;
    XG1 + XGPR: similar, but the threshold is higher;
    XG1, no XGPR: N/A;
  Group D is enabled after an extra-high threshold in XG2;
    XG1 + XGPR: similar, but the threshold is higher;
    XG1, no XGPR: N/A.
This is mostly N/A for XG3, as it uses the RV ABI, but:
  X8/X9, X18..X27:
    Enabled for RV modes and XG3.
  F8/F9, F18..F27:
    RV mode:
      Disabled for normal registers;
      possible selective affinity for FP values.
    XG3:
      Enabled if a medium threshold is crossed.
Enabling a group does not mean all of it will be saved; rather, it allows variables to be allocated into registers from that group (as opposed to forcing spill and reload within the set of already-allocated registers).
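As an illustration only (not BGBCC's actual code; the thresholds and mode flags here are placeholders), the group-enabling rule described above might be sketched along these lines:

/* Which callee-save register groups the allocator may draw from, based
   on how many values want registers ("demand") and the ISA mode.
   Threshold values are placeholders, not BGBCC's real numbers. */
enum { GROUP_A = 1, GROUP_B = 2, GROUP_C = 4, GROUP_D = 8 };

int enabled_groups(int demand, int is_xg2, int has_xgpr)
{
    int g = GROUP_A;                          /* Group A: always enabled */
    int lo = 4, hi = 12, xhi = 20;            /* placeholder thresholds  */

    if (is_xg2 || demand > lo)                /* Group B: low threshold in XG1, */
        g |= GROUP_B;                         /*   always enabled in XG2        */

    if (is_xg2 || has_xgpr) {                 /* Groups C/D: need XG2 or XG1+XGPR */
        int c_thr = is_xg2 ? hi  : hi  + 4;   /* XG1+XGPR: higher threshold */
        int d_thr = is_xg2 ? xhi : xhi + 4;
        if (demand > c_thr) g |= GROUP_C;
        if (demand > d_thr) g |= GROUP_D;
    }
    return g;                                 /* XG3/RV modes use the RV ABI instead */
}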
Beyond the statically-assigned variables, a new register will be reserved if:
  The variable isn't already in a register;
  There aren't any free spots within the currently reserved registers;
  None of the prior spots has gone out of its lifetime scope;
  There are more registers it is still allowed to allocate.
Otherwise, if this situation comes up, it will evict a prior value (roughly as sketched below).
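A rough sketch of that decision in C (illustrative only; the names and data structure are assumptions, not BGBCC's actual code):

/* Reuse an existing assignment, take a free or dead slot, reserve a new
   register if still allowed, otherwise evict a prior value. */
typedef struct { int var; int live; } RegSlot;

int assign_register(RegSlot *slots, int *n_reserved, int n_allowed, int var)
{
    int i;
    for (i = 0; i < *n_reserved; i++)       /* already in a register? */
        if (slots[i].live && slots[i].var == var)
            return i;
    for (i = 0; i < *n_reserved; i++)       /* a slot whose value has died? */
        if (!slots[i].live) {
            slots[i].var = var; slots[i].live = 1;
            return i;
        }
    if (*n_reserved < n_allowed) {          /* allowed to reserve another register? */
        i = (*n_reserved)++;
        slots[i].var = var; slots[i].live = 1;
        return i;
    }
    i = 0;                                  /* else: evict a prior value      */
    slots[i].var = var; slots[i].live = 1;  /* (eviction policy not modeled)  */
    return i;
}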
The arguments for callee save have to do with I cache footprint.
Maybe, but it affects performance as well.
As noted, callee-save-dominant (or exclusive) allocation does seem, in my experience, to be the more efficient strategy for call-intensive code.