On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
BGB wrote:
Seemingly:
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for performance.
Where, 16 GPRs isn't really enough (lots of register spills), and 128 GPRs is wasteful (would likely need lots of monster functions with 250+ local variables to make effective use of this, *, which probably isn't going to happen).
16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part of GPRs AND you have good access to constants.
On the main ISAs I had tried to generate code for, 16 GPRs was kind of a pain as it resulted in fairly high spill rates.
Though, it would probably be less bad if the compiler was able to use all of the registers at the same time without stepping on itself (such as dealing with register allocation involving scratch registers while also not conflicting with the use of function arguments, ...).
My code generators had typically only used callee save registers for variables in basic blocks which ended in a function call (in my compiler design, both function calls and branches terminate the current basic-block).
On SH, the main way of getting constants (larger than 8 bits) was via PC-relative memory loads, which kinda sucked.
This is slightly less bad on x86-64, since one can use memory operands with most instructions, and the CPU tends to deal fairly well with code that has lots of spill-and-fill. This along with instructions having access to 32-bit immediate values.
Yes, x86 and any architecture (IBM 360, S.E.L., Interdata, ...) that have
The vast majority of leaf functions use fewer than 16 GPRs, given one has a SP not part of GPRs {including arguments and return values}. Once one starts placing things like memmove(), memset(), sin(), cos(), exp(), log() in the ISA, it goes up even more.
Yeah.
Things like memcpy/memmove/memset/etc, are function calls in cases when not directly transformed into register load/store sequences.
My 66000 does not convert them into LD-ST sequences, MM is a single inst-
Did end up with an intermediate "memcpy slide", which can handle medium size memcpy and memset style operations by branching into a slide.
MMs and MSs that do not cross page boundaries are ATOMIC. The entire system
As noted, on a 32 GPR machine, most leaf functions can fit entirely in scratch registers.
Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without getting
On a 64 GPR machine, this percentage is slightly higher (but, not significantly, since there are few leaf functions remaining at this point).
If one had a 16 GPR machine with 6 usable scratch registers, it is a little harder though (as typically these need to cover both any variables used by the function, and any temporaries used, ...). There are a whole lot more leaf functions that exceed a limit of 6 than of 14.
The data back in the R2000-3000 days indicated that 32 GPRs has a 15%+
But, say, a 32 GPR machine could still do well here.
Note that there are reasons why I don't claim 64 GPRs as a large performance advantage:
On programs like Doom, the difference is small at best.
It mostly affects things like GLQuake in my case, mostly because TKRA-GL has a lot of functions with large numbers of local variables (some exceeding 100 local variables).
Partly though this is due to code that is highly inlined and unrolled and uses lots of variables tending to perform better in my case (and tightly looping code, with lots of small functions, not so much...).
Where, function categories:
Tiny Leaf:
  Everything fits in scratch registers, no stack frame, no calls.
Leaf:
  No function calls (either explicit or implicit);
  Will have a stack frame.
Non-Leaf:
  May call functions, has a stack frame.
You are forgetting about FP, GOT, TLS, and whatever resources are required to do try-throw-catch stuff as demanded by the source language.
Yeah, possibly true.
In my case:
Can't do PASCAL and other ALGOL derived languages with block structure.
There is no frame pointer, as BGBCC doesn't use one;
All stack-frames are fixed size, VLA's and alloca use the heap;
longjmp() is at a serious disadvantage here. Destructors are sometimes hard to position on the stack.
GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
TLS, accessed via TBR.
Try/throw/catch:
Mostly N/A for leaf functions.
Any function that can "throw", is in effect no longer a leaf function.
You do realize that there is a set of #define-s that can implement try-throw-catch without requiring any subroutines ?!?
Implicitly, any function which uses "variant" or similar is also, no longer a leaf function.
Need for GBR save/restore effectively excludes a function from being tiny-leaf. This may happen, say, if a function accesses global variables and may be called as a function pointer.
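A rough sketch of the classification rules above, in C. The struct and its field names are made up for illustration and are not BGBCC's actual data structures; the scratch register count is passed in so the 6 vs 14 cases mentioned earlier can be compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-function summary (illustrative names, not BGBCC's). */
typedef struct {
    int  live_values;     /* variables plus temporaries needing a register */
    bool has_calls;       /* explicit or implicit (runtime helper) calls */
    bool may_throw;       /* anything that can "throw" */
    bool uses_variant;    /* "variant" or similar implies helper calls */
    bool needs_gbr_save;  /* e.g. touches globals, callable via pointer */
} FuncInfo;

typedef enum { FUNC_TINY_LEAF, FUNC_LEAF, FUNC_NON_LEAF } FuncClass;

FuncClass classify_function(const FuncInfo *f, int scratch_regs)
{
    assert(f != NULL);

    /* Calls, throwing, or variant use all rule out leaf status. */
    if (f->has_calls || f->may_throw || f->uses_variant)
        return FUNC_NON_LEAF;

    /* A leaf that must save GBR, or that does not fit in the scratch
       registers, needs a stack frame, so it is not tiny-leaf. */
    if (f->needs_gbr_save || f->live_values > scratch_regs)
        return FUNC_LEAF;

    return FUNC_TINY_LEAF;
}
```

With this, a call-free function with 10 live values comes out as a plain leaf when given 6 scratch registers, but as a tiny leaf when given 14.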
One "TODO" here would be to merge constants with the same "actual" value into the same register. At present, they will be duplicated if the types are sufficiently different (such as integer 0 vs NULL).
In practice, the upper 48-bits of an extern variable is completely shared
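As a sketch of the constant-merging "TODO": one way to do it would be to key the constant table on the raw bit pattern rather than on (type, value). The names below are hypothetical, and it assumes the target represents NULL as all-zero bits, so integer 0 and NULL end up in the same slot (and hence could share a register).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CONSTS 64

/* Hypothetical per-function constant table, keyed by bit pattern only. */
typedef struct {
    uint64_t bits[MAX_CONSTS];
    int      count;
} ConstPool;

/* Returns the slot holding this bit pattern, adding it on first use.
   Returns -1 if the table is full. */
int const_pool_intern(ConstPool *pool, uint64_t bits)
{
    assert(pool != NULL);

    for (int i = 0; i < pool->count; i++) {
        if (pool->bits[i] == bits)
            return i;   /* same "actual" value: reuse the existing slot */
    }
    if (pool->count >= MAX_CONSTS)
        return -1;

    pool->bits[pool->count] = bits;
    return pool->count++;
}
```

The type information would still be needed elsewhere in the compiler; this only decides which constants can share storage.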
For functions with dynamic assignment, immediate values are more likely to be used. If the code-generator were clever, potentially it could exclude assigning registers to constants which are only used by instructions which can encode them directly as an immediate. Currently, BGBCC is not that clever.
And then there are languages like PL/1 and FORTRAN where the compiler
Or, say:
y=x+31; //31 only being used here, and fits easily in an Imm9.
Ideally, compiler could realize 31 does not need a register here.
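A minimal sketch of that check, in C. The names are hypothetical, and the Imm9 range is an assumption (taken here as an unsigned 9-bit field, 0..511); a real encoder may treat the field differently. The idea is just that a constant only needs a register if some use of it cannot encode it as an immediate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: Imm9 is an unsigned 9-bit field (0..511). */
bool fits_imm9(int64_t value)
{
    return value >= 0 && value <= 511;
}

/* Hypothetical summary of how a constant is used within one function. */
typedef struct {
    int64_t value;
    int     uses_total;        /* all uses of the constant */
    int     uses_imm_capable;  /* uses by instructions with an Imm9 form */
} ConstUse;

bool constant_needs_register(const ConstUse *c)
{
    assert(c != NULL && c->uses_imm_capable <= c->uses_total);

    if (c->uses_total == 0)
        return false;  /* unused constant: nothing to allocate */

    /* Skip the register only if every use can take the immediate form
       and the value actually fits in the field. */
    bool all_imm = (c->uses_imm_capable == c->uses_total);
    return !(all_imm && fits_imm9(c->value));
}
```

For the `y=x+31` case, 31 has a single use by an instruction with an immediate form, so no register would be assigned.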
Well, and another weakness is with temporaries that exist as function arguments:
If static assigned, the "target variable directly to argument register" optimization can't be used (it ends up needing to go into a callee-save register and then be MOV'ed into the argument register; otherwise the compiler breaks...).
Though, I guess one possibility is that the compiler could try to partition temporaries that are used exclusively as function arguments into a different category from "normal" temporaries (or those whose values may cross a basic-block boundary), and then avoid statically-assigning them (and somehow not cause this to effectively break the full-static-assignment scheme in the process).
Brian's compiler finds the largest argument list and the largest return
Though, IIRC, I had also considered the possibility of a temporary "virtual assignment", allowing the argument value to be temporarily assigned to a function argument register, then going "poof" and disappearing when the function is called. Hadn't yet thought of a good way to add this logic to the register allocator though.
But, yeah, compiler stuff is really fiddly...
More orthogonality helps.