Subject: Re: "Mini" tags to reduce the number of op codes
From: cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups: comp.arch
Date: 09. Apr 2024, 21:01:50
Organization: A noiseless patient Spider
Message-ID: <uv46rg$e4nb$1@dont-email.me>
User-Agent: Mozilla Thunderbird
On 4/9/2024 1:24 PM, Thomas Koenig wrote:
> I wrote:
>> MitchAlsup1 <mitchalsup@aol.com> schrieb:
>>> Thomas Koenig wrote:
>>>> John Savard <quadibloc@servername.invalid> schrieb:
>>>>> Thus, instead of having mode bits, one _could_ do the following:
>>>>>
>>>>> Usually, have 28 bit instructions that are shorter because there's
>>>>> only one opcode for each floating and integer operation. The first
>>>>> four bits in a block give the lengths of data to be used.
>>>>>
>>>>> But have one value for the first four bits in a block that indicates
>>>>> 36-bit instructions instead, which do include type information, so
>>>>> that very occasional instructions for rarely-used types can be mixed
>>>>> in which don't fill a whole block.
>>>>
>>>> While that's a theoretical possibility, I don't view it as being
>>>> worthwhile in practice.
>>>>
>>>> I played around a bit with another scheme: Encoding things into
>>>> 128-bit blocks, with either 21-bit or 42-bit or longer instructions
>>>> (or a block header with six bits, and 20 or 40 bits for each
>>>> instruction).
>>>
>>> Not having seen said encoding scheme:: I suspect you used the Rd=Rs1
>>> destructive operand model for the 21-bit encodings. Yes :: no ??
>>
>> It was not very well developed, I gave it up when I saw there wasn't
>> much to gain.
>
> Maybe one more thing: In order to justify the more complex encoding,
> I was going for 64 registers, and that didn't work out too well
> (missing bits).
>
> Having learned about M-Core in the meantime, pure 32-register,
> 21-bit instruction ISA might actually work better.
For 32-bit instructions at least, 64 GPRs can work out OK.
Though, the gain of 64 over 32 seems to be fairly small for most "typical" code, mostly bringing a benefit if one is spending a lot of CPU time in functions that have large numbers of local variables all being used at the same time.
Seemingly:
16/32/48 bit instructions, with 32 GPRs, seems likely optimal for code density;
32/64/96 bit instructions, with 64 GPRs, seems likely optimal for performance.
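The bit-budget arithmetic behind these pairings can be sketched quickly. This is my own back-of-envelope illustration, not from the post: it just counts how many opcode bits remain after the register fields are paid for.

```c
/* Opcode bits left in an instruction word after its register fields,
   for a given GPR count. E.g. 32 GPRs need 5-bit fields, 64 need 6. */
static int opcode_bits(int insn_bits, int num_gprs, int reg_fields)
{
    int reg_bits = 0;
    while ((1 << reg_bits) < num_gprs)  /* bits per register field */
        reg_bits++;
    return insn_bits - reg_fields * reg_bits;
}
```

With 32 GPRs, a 3-register 32-bit instruction keeps 17 opcode bits (32 - 3*5), versus 14 with 64 GPRs; and a 16-bit encoding with two 5-bit register fields keeps only 6 opcode bits, which is part of why 16-bit encodings lean toward destructive two-register forms.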
Where, 16 GPRs isn't really enough (lots of register spills), and 128 GPRs is wasteful (one would likely need lots of monster functions with 250+ local variables to make effective use of them (*), which probably isn't going to happen).
*: Where, it appears it is most efficient (for non-leaf functions) if the number of local variables is roughly twice the number of CPU registers. With more local variables than this, the spill/fill rate goes up significantly; with fewer, the registers aren't utilized as effectively.
Well, except in "tiny leaf" functions, where the criterion is instead that the number of local variables be less than the number of scratch registers. However, for many/most small leaf functions, the total number of variables isn't all that large either.
Where, function categories:
Tiny Leaf:
Everything fits in scratch registers, no stack frame, no calls.
Leaf:
No function calls (either explicit or implicit);
Will have a stack frame.
Non-Leaf:
May call functions, has a stack frame.
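As a minimal C sketch (my own illustration, not from the post) of what the three categories can look like — actual frame/register behavior depends on the ABI and compiler, but the shape is roughly:

```c
/* Tiny leaf: few locals, no calls -- everything can live in scratch
   registers, so no stack frame is needed. */
static int tiny_leaf(int a, int b)
{
    return a * 2 + b;
}

/* Leaf: no calls, but a local array forces a stack frame anyway. */
static int leaf(int n)
{
    int tmp[8];
    for (int i = 0; i < 8; i++)
        tmp[i] = i * n;
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += tmp[i];
    return sum;
}

/* Non-leaf: calls other functions, so it must preserve its return
   address (and any live scratch values) across calls -- stack frame. */
static int non_leaf(int n)
{
    return leaf(n) + tiny_leaf(n, 1);
}
```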
There is a "static assign everything" case in my case, where all of the variables are statically assigned to registers (for the scope of the function). This case typically requires that everything fit into callee save registers, so (like the "tiny leaf" category, requires that the number of local variables is less than the available registers).
On a 32 register machine, if there are 14 available callee-save registers, the limit is 14 variables. On a 64 register machine, this limit might be 30 instead. This seems to have good coverage.
In the non-static case, the top N variables might be static-assigned, and the remaining variables dynamically assigned. Though, it appears this is more an artifact of my naive register allocator, and might not be as effective a strategy with an "actually clever" register allocator (like those in GCC or LLVM), where purely dynamic allocation may be better: those allocators can carry dynamic assignments across basic-block boundaries, rather than needing to spill/fill everything whenever a branch or label is encountered.
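The top-N split might be sketched as follows. This is a hypothetical illustration of the idea, not the actual allocator: the struct names, the use-count heuristic, and the figure of 14 callee-save registers are all my assumptions.

```c
#include <stdlib.h>

/* e.g. callee-save registers available on a 32-GPR machine */
#define CALLEE_SAVE_REGS 14

struct var {
    int id;
    int use_count;
    int static_reg;  /* assigned register, or -1 = dynamic */
};

static int by_uses_desc(const void *a, const void *b)
{
    const struct var *va = a, *vb = b;
    return vb->use_count - va->use_count;
}

/* Statically assign the most-used variables to callee-save registers;
   everything else is left to dynamic per-basic-block allocation,
   which a naive allocator spills at every branch or label. */
static void split_assign(struct var *vars, int n)
{
    qsort(vars, n, sizeof *vars, by_uses_desc);
    for (int i = 0; i < n; i++)
        vars[i].static_reg = (i < CALLEE_SAVE_REGS) ? i : -1;
}
```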
...