On Fri, 16 Aug 2024 4:30:54 +0000, Brett wrote:
Stephen Fuld <sfuld@alumni.cmu.edu.invalid> wrote:
On 8/14/2024 5:54 PM, Brett wrote:
Brett <ggtgp@yahoo.com> wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Mon, 12 Aug 2024 2:23:00 +0000, Brett wrote:
BGB <cr88192@gmail.com> wrote:
Another benefit of 64 registers is more inlining removing calls.
A call can cause a significant amount of garbage code all around that
call, as it splits your function and burns registers that would
otherwise get used.
What I see around calls is MOV instructions grabbing arguments from the
preserved registers and putting return values into the proper preserved
register. Inlining does get rid of these MOVs, but what else?
For middling functions, I spent my time optimizing heavy code, the 10%
that matters.
The first half of a big function will have some state that has to be
reloaded after a call, or worse yet saved and reloaded.
Inlining is limited by register count; with twice the registers the
compiler will generate far larger leaf calls with less call depth, which
removes more of those MOVs.
I can understand the reluctance to go to 6-bit register specifiers; it
burns up your opcode space and makes encoding everything more difficult.
I am on record as stating the proper number of bits in an
instruction-specifier is 34-bits. This is after designing Mc88K ISA,
doing 3 generations of SPARC chips, 7 years of x86-64, and Samsung GPU
(and my own efforts).
Making the registers 6-bits would increase that count to 36-bits.
My 66000 hurts less with 6-bits as more constant bits get moved to
extension words, which is almost free by most metrics.
Only My 66000 can reasonably implement 6-bit register specifiers.
The market is yours for the taking.
6-bits will make you stand out and get noticed.
The only down side I see is a few percent in code density.
Actually due to the removal of MOVs and reloads the code density may be
basically the same.
Anytime one removes more "MOVs and saves and restore" instructions
than the called subroutine contains within the prologue and epilogue
bounds, the subroutine should be inlined.
Also longer context switch times, as more registers to save/restore.
The save should be free, as the load from RAM is so slow.
When HW is doing the saves, the saves can be performed while
waiting for the first instruction to arrive and for the first
registers to arrive. Thus, done in HW, the saves are essentially
free.
If the context is time critical it should be written to use the
registers that are reloaded first, first. In which case the code
could start doing work in the same amount of time regardless of
register count. (I doubt the CPU design is actually that smart,
or that the people that program the interrupts are.)
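A toy model of the point above, as a rough sketch only. It assumes registers are reloaded one per cycle in a fixed order and that an instruction may issue as soon as the registers it reads have arrived. The function name, the register names and the one-per-cycle rate are all invented for illustration and do not describe any particular CPU.

```python
def first_work_cycle(reload_order, needed_first):
    """Cycle at which the first instruction can issue.

    reload_order: register names in the order they are reloaded,
                  one per cycle (assumption).
    needed_first: registers the first instruction reads.
    """
    # Map each register to the cycle in which its value arrives.
    arrival = {reg: cycle for cycle, reg in enumerate(reload_order, start=1)}
    # The instruction waits for the latest-arriving register it needs.
    return max(arrival[reg] for reg in needed_first)


order32 = [f"r{i}" for i in range(32)]
order64 = [f"r{i}" for i in range(64)]

# Handler written to touch the earliest-reloaded registers first:
early_32 = first_work_cycle(order32, ["r0", "r1"])
early_64 = first_work_cycle(order64, ["r0", "r1"])

# Handler that happens to need the last registers reloaded:
late_32 = first_work_cycle(order32, ["r30", "r31"])
late_64 = first_work_cycle(order64, ["r62", "r63"])
```

Under these assumptions the early handler starts at the same cycle with 32 or 64 registers, while the late one pays for the whole reload.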
When HW is doing the saves, it does them in a known order and
can mark the registers "in use" or "busy" instantaneously and
clear that status as data arrives. When SW is doing the same,
SW has to wait for the instruction to arrive and then do them
one or a small number at a time. HW is not so constrained.
For example a 1-wide machine with a 4-ported register file,
generally operated as 3R1W can be switched to 4R or 4W for
epilogue or prologue uses respectively. Simulation indicates
this gets rid of 47% of the cycles spent in prologue and
epilogue (combined, compared to a sequence of stores and loads).
Simulation also indicates that 42% of the power is saved--
mainly from Tag and TLB non-access cycles.
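A toy cycle count for the port-switching idea above, as a rough sketch only. It assumes one register moves per cycle on the ordinary store/load path and up to four per cycle when all four ports are turned the same way. The function names and parameters are made up for illustration. It ignores memory bandwidth, instruction fetch and everything else in a real prologue, so it is an idealized bound and is not expected to reproduce the simulated 47% figure.

```python
import math


def transfer_cycles(num_regs, ports_per_cycle):
    """Cycles to move num_regs registers at ports_per_cycle per cycle."""
    return math.ceil(num_regs / ports_per_cycle)


def savings(num_regs):
    """Return (sequential cycles, 4-port cycles, fraction of cycles removed)."""
    sequential = transfer_cycles(num_regs, 1)  # one store or load per cycle
    wide = transfer_cycles(num_regs, 4)        # register file used as 4R or 4W
    return sequential, wide, 1 - wide / sequential


seq, wide, frac = savings(16)
```

In this idealized model the saving for 16 registers is 75% of the transfer cycles; a lower measured number is plausible once the parts the model leaves out are counted.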