On 3/6/24 3:00 PM, MitchAlsup1 wrote:
> Paul A. Clayton wrote:
[snip]
>> It seems that 64-bit stack-pointer-relative accesses could be
>> roughly as fast by using the offset as the index. Each stack
>> frame would be comparable to a different thread register
>> context, with the tradeoff of extra storage for multiple stack
>> frames ("multithreading"; alternating between indexing up and
>> indexing down would provide some utilization flexibility with
>> low indexing overhead) relative to pushing out early frames (a
>> normal "context switch"). Such a cache would probably be
>> limited in the frame size cached.
> Smells too much like register windows, which never outperformed
> the flat RF of MIPS. In any event, 50% of subroutines need no
> stack <accesses>, and those that do typically only store 3
> registers (for restore later).
Register windows were intended to avoid save/restore overhead by
retaining values in registers with renaming. A stack cache is
meant to reduce the overhead of loads and stores to the stack,
not just of preserving and restoring registers. A direct-mapped
stack cache is not entirely insane. A partial stack frame cache
might cache up to 256 bytes (e.g.), with alternating frames
indexing with inverted bits (to reduce interference); one could
even reserve a chunk (e.g., 64 bytes) of each frame that is never
overlapped by limiting the cached offsets to less than the cache
size.
Such might be more useful than register windows, but that does
not mean that it is actually a good option.
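
To make that indexing concrete, here is a minimal C sketch; the
sizes, names, and depth-parity rule are merely illustrative
assumptions, not a worked-out design.

#include <stdint.h>

/* A 256-byte direct-mapped stack cache where frames at odd call
 * depths invert the index bits, so a caller and its callee tend
 * to map toward opposite ends of the cache. Limiting cached
 * offsets to 192 bytes would leave each frame a 64-byte chunk
 * that the adjacent frame never overlaps. */

#define STACK_CACHE_BYTES 256u
#define SLOT_BYTES        8u                 /* 64-bit slots */
#define NUM_SLOTS         (STACK_CACHE_BYTES / SLOT_BYTES)
#define MAX_CACHED_OFFSET 192u               /* reserves 64 bytes */

/* Would this stack-pointer-relative offset be cached at all? */
static inline int is_cached_offset(uint32_t sp_offset)
{
    return sp_offset < MAX_CACHED_OFFSET;
}

/* Map a cached offset to a slot; `depth` is the call-nesting
 * depth of the frame being accessed. */
static inline uint32_t stack_cache_slot(uint32_t sp_offset,
                                        uint32_t depth)
{
    uint32_t slot = (sp_offset / SLOT_BYTES) % NUM_SLOTS;
    if (depth & 1u)                /* alternate frames ... */
        slot ^= NUM_SLOTS - 1u;    /* ... invert the index bits */
    return slot;
}
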
>> An L2 register set that can only be accessed for one operand
>> might be somewhat similar to LD-OP.
> In high-speed designs, there are at least 2 cycles of delay from
> AGEN to the L2 and 2 cycles of delay back. Even zero-cycle
> access sees at least 4 cycles of latency, 5 if you count AGEN.
Presumably this is related to the storage technology used as
well as the capacity. With higher-speed ("register") cells,
lower porting might substantially compensate for the higher bit
capacity in terms of area, latency, and power. The specialized
caches/L2 register sets would presumably also be smaller than
the 32 KiB of typical L1 caches.
In an out-of-order design, the "L2 register set" might be read
before the operation is scheduled for execution, possibly
hiding one or two cycles of the latency. (An insane hoisting
could use something like a BTB to provide an index before the
instruction itself is available.) Doing the reads from the main
storage in-order might also facilitate software-controlled
banking, but that might not be useful.
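
As a trivial back-of-envelope sketch in C (the round-trip figure
comes from the 2+2 cycles quoted above; the hoist amounts are my
assumption):

#include <stdio.h>

/* If the "L2 register set" read launches `hoist` cycles before
 * the operation would otherwise issue, the scheduler sees only
 * the remainder of the round trip. */
int main(void)
{
    const int l2_round_trip = 4;      /* 2 cycles there + 2 back */
    for (int hoist = 0; hoist <= 2; hoist++)
        printf("hoist=%d -> visible latency %d cycles\n",
               hoist, l2_round_trip - hoist);
    return 0;
}
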
(If many of the updates read the old value, this might make
preserving the old values outside of the main store practical.
Having the current value in the main store would increase its
use, rather than requiring access to a store queue holding ready
but uncommitted values. This technique could, of course, also
apply to registers, allowing an expensive multiported storage
element to be freed before commitment of an overwriting
instruction. Read-and-update would also be useful for ECC in a
data cache when the write is smaller than the ECC granule.)
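
For the ECC point, a toy C sketch (the 8-byte granule and the
XOR "check code" are stand-ins for a real SEC-DED code):

#include <stdint.h>
#include <string.h>

#define ECC_GRANULE 8u

/* Toy check code standing in for real SEC-DED ECC bits. */
static uint8_t check_of(const uint8_t granule[ECC_GRANULE])
{
    uint8_t c = 0;
    for (unsigned i = 0; i < ECC_GRANULE; i++)
        c ^= granule[i];
    return c;
}

/* A store narrower than the granule must read the old bytes so
 * the check bits can be recomputed over the whole granule; the
 * write becomes a read-and-update. */
static void partial_store(uint8_t granule[ECC_GRANULE],
                          uint8_t *check, unsigned offset,
                          const uint8_t *src, unsigned len)
{
    memcpy(&granule[offset], src, len); /* merge into old data */
    *check = check_of(granule);         /* recompute over granule */
}
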
It is not obvious (to me) that such an intermediate-capacity,
simply-indexed value store would be beneficial, but that it would
not be beneficial is also not obvious (to me). The design space
seems large, and workload characteristics and compiler output
would likely have a significant impact on utility.