On 3/8/24 11:17 PM, MitchAlsup1 wrote:
> Paul A. Clayton wrote:
[snip]
>> Register windows were intended to avoid save/restore overhead
>> by retaining values in registers with renaming. A stack cache
>> is meant to reduce the overhead of loads and stores to the
>> stack, not just of preserving and restoring registers. A
>> direct-mapped stack cache is not entirely insane. A partial
>> stack frame cache might cache up to 256 bytes (e.g.) with
>> alternating frames indexing with inverted bits (to reduce
>> interference); one could even reserve a chunk (e.g., 64 bytes)
>> of each frame that is not overlapped by the adjacent frame, by
>> limiting the offsets cached to be smaller than the cache.
>> Such might be more useful than register windows, but that does
>> not mean that it is actually a good option.
> If it is such a good option why has it not reached production ??
"(Might be) more useful than register windows" is not the same
as providing a net benefit when considering the entire system.
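
To make the inverted-bit indexing quoted above concrete, here is
a minimal sketch of my reading of it (the 256-byte capacity and
64-byte reserved chunk are just the example numbers from above,
and the byte-granular indexing is a simplification, not a
worked-out design):

#include <stdint.h>

#define STACK_CACHE_BYTES 256u
#define INDEX_MASK        (STACK_CACHE_BYTES - 1u)

/* frame_parity would toggle on every call/return; offset is the
   access's offset within the current frame. Alternating frames use
   a bit-inverted index, so a small caller frame and a small callee
   frame tend to land at opposite ends of the cache rather than on
   top of each other. */
static inline uint32_t
stack_cache_index(uint32_t offset, unsigned frame_parity)
{
    uint32_t index = offset & INDEX_MASK;  /* direct-mapped lookup */
    if (frame_parity & 1u)
        index = ~index & INDEX_MASK;  /* invert on alternate frames */
    return index;
}

If the offsets allowed to be cached are limited to 0..191,
neighboring frames can still collide in indices 64..191, but each
frame keeps a 64-byte region (indices 0..63 or 192..255) that the
adjacent frame can never evict.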
One obvious issue with a small stack cache is utilization. While
generic data caches also have utilization issues (no single size
is ideal for all workloads) and the stack cache would be small
(and potentially highly prefetchable), the spilling and filling
overhead when entering and exiting stack frames could be much
greater than the savings from simple addressing (and permission
checks) if few accesses are made within the cached part of the
stack frame between frame spills and fills.
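
As a rough, purely hypothetical illustration of that break-even
point: if frame entry and exit together move 2 x 256 bytes
between the stack cache and the rest of the hierarchy at 32 bytes
per cycle, that is about 16 cycles of spill/fill traffic per
frame activation; if a stack-cache hit saves roughly one cycle
(and a cheaper permission check) over an L1 access, a frame would
need on the order of 16 cached accesses during its active period
just to break even, before counting energy and added complexity.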
A latency-optimized partial-frame stack cache would also benefit
from the higher-utilization regions of stack frames (for frames
with longish active periods) falling at specific sizes and
offsets, so compiler-based optimization would be a factor.
Depending on microarchitecture-specific compiler optimization for
good performance is generally avoided.
This is related to software distribution format. If aliasing were
not avoided by architectural contract (which would be difficult
for any existing ISA), then handling aliases would also introduce
overhead. (For higher utilization, one might want to avoid caching
the registers saved at function entry, assuming these are colder
and less latency sensitive than other values in the frame. Since
the amount of the frame used by saved registers would vary, a
hardware-friendly fixed uncached chunk would either waste capacity
on cold saved registers when more registers are saved or, when
fewer are saved, make some potentially warmer values uncached [in
the stack cache]. Updating the stack pointer to hide the saved
registers would address this but would presumably introduce other
issues.)
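
(To put hypothetical numbers on that tradeoff: with a fixed
64-byte uncached chunk at the base of each frame, a function
saving ten 8-byte registers would have 16 bytes of presumably
cold save area spilling into the cached region, while a function
saving only two registers would have 48 bytes of potentially warm
locals falling into the uncached chunk.)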
Another factor that would reduce the attractiveness of specialized
caches is the use of out-of-order execution. OoOE helps hide
latency, so any latency benefit is less important.
Not all optimization opportunities are implemented even when they
do not conflict excessively. Part of this is the complexity and
risks of adding new features.
On 3/6/24 3:00 PM, MitchAlsup1 wrote:
> Paul A. Clayton wrote:
>> An L2 register set that can only be accessed for one operand
>> might be somewhat similar to LD-OP.
>
> In high speed designs, there are at least 2 cycles of delay from AGEN
> to the L2 and 2 cycles of delay back. Even zero cycle access sees at
> least 4 cycles of latency, 5 if you count AGEN.
There seems to have been confusion. I wrote "L2 _register_ set".
Being able to access a larger register name space for one operand
might be useful when value reuse often has moderate temporal
locality.
Such an L2 register set is even more complicated than load-op in
terms of compiler optimization.
Renaming a larger name space of (L2) registers would also
introduce issues. I suspect something more like a Load-Store Queue
would be used rather than a register alias table. The benefits
from specialization (e.g., smaller tags, since the address space
is smaller than that of general memory) would conflict with the
utilization benefits of having only an LSQ.
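
As a rough sketch of the tag-size point (field widths are
illustrative assumptions, not a proposal): a conventional LSQ
entry must carry and compare a full physical address, while a
queue specialized to a small L2 register name space only needs
the register index as its search tag.

#include <stdint.h>

struct lsq_entry {          /* conventional load/store queue entry */
    uint64_t phys_addr;     /* ~40-52 bit physical address to CAM  */
    uint8_t  size;          /* access size for overlap checks      */
    uint16_t age;           /* ordering information                */
    /* ... data and state bits ... */
};

struct l2reg_queue_entry {  /* queue specialized to, e.g., 256 L2 regs */
    uint8_t  reg_index;     /* 8-bit name replaces the address tag */
    uint16_t age;           /* still needs ordering information    */
    /* ... data and state bits ... */
};

The narrower tag shrinks the CAM, but the specialized structure
sits idle whenever the L2 register space is not in use, which is
the utilization cost of not simply having a (larger) LSQ.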
Physical placement would also involve tradeoffs of latency (and
access energy) relative to L1 data cache. Giving prime real estate
to an L2 register file would increase L1 latency (and access
energy).
Dynamic scheduling would also be made a little more complicated
by adding another latency to consider, and using banking rather
than multiporting (which becomes more reasonable at larger
capacities) would add more latency variability.
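
A tiny sketch of the banking point (four banks selected by the
low index bits is an assumption): two same-cycle accesses that
select the same bank conflict, so one must wait and the scheduler
can no longer treat the access latency as a single fixed number.

#include <stdint.h>

#define NUM_BANKS 4u

static inline unsigned bank_of(uint32_t index)
{
    return index & (NUM_BANKS - 1u);     /* bank = low index bits */
}

static inline int same_cycle_conflict(uint32_t a, uint32_t b)
{
    return bank_of(a) == bank_of(b);  /* loser of the conflict stalls */
}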
It does seem *to me* that there should be a benefit from a storage
region of intermediate capacity with simpler addressing than
general memory.
>> Presumably this is related to the storage technology used as
>> well as the capacity.
>
> Purely wire delay due to the size of the L2 cache.
Wire delay due to physical size is related to storage technology
as well as capacity. E.g., DRAM can be denser than SRAM and thus
have lower latency at larger sizes even when array access is slower.
Single-ported register storage technology would (I ass_me) be even
less dense than SRAM, such that there would be some capacity where
latency would be better with SRAM even when register storage would
be faster at the array level. Of course, latency is not the only
consideration for storage.
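
As a rough illustration (factors purely hypothetical): if wire
delay grows roughly with the linear dimension of the array, i.e.
with the square root of its area, a cell that is 4x denser gives
roughly half the wire delay at the same capacity, so an array
access that is somewhat slower can still come out ahead once the
capacity (and thus the wire length saved) is large enough.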