On 8/28/24 11:36 PM, BGB wrote:
On 8/28/2024 11:40 AM, MitchAlsup1 wrote:
[snip]
My 1-wide machine does ENTER and EXIT at 4 registers per cycle.
Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.
It likely isn't going to happen because a 1-wide machine isn't going to have the needed register ports.
For an in-order implementation, banking could be used for saving
a contiguous range of registers with no bank conflicts.
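As a rough sketch of the banking idea (my own illustration, assuming
a four-bank file indexed by the low register-number bits, not any
particular implementation):

    /* Hypothetical sketch: with the register file split into four
       banks selected by the low two bits of the register number,
       any four consecutive registers hit each bank exactly once,
       so a save/restore of a contiguous range can move four
       registers per cycle through single-ported banks. */
    #include <stdio.h>

    #define NBANKS 4

    static int bank_of(int reg) { return reg & (NBANKS - 1); }

    int main(void)
    {
        /* Saving r8..r11: each maps to a different bank, so all
           four accesses can proceed in the same cycle. */
        for (int reg = 8; reg < 12; reg++)
            printf("r%d -> bank %d\n", reg, bank_of(reg));
        return 0;
    }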
Mitch Alsup chose to provide four read/write ports with the
typical use being three read, one write instructions. This not
only facilitates faster register save/restore for function calls
(and context switches/interrupts) but also presents the opportunity
for limited dual issue ("CoIssue").
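As a rough illustration of the port-sharing aspect only (my own
sketch, not My 66000's actual CoIssue logic; dependence and
functional-unit checks are omitted):

    /* Hypothetical sketch: with four ports that can each serve as
       a read or a write, two instructions can share the file in
       one cycle whenever their combined port demand fits, e.g.
       two one-read/one-write operations (four ports total). */
    #include <stdbool.h>
    #include <stdio.h>

    struct uop { int nreads, nwrites; };

    static bool ports_allow_coissue(struct uop a, struct uop b)
    {
        return a.nreads + a.nwrites + b.nreads + b.nwrites <= 4;
    }

    int main(void)
    {
        struct uop mov = {1, 1}, add = {2, 1};
        printf("mov+mov: %d\n", ports_allow_coissue(mov, mov)); /* 1: 4 ports */
        printf("add+mov: %d\n", ports_allow_coissue(add, mov)); /* 0: 5 ports */
        return 0;
    }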
I do not know what the power and area costs are for read/write vs.
dedicated ports, nor for four ports versus three ports. I suspect
three-read, one-write instructions are not common generally, and
often a read can be taken from the forwarding network or by
stealing a read port from a later instruction that only needed
one read port. (One could argue that not introducing a
performance hiccup in uncommon cases would justify the modest
cost of the extra port: performance drops to two-thirds in the
case where every instruction is three-read and a read can only
be stolen from a following three-read instruction, and the
stealing mechanism still requires buffering of reads and
scheduling complexity, which might hurt frequency.)
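To make the two-thirds figure concrete (my arithmetic, assuming a
file with two read ports feeding a stream of instructions that each
need three reads, with reads streamed through the ports across
instruction boundaries):

    /* Hypothetical back-of-the-envelope model: 300 three-read
       instructions demand 900 register reads; two read ports
       supply 2 reads per cycle, so draining them takes 450
       cycles and throughput is 300/450 = 2/3 instruction/cycle. */
    #include <stdio.h>

    int main(void)
    {
        const int read_ports = 2, reads_per_insn = 3, insns = 300;
        int total_reads = insns * reads_per_insn;                  /* 900 */
        int cycles = (total_reads + read_ports - 1) / read_ports;  /* 450 */
        printf("IPC = %.2f\n", (double)insns / cycles);            /* 0.67 */
        return 0;
    }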
At some issue width, undersupplying register ports makes sense,
both because port cost increases with count and because wide
issue is unlikely to support N-wide execution of every
instruction type (and larger instruction groups are less likely
to encounter the structural hazard of having fewer than
worst-case register ports). Adding out-of-order execution
further reduces the performance impact of such hazards.
(A simple one-wide pipeline stalls for any hazard. An in-order
two-wide pipeline would not always dual issue, due to
dependencies, even with a perfect cache, so the extra stalls
from having only four register read ports, for example, would
not hurt performance as much as a two-read-port limit would
hurt a scalar design. Out-of-order execution tends to further
average out resource use.)
[I think more cleverly managing communication and storage has
potential for area and power saving. Repeating myself, any-to-any
communication seems expensive and much communication is more
local. The memory hierarchy typically has substantial emphasis
on "network locality" and not just spatial and temporal locality
(separate instruction and data caches, per core caches/registers),
but I believe there is potential for improving communication and
storage "network locality" within a core. Sadly, I have not
*worked* on how communication might be improved.]
But, if one doesn't have the register ports, there is likely no viable way to move 4 registers/cycle to/from memory (and it wouldn't make sense for the register file to have a path to memory that is wider than what the pipeline has).
My 66000's VVM encourages wide cache access even with relatively
narrow execution resources. A simple vector MADD could use four
times as much cache bandwidth (in register widths) as execution
bandwidth (in scalar operations), so loading/storing four
sequential-in-memory values per cycle could keep a single MADD
unit busy.
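For concreteness, this is the shape of loop being described (plain
C here, my illustration rather than My 66000 code):

    /* Each iteration performs one multiply-add but touches memory
       four times (loads of b[i], c[i], d[i] and a store to a[i]),
       so sustaining one MADD per cycle requires roughly four
       sequential cache accesses per cycle. */
    void vmadd(double *a, const double *b, const double *c,
               const double *d, long n)
    {
        for (long i = 0; i < n; i++)
            a[i] = b[i] * c[i] + d[i];
    }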