On 8/30/2024 7:11 PM, Paul A. Clayton wrote:
On 8/28/24 11:36 PM, BGB wrote:
On 8/28/2024 11:40 AM, MitchAlsup1 wrote:
[snip]
My 1-wide machine does ENTER and EXIT at 4 registers per cycle.
Try doing 4 LDs or 4 STs per cycle on a 1-wide machine.
It likely isn't going to happen because a 1-wide machine isn't going to have the needed register ports.
For an in-order implementation, banking could be used for saving
a contiguous range of registers with no bank conflicts.
Mitch Alsup chose to provide four read/write ports with the
typical use being three read, one write instructions. This not
only facilitates faster register save/restore for function calls
(and context switches/interrupts) but presents the opportunity of
limited dual issue ("CoIssue").
I was mostly doing dual-issue with a 4R2W design.
Initially, 6R3W won out, mostly because 4R2W disallows running an indexed store in parallel with another op, whereas 6R3W does allow this. This scenario made enough of a difference to seemingly justify the added cost of a 3-wide design with a 3rd lane that goes mostly unused (being mostly limited to register MOVs, basic ALU ops, and similar).
But, then this leads to an annoyance:
As is, I will need to generate different code for 1W, 2W, and 3W configurations;
It is starting to become tempting to generate code resembling that for the 1W case (albeit still using the shuffling that would be used when bundling), and then rely on superscalar issue, which, it turns out, is not quite as expensive as I had thought.
With superscalar, I wouldn't have the problem of the 2W and 3W cores each having trouble running code built for the other.
Also, on both 2W and 3W configurations, I can have a 128-bit MOV.X (load/store pair) instruction, so if one assumes 2-wide as the minimum, this instruction can be safely assumed to exist.
I can mostly ignore 1-wide scenarios (2R1W and 3R1W), as I have ended up deciding to relegate these to RISC-V.
By the time I have stripped down BJX2 enough to fit into a small FPGA, it essentially has almost nothing to offer that RV wouldn't offer already (and it makes more practical sense to use something like RV32IM or similar).
I am not sure how one would efficiently pull off writing 4 registers per cycle.
Can note that generally, the GPR part of the register file can be built with LUTRAMs, which on Xilinx chips have the property:
1R1W, 5-bit addr, 3-bit data; comb read, clock-edge write.
1R1W, 6-bit addr, 2-bit data; comb read, clock-edge write.
This means the number of LUTRAMs needed for an NxM (N read port, M write port) register file with G registers can be calculated:
2R1W, 32, Cost=44
3R1W, 32, Cost=66
4R2W, 32, Cost=176
6R3W, 32, Cost=396
4R4W, 32, Cost=352
6R4W, 32, Cost=528
2R1W, 64, Cost=64
3R1W, 64, Cost=96
4R2W, 64, Cost=256
6R3W, 64, Cost=576
4R4W, 64, Cost=512
6R4W, 64, Cost=768
10R5W, 64, Cost=1600
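(I.e., the cost works out to ReadPorts*WritePorts*LUTRAMs-per-array-copy, where one copy of a 32-entry 64-bit array needs ceil(64/3)=22 LUTRAMs and a 64-entry copy needs 64/2=32; e.g., 6R3W with 64 GPRs: 6*3*32=576.)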
There is also the MUX logic and similar, but it should follow the same pattern.
There is a bit-array (2b per register) to indicate which of the arrays holds each register. This ends up turning into FFs, but doesn't matter as much.
In the Verilog, one can write it as-if there were only 1 array per write port, with the duplication (for the read ports) handled transparently by the synthesis stage (convenient), although it still has a steep resource cost.
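E.g., a rough sketch of the pattern (simplified here to 2 write ports / 2 banks, so a 1-bit live-bank flag is enough; signal names are made up for illustration rather than being my actual RTL):

  reg[63:0]  regBank0[63:0];   //array written only by write port 0
  reg[63:0]  regBank1[63:0];   //array written only by write port 1
  reg[63:0]  regLive;          //per-register flag: which bank is current

  always @(posedge clock)
  begin
    if(regWe0)
      begin regBank0[regWrId0]<=regWrVal0; regLive[regWrId0]<=1'b0; end
    if(regWe1)
      begin regBank1[regWrId1]<=regWrVal1; regLive[regWrId1]<=1'b1; end
  end

  //each read port muxes between the banks based on the live-bank flag
  assign  tRegValRdA = regLive[regRdIdA] ?
    regBank1[regRdIdA] : regBank0[regRdIdA];
  assign  tRegValRdB = regLive[regRdIdB] ?
    regBank1[regRdIdB] : regBank0[regRdIdB];

Each extra read port is just another assign (with synthesis replicating each bank once per read port), and each extra write port is another bank plus more live-bank bits, which is where the ReadPorts*WritePorts LUTRAM scaling comes from.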
I think Altera uses a different system, IIRC with 4 or 8 bit addresses, 4-bit data, and both read and write needing a clock edge (as with Block RAM on Xilinx). When I tried experimentally to build for an Altera FPGA, I switched over to doing all the GPRs with FFs and state machines, as ironically this was cheaper than what got synthesized from the LUTRAM-style code.
The core took up pretty much the whole FPGA when I told it to target a DE10 Nano (I don't actually have one, so this was a what if). Though, I do remember that (despite the very inefficient resource usage), its "Fmax" value was somewhat higher than I am generally running at.
Where, for FF-based registers, each register was a small state machine (one instance per register), something like:

  output[63:0]  regOut;      //current value held by this register
  input[63:0]   regInA;      //write-back value, lane A
  input[6:0]    regIdA;      //destination register ID, lane A
  input[63:0]   regInB;      //write-back value, lane B
  input[6:0]    regIdB;      //destination register ID, lane B
  input[63:0]   regInC;      //write-back value, lane C
  input[6:0]    regIdC;      //destination register ID, lane C
  input[6:0]    regIdSelf;   //the register ID this instance represents
  input         clock;       //core clock
  input         isHold;      //global pipeline stall/hold
  input         isFlush;     //suppress write-back (pipeline flush)

  reg[63:0]     regVal;
  assign        regOut = regVal;

  reg           isA;
  reg           isB;
  reg           isC;
  reg           tDoUpd;
  reg[63:0]     tValUpd;

  always @*
  begin
    //check whether any write-back lane targets this register
    isA = (regIdA == regIdSelf);
    isB = (regIdB == regIdSelf);
    isC = (regIdC == regIdSelf);

    tDoUpd  = 0;
    tValUpd = 64'hXXXX_XXXX_XXXX_XXXX;
    casez( {isFlush, isA, isB, isC} )
      4'b1zzz: begin end    //flush: discard any pending write
      4'b01zz: begin tValUpd = regInA; tDoUpd = 1; end
      4'b001z: begin tValUpd = regInB; tDoUpd = 1; end
      4'b0001: begin tValUpd = regInC; tDoUpd = 1; end
      4'b0000: begin end    //no lane writes this register
    endcase
  end

  always @(posedge clock)
  begin
    if(tDoUpd && !isHold)
    begin
      regVal <= tValUpd;
    end
  end
With each read port being a case block:
  case(regIdRs)
    JX2_GR_R2: tRegValRsA0 = regValR2;
    JX2_GR_R3: tRegValRsA0 = regValR3;
    ...
  endcase
  case(regIdRt)
    JX2_GR_R2: tRegValRtA0 = regValR2;
    JX2_GR_R3: tRegValRtA0 = regValR3;
    ...
  endcase
  ...
This works, but has a fairly steep per-register cost.
Cost in this case seems to be more dominated by the number of read-ports and the number of registers (write ports seem to be comparably cheap in this scenario).
Then, there is the forwarding logic, with a cost function mostly dependent on the product of the number of read ports and pipeline EX stages (and WB).
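Per read port it is roughly a priority chain over the in-flight destination values, something like (stage/signal names made up for illustration; assumes ops that do not write a register use an otherwise-unused dest ID):

  //forward from the newest matching EX stage, else use the RF value
  assign  tRegValRs =
    (regIdRs == ex1DstId) ? ex1DstVal :
    (regIdRs == ex2DstId) ? ex2DstVal :
    (regIdRs == ex3DstId) ? ex3DstVal :
    tRegValRsA0;

Which is why the cost scales with (read ports) * (EX/WB stages).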
So, each 4-way MUX eats 1 LUT.
...
Decided to leave out trying to estimate per-port LUT cost, as it quickly gets pretty hairy (but, as a very crude estimate, ~ 2.5 levels of 4-way MUX per port).
But, assuming FF's and 64 GPRs, one would need an additional 3 levels of MUX4's per port (or around 1344 LUTs per port).
Or, ~ 8K LUTs for a 6R register file.
These costs also scale linearly with the size of the registers, so 32-bit registers would roughly halve the LUT cost here.
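(Worked out: selecting 1 of 64 registers needs 16+4+1=21 MUX4's per bit over the 3 levels, so 21*64=1344 LUTs for a 64-bit read port, and 6*1344 is roughly 8K LUTs.)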
I do not know what the power and area costs are for read/write vs.
dedicated ports nor for four ports versus three ports. I suspect
three-read, one-write instructions are not common generally and
often a read can be taken from the forwarding network or by
stealing a read port from a later instruction (that only needed
one read port). (One could argue that not introducing a
performance hiccup in uncommon cases would justify the modest
cost of the extra port. Reducing performance to two-thirds in the
case where every instruction is three-read and a read can only be
stolen from a following three-read instruction — and this still
requires buffering of the reads and complexity of scheduling,
which might hurt frequency.)
I am not sure about ASIC.
For FPGA, pretty sure that bidirectional ports would gain little or nothing over fixed-direction ports (since bidirectional IO is not a thing, and the internal logic is almost entirely different between a read and write port).
If I were using a 6R4W register file, could probably justify a 256-bit Load/Store Quad (say, "MOV.Y"). Would be annoying though as it would likely only be usable with certain registers and with a 16-byte alignment.
As-is, MOV.X forbids:
R1:R0 //SPR (R0 and R1 are SPRs)
R15:R14 //SPR (R15 is SPR)
R33:R32 //Decoder wonk (*1) (Encoded as R1, SPR)
R47:R46 //Decoder wonk (*1) (Encoded as R15, SPR)
*1: Mostly because the 128-bit ops had encoded high registers originally by folding the high bit of the register into the low bit (thus allowing them to fit in the original 5-bit register field). With XGPR and XG2, it should also be possible to encode the registers without folding the bit.
The other option is to detect and special-case these in the decoder, but the easier route was to disallow them in the compiler (if encountered, the compiler will break these cases into pairs of 64-bit Loads/Stores).
At some point of width, undersupplying register ports makes sense
both because port cost increases with count and wider issue is
unlikely to support N wide for every instruction type (and larger
samples are more likely to have fewer cases where the structural
hazard of fewer than worst case register ports will be
encountered). Adding out-of-order execution further reduces the
performance impact of hazards.
(A simple one-wide pipeline stalls for any hazard. An in-order
two-wide pipeline would not always dual issue due to dependencies
even with perfect cache, so adding extra stalls from only having
four register read ports, e.g., would not hurt performance as
much as a two read port limit for a scalar design. Out-of-order
tends to further average out resource use.)
Yeah.
In general, I am getting average-case bundle widths of around 1.3 to 1.6, mostly scalar instructions and 2-wide bundles.
In my case, it is an in-order pipeline that stalls for pretty much anything (L1 miss, register dependency, etc).
Had noted before that the SWeRV core takes a different approach (still nominally in-order), and seems a little less sensitive to these issues. However, its resource cost is fairly steep and it runs at a lower clock speed (along with being 32-bit).
It seems a lot of other 32-bit RISC-V cores are a bit cheaper, but (like my core) still have the property of stalling on pretty much everything.
There are some cores that manage to reduce cost (for RV32I) at the expense of performance by implementing the register file internally as 64x 16-bit registers and using multiple cycles to execute each instruction (say, each RV32 op is broken into 2..4 16-bit operations, with a minimum of 4 CPI). I guess this can make sense if one is doing a microcontroller where performance doesn't really matter.
Looks like the Hazard3 core in the newer RasPi Pico is full 32-bit though, and is designed around a shorter (3 stage) pipeline (Fetch/Decode/Execute AFAICT).
[I think more cleverly managing communication and storage has
potential for area and power saving. Repeating myself, any-to-any
communication seems expensive and much communication is more
local. The memory hierarchy typically has substantial emphasis
on "network locality" and not just spatial and temporal locality
(separate instruction and data caches, per core caches/registers),
but I believe there is potential for improving communication and
storage "network locality" within a core. Sadly, I have not
**worked** on how communication might be improved.]
General idea is to try to minimize areas where communication is needed, but this leads to a weak memory model, with pros/cons.
One significant limiting factor on the clock speed of a simple in-order core seems to be the existence of a global stall/hold signal. But, avoiding this adds a lot of new complexities (things no longer travel the pipeline in a lockstep order, and one needs some way to allow the FUs to deliver their results to the register file).
From what I have seen, it is sort of like the list below (with a rough sketch of the register-gating part after it):
EX stages just run full speed, no stalls;
Instructions are gated, allowed into EX once their registers are available, and any destination register is marked unavailable;
When the instruction finishes, it submits its result back to the register file, and the register becomes available again (allowing any dependent instructions to enter the pipeline);
Some units, like memory Load/Store, would turn into a FIFO, with new requests being added to the FIFO, and results being written back to registers;
There is a mechanism to arbitrate getting FU outputs mapped back to register-file write ports;
...
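A rough sketch of the register-gating part (signal names made up; assume idRs/idRt/idRn come from decode and wbValid/wbRegId from the FU result bus):

  reg[63:0]  regBusy;    //1 pending/busy bit per GPR

  //only allow an op into EX once its sources and destination are free
  wire  srcReady = !regBusy[idRs] && !regBusy[idRt];
  wire  canIssue = opValid && srcReady && !regBusy[idRn];

  always @(posedge clock)
  begin
    if(wbValid)
      regBusy[wbRegId] <= 1'b0;   //result delivered, register free again
    if(canIssue && opHasDest)
      regBusy[idRn] <= 1'b1;      //destination in flight, gate dependents
  end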
Seems pretty sensible, and I "could" try to do a core like this.
Would likely be pretty much a ground-up redesign though, and the main examples I have of cores implemented using this design seem to be fairly heavyweight and have slower clock-speeds (despite avoiding the need for a global stall signal).
Though, I haven't looked too much into this.
If I were to do a core this way, it would likely make sense to have an extra write port, mostly so that simple operations (like ALU) can use fixed write ports (as in an in-order implementation), with another write port mostly to deal with operations that may complete asynchronously (such as memory loads which experienced an L1 miss).
Might make sense to prioritize a 2-wide design, so likely a 4R3W register file (and go over to superscalar).
There may need to be a mechanism on this port to deal with cases where multiple FU's generate a result on this port on the same clock-cycle (such as a small FIFO), likely with a gating mechanism to not allow any new operations in this class to begin if the FIFO is backed up.
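Roughly, the arbitration could be a small FIFO in front of the extra write port, something like (again an illustrative sketch with made-up names; per-FU muxing into the FIFO input and reset are omitted):

  reg[70:0]  lateFifo[3:0];   //4 entries of {7-bit reg ID, 64-bit value}
  reg[2:0]   lateRdPtr;
  reg[2:0]   lateWrPtr;

  wire  lateEmpty = (lateRdPtr == lateWrPtr);
  wire  lateFull  = ((lateWrPtr - lateRdPtr) >= 3'd4);  //gates new async ops

  //drain at most one late result per cycle into the extra write port
  wire         lateWbValid = !lateEmpty;
  wire[6:0]    lateWbId    = lateFifo[lateRdPtr[1:0]][70:64];
  wire[63:0]   lateWbVal   = lateFifo[lateRdPtr[1:0]][63:0];

  always @(posedge clock)
  begin
    if(lateInValid && !lateFull)
    begin
      lateFifo[lateWrPtr[1:0]] <= { lateInId, lateInVal };
      lateWrPtr <= lateWrPtr + 1;
    end
    if(!lateEmpty)
      lateRdPtr <= lateRdPtr + 1;
  end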
Though, if I allow L1 misses to be handled separately from L1 hits, this could lead to paradoxical instruction reordering if new instructions are allowed into the EX pipeline before the memory load finishes (this likely would not happen if all mem ops go through a FIFO, as reordering independent register-only ops should be invisible).
Likely it would also make sense for any "high latency" instructions to also be treated as asynchronous (probably including any scalar FPU operations).
But, dunno.
If I implemented such a core, it would probably ignore the WEX hinting, since it would likely no longer serve a useful purpose in this case.
But, if one doesn't have the register ports, there is likely no viable way to move 4 registers/cycle to/from memory (and it wouldn't make sense for the register file to have a path to memory that is wider than what the pipeline has).
My 66000's VVM encourages wide cache access even with relatively
narrow execution resources. A simple vector MADD could use four
times the cache bandwidth (in register width) as execution
bandwidth (in scalar operations), so loading/storing four
sequential-in-memory values per cycle could keep a single MADD
unit busy.
OK.