On 9/11/2024 8:54 AM, Robert Finch wrote:
On 2024-09-10 5:07 p.m., BGB wrote:
On 9/10/2024 9:58 AM, Robert Finch wrote:
On 2024-09-10 3:00 a.m., BGB wrote:
>
...
>
Can note: Annoyingly, despite claims of the RV 'C' extension's cheapness to decode, it still has a fairly high LUT impact. It was worse before, but I was able to slightly reduce it in the immediate decoding (as-is, it first decodes immediate values into 12-bit forms, then does a final extension to 33 bits, with a special case for the LUI immediate).
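As a rough illustration of why the 'C' immediates cost MUXes, here is a minimal software sketch of extracting and sign-extending the C.ADDI immediate, whose bits are scattered across the instruction word (this is just the standard RVC bit layout for C.ADDI; it is not code from the core being discussed):

```c
#include <stdint.h>

/* Sketch: extract the 6-bit immediate of RV C.ADDI
 * (imm[5] at inst[12], imm[4:0] at inst[6:2]) and
 * sign-extend it. In hardware, each such scattered
 * field becomes a pile of MUXes per compressed format. */
static int32_t caddi_imm(uint16_t inst)
{
    uint32_t imm = ((inst >> 12) & 1u) << 5;  /* imm[5]   */
    imm |= (inst >> 2) & 0x1Fu;               /* imm[4:0] */
    /* sign-extend from 6 bits (portable xor trick) */
    return (int32_t)((imm ^ 0x20u) - 0x20u);
}
```

Each compressed format shuffles its immediate bits differently, so a full decoder repeats this kind of bit-gather for every format, which is where the LUTs go.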
>
Despite the encoding's apparent attempt to limit how much the bits move around, the bits still move around enough to burn a lot of LUTs on MUXing.
>
>
I have found that there can be a lot of registers available if they are implemented in BRAMs. BRAMs have lots of depth compared to LUT RAMs. BRAMs have a one-cycle latency, but that is just part of the pipeline. In Q+, about 40k LUTs are being used just to keep track of registers (rename mappings and checkpoints).
If I used BRAMs with the current approach, I would likely need to burn 36 BRAMs, which would be a bit steep.
Given a lot of available registers, I keep considering trying a VLIW design similar to the Itanium, rotating registers and all. But I have a lot invested in OoO.
Flat register space in my case.
I considered a banked register set for ISR handling.
This has a mechanism that it will trigger a stall and use the stall to write registers back to a backing buffer or fetch them from a backing buffer.
Could potentially support a bigger register space by using a caching-like approach:
R1$, 32 or 64 regs, 6R3W native.
R2$, 256 .. 1024 regs, 1R1W
If using a register not currently in the R1$, it may write the existing register to the R2$ and fetch the requested register from the R2$.
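The R1$/R2$ idea above can be sketched in software as a small direct-mapped cache in front of a larger backing register space. The sizes and the direct-mapped replacement policy here are illustrative assumptions, not the actual hardware design:

```c
#include <stdint.h>
#include <string.h>

/* Sketch of a caching-like register file: a small "R1$"
 * (32 regs) in front of a large "R2$" backing store (256
 * regs). A miss writes the resident register back to R2$
 * and fetches the requested one, modeling the stall-and-
 * swap mechanism described above. */
#define R1_REGS 32
#define R2_REGS 256

typedef struct {
    uint64_t r2[R2_REGS];      /* backing register space       */
    uint64_t r1_val[R1_REGS];  /* cached register values       */
    uint16_t r1_tag[R1_REGS];  /* which R2 reg each slot holds */
} RegCache;

static void regcache_init(RegCache *rc)
{
    memset(rc, 0, sizeof *rc);
    for (int i = 0; i < R1_REGS; i++)
        rc->r1_tag[i] = (uint16_t)i;  /* slot i starts holding reg i */
}

static uint64_t *regcache_access(RegCache *rc, uint16_t reg)
{
    int slot = reg % R1_REGS;                 /* direct-mapped */
    if (rc->r1_tag[slot] != reg) {
        rc->r2[rc->r1_tag[slot]] = rc->r1_val[slot]; /* write back */
        rc->r1_val[slot] = rc->r2[reg];              /* fetch */
        rc->r1_tag[slot] = reg;
    }
    return &rc->r1_val[slot];
}
```

In hardware, the miss path would be the stall case; hits would behave like a normal 6R3W register file over the R1$ array.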
Q+ has seven in-order pipeline stages before things get to the re-order buffer. Fetch (get instruction from I$), mux (for inserting interrupts and micro-code), vec (vector instruction expand), pac (pack nops resulting from expand), dec (decode), ren (rename registers), then que (queue to reorder buffer). After in-order stages are issue, execute, and commit.
FWIW: I have an 8 stage pipeline:
PF IF ID1 ID2 EX1 EX2 EX3 WB
Or:
PF IF ID RF EX1 EX2 EX3 WB
I wanted to go for an out-of-order design to hide memory latency, even if the more complex design ran at a lower clock rate. I have seen the CPU execute/complete up to about eight instructions in SIM after a store begins, for instance. Stores take about four clock cycles; loads take longer. The CPU really eats up the stores in function prolog code, starting multiple store operations before any complete. With a large enough store queue, it can begin executing the instructions after the prolog code.
In my testing, the current speed seemed near optimal (for what I can pull off).
If I make clock-speed faster at the expense of cache, the increase in penalties is enough to eat any gains.
If I made clock-speed slower, even with a 100% L1 hit-rate, overall performance would be lower (limited more by how quickly instructions can be executed).
So, say, a 25 MHz core with a perfect hit-rate (and low instruction latency) would be slower than a 50 MHz core with a stalling pipeline and ~ 95% hit rate.
But, a 50 MHz core with 95% hit-rate is faster than a 75 MHz core at 70-80% hit rate (if shrinking the L1 caches).
Though, potentially, 25 MHz could beat 50 MHz if I could get around 3.0 to 3.5 IPC (say, with a 4W OoO design).
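The clock-vs-hit-rate tradeoffs above can be put through a back-of-envelope model: effective MIPS = f / (CPI_base + miss_rate * miss_penalty). The CPI_base values and 40-cycle miss penalty below are illustrative assumptions, not measured figures from the core:

```c
/* Back-of-envelope throughput model for the tradeoffs above.
 * All parameters are assumed, not measured:
 *   mhz       - clock in MHz
 *   cpi_base  - cycles per instruction at 100% hit rate
 *   miss_rate - L1 miss fraction (e.g. 0.05 for 95% hits)
 *   penalty   - extra cycles per miss
 * Returns effective MIPS. */
static double eff_mips(double mhz, double cpi_base,
                       double miss_rate, double penalty)
{
    return mhz / (cpi_base + miss_rate * penalty);
}
```

With, say, CPI_base = 1.5 and a 40-cycle penalty: 50 MHz at 95% hits gives 50/3.5, roughly 14.3 MIPS, while 75 MHz at 75% hits gives 75/11.5, roughly 6.5 MIPS; and 25 MHz at ~3.0 IPC with perfect hits gives ~75 MIPS. That matches the ordering described above, under these assumed numbers.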
I am almost left wondering if I would be better off focusing on 2-wide in-order, except that the overall cost difference isn't that large (I could save a similar amount of LUTs mostly by disabling the Shift-Add unit and FP-SIMD unit).
So, a "notably cheaper" core would require multiple compromises:
2-wide with a 4R2W register file;
Dropping 64-bit integer multiply and divide
But, these are needed for RV64 'M', *;
Dropping fast FP-SIMD;
Also dropping FP divide and square-root.
But, simply dropping to 2 wide doesn't save much.
Whereas 3-wide avoids some penalty cases, it does seem like a waste to have a 3-wide core where the 3rd lane is only really used for spare register ports and the occasional ALU instruction (I ended up stripping most other functionality from the 3rd lane; since it is infrequently used, I can't really justify the cost of supporting much beyond basic ALU ops and similar).
But, OTOH, a 2-wide core could not support operations that need 3x 128-bit inputs (such as parts of the 128-bit SIMD ISA, or the XMOV extension).
*: Also, GCC only allows RV in various known configurations, and annoyingly "RV64IFD or RV64ID" isn't valid as far as GCC is concerned. In effect, one is not allowed to have an FPU without integer multiply and divide, and for 64-bit it needs to provide 64-bit multiply and divide. The Shift-Add unit was the cheapest way I could come up with to do so, but it still isn't particularly cheap.
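For reference, a minimal sketch of the kind of iterative algorithm a shift-add multiply unit implements (one partial product per step; the actual unit's implementation details are not described in the thread, so this is only the textbook algorithm):

```c
#include <stdint.h>

/* Textbook shift-add multiply: accumulate a shifted copy of
 * 'a' for each set bit of 'b'. In hardware this is typically
 * one add per cycle, which is why it is cheap in LUTs but
 * slow in cycles. Result is mod 2^64, matching RV64 MUL. */
static uint64_t shift_add_mul(uint64_t a, uint64_t b)
{
    uint64_t acc = 0;
    for (int i = 0; i < 64; i++) {
        if (b & 1u)
            acc += a;   /* add current partial product */
        a <<= 1;
        b >>= 1;
    }
    return acc;
}
```

A hardware version can also share the adder/shifter with divide, which is presumably part of why it is the cheapest way to satisfy the 'M' requirement.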
Granted, the FPU is also expensive, but also "more necessary"...
For cheaper cases, one is almost better off focusing on RV64I or RV32IM.
Though, at present, I don't have any dedicated RV64I or RV32IM cores that work on the ringbus (existing options mostly use AHB/AXI/Wishbone/etc.).
Main merit for RV32IM is that it would be easier to fit into an XC7S25 or similar (but, I haven't used the XC7S25 much as the main board I have with this FPGA also lacks any external RAM).
There are some boards with an XC7A35T and a 512K RAM module.
I am not aware of any still-available boards with this FPGA and a DDR RAM module. There used to be the Arty A35T, but IIRC this board got dropped (and the Basys3 board seems to lack external RAM).
Also, at this point one is seemingly hard-pressed to find sub-$100 FPGA boards that have a RAM module and don't require a dedicated JTAG programming cable. One generally needs a JTAG cable for the QMTECH boards, and there is seemingly no way to boot the FPGA without one. They did sell companion boards with a built-in RP2040, the idea being that one sticks the bitstream into the RP2040 and uses a special ROM to initialize the FPGA, but I have not had much luck making this part work. (The Flash on the RP2040 is just barely big enough to fit a bitstream file.)
But, I guess it lessens the need for smaller cores if the FPGA boards that would need them are falling off the bottom. Most of the still-remaining boards are big enough to run the BJX2 core in some form, and the exceptions with smaller FPGAs lack RAM, rendering it moot to try anything beyond a small microcontroller with them.
...