Newsportal USENET - Re: Tonights Tradeoff

On 9/11/2024 8:54 AM, Robert Finch wrote:

On 2024-09-10 5:07 p.m., BGB wrote:
On 9/10/2024 9:58 AM, Robert Finch wrote:
On 2024-09-10 3:00 a.m., BGB wrote:
>

...

>
One note: annoyingly, despite claims that the RV 'C' extension is cheap to decode, it still has an annoyingly high LUT impact. It was worse, but I was able to reduce it slightly in the immediate decoding (as-is, it initially decodes immediate values into 12-bit forms, and then does a final extension to 33 bits, with a special case for the LUI immediate).
>
Despite the encoding's seeming attempt to limit how much the bits move around, the bits still move around significantly, and this results in burning a lot of LUTs on MUX'ing.
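To illustrate the bit shuffling being described, here is a minimal sketch of immediate extraction for two RVC formats (C.ADDI and C.J), following the field layouts in the RISC-V 'C' extension spec. The helper names (`bit`, `sign_extend`, `c_addi_imm`, `c_j_offset`) are made up for this example, and it does not reproduce the 12-bit/33-bit two-step decode mentioned above; it only shows where each bit has to be routed from.

```python
def bit(inst, n):
    """Return bit n of a 16-bit instruction word."""
    return (inst >> n) & 1


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def c_addi_imm(inst):
    # C.ADDI: imm[5] comes from inst[12], imm[4:0] from inst[6:2].
    # (Illustrative only: the opcode/funct3 fields are not checked here.)
    raw = (bit(inst, 12) << 5) | ((inst >> 2) & 0x1F)
    return sign_extend(raw, 6)


def c_j_offset(inst):
    # C.J: inst[12:2] holds offset[11|4|9:8|10|6|7|3:1|5], in that order.
    # Each term below is effectively one MUX input per output bit in hardware.
    raw = (
        (bit(inst, 12) << 11)
        | (bit(inst, 11) << 4)
        | (((inst >> 9) & 0x3) << 8)
        | (bit(inst, 8) << 10)
        | (bit(inst, 7) << 6)
        | (bit(inst, 6) << 7)
        | (((inst >> 3) & 0x7) << 1)
        | (bit(inst, 2) << 5)
    )
    return sign_extend(raw, 12)
```

For example, `c_addi_imm(0x0505)` (c.addi a0, 1) gives 1, and `c_j_offset(0xA005)` gives 32, since inst[2] carries offset bit 5. The same destination bit is fed from a different source bit in nearly every format, which is where the MUX cost comes from.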
>
>
I have found that there can be a lot of registers available if they are implemented in BRAMs. BRAMs have lots of depth compared to LUT RAMs. BRAMs have a one-cycle latency, but that is just part of the pipeline. In Q+ about 40k LUTs are being used just to keep track of registers (rename mappings and checkpoints).

If I used BRAMs with the current approach, I would likely need to burn 36 BRAMs, this would be a bit steep,

Given a lot of available registers, I keep considering trying a VLIW design similar to the Itanium, rotating registers and all. But I have a lot invested in OoO.

Flat register space in my case.
I considered a banked register set for ISR handling.
This has a mechanism that triggers a stall and uses the stall to write registers back to a backing buffer or fetch them from a backing buffer.
Could potentially support a bigger register space by using a caching-like approach:
R1$, 32 or 64 regs, 6R3W native.
R2$, 256 .. 1024 regs, 1R1W
If using a register not currently in the R1$, it may write the existing register to the R2$ and fetch the requested register from the R2$.
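A rough behavioral sketch of the R1$/R2$ idea, purely for illustration: the class name `RegisterCache`, the LRU victim choice, and the one-stall-per-miss accounting are assumptions made for this example, not details given above. The port counts (6R3W vs 1R1W) are not modeled.

```python
from collections import OrderedDict


class RegisterCache:
    """Small fast register file (R1$) backed by a larger, slower one (R2$)."""

    def __init__(self, r1_size=32, r2_size=1024):
        self.r1_size = r1_size
        self.r1 = OrderedDict()    # register number -> value, oldest first
        self.r2 = [0] * r2_size    # backing store for all architectural regs
        self.stalls = 0            # number of misses (each implies a stall)

    def _ensure_resident(self, reg):
        if reg in self.r1:
            self.r1.move_to_end(reg)   # mark as most recently used
            return
        self.stalls += 1               # miss: the pipeline would stall here
        if len(self.r1) >= self.r1_size:
            # Assumed policy: evict the least recently used register to R2$.
            victim, value = self.r1.popitem(last=False)
            self.r2[victim] = value
        # Fetch the requested register from R2$.
        self.r1[reg] = self.r2[reg]

    def read(self, reg):
        self._ensure_resident(reg)
        return self.r1[reg]

    def write(self, reg, value):
        self._ensure_resident(reg)
        self.r1[reg] = value
```

With `RegisterCache(r1_size=2, r2_size=8)`, writing registers 0, 1, and 2 in turn spills register 0 to the backing store; a later `read(0)` misses, spills register 1, and returns the saved value. Whether the stall count stays tolerable would depend heavily on how the compiler allocates registers.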

Q+ has seven in-order pipeline stages before things get to the re-order buffer: fetch (get instruction from I$), mux (for inserting interrupts and micro-code), vec (vector instruction expand), pac (pack nops resulting from expand), dec (decode), ren (rename registers), then que (queue to reorder buffer). After the in-order stages come issue, execute, and commit.

FWIW: I have an 8 stage pipeline:
PF IF ID1 ID2 EX1 EX2 EX3 WB
Or:
PF IF ID RF EX1 EX2 EX3 WB

I wanted to go for an out-of-order design to hide memory latency, even if the more complex design ran at a lower clock rate. I have seen the CPU execute/complete up to about eight instructions in SIM after a store begins, for instance. Stores take about four clock cycles; loads take longer. The CPU really eats up the stores in function prolog code: it starts multiple store operations before any complete. With a large enough store queue, it can begin executing the instructions after the prolog code.

In my testing, the current speed seemed near optimal (for what I can pull off).
If I make the clock speed faster at the expense of cache size, the increase in miss penalties is enough to eat any gains.
If I made the clock speed slower, even with a 100% L1 hit-rate, overall performance would be lower (limited more by how quickly instructions can be executed).
So, say, a 25 MHz core with a perfect hit-rate (and low instruction latency) would be slower than a 50 MHz core with a stalling pipeline and ~ 95% hit rate.
But, a 50 MHz core with 95% hit-rate is faster than a 75 MHz core at 70-80% hit rate (if shrinking the L1 caches).
Though, potentially, 25 MHz could beat 50 MHz if I could get around 3.0 to 3.5 IPC (say, with a 4W OoO design).
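As a back-of-the-envelope check of the comparisons above, here is a toy throughput model. Everything numeric in it is an assumption picked for illustration (base CPI of 1.0, 0.3 memory references per instruction, a fixed 200 ns miss penalty), not a measurement from either core; the function name `mips` is likewise made up. The key point it captures is that a miss costs a fixed time, so it costs more cycles at a higher clock.

```python
def mips(clock_mhz, hit_rate, base_cpi=1.0,
         mem_refs_per_instr=0.3, miss_penalty_ns=200.0):
    """Estimate millions of instructions per second for a stalling pipeline.

    All default parameters are illustrative assumptions.
    """
    # A miss takes a fixed wall-clock time, so more cycles at a faster clock.
    penalty_cycles = miss_penalty_ns * clock_mhz / 1000.0
    # Average cycles per instruction, including miss stalls.
    cpi = base_cpi + mem_refs_per_instr * (1.0 - hit_rate) * penalty_cycles
    return clock_mhz / cpi


if __name__ == "__main__":
    print("25 MHz, 100% hits:     ", mips(25, 1.00))
    print("50 MHz, 95% hits:      ", mips(50, 0.95))
    print("75 MHz, 75% hits:      ", mips(75, 0.75))
    print("25 MHz, 100% hits, 3 IPC:", mips(25, 1.00, base_cpi=1 / 3.0))
```

Under these assumed numbers the ordering matches the claims: 50 MHz at 95% beats both 25 MHz with perfect hits and 75 MHz at 75%, while a 25 MHz core sustaining 3 IPC beats all three. Different penalty or reference-rate assumptions can move the crossover points.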
I am almost left wondering if I would be better off focusing on 2-wide in-order, except that the overall cost difference isn't that large (I could save a similar amount of LUTs mostly by disabling the Shift-Add unit and FP-SIMD unit).
So, a "notably cheaper" core would require multiple compromises:
   2-wide with a 4R2W register file;
   Dropping 64-bit integer multiply and divide (but these are needed for RV64 'M', *);
   Dropping fast FP-SIMD;
   Also dropping FP divide and square-root.
But, simply dropping to 2-wide doesn't save much.
Whereas 3-wide avoids some penalty cases, it does seem like a waste to have a 3-wide core where the 3rd lane is only really used for spare register ports and the occasional ALU instruction (I ended up stripping most other functionality from the 3rd lane; as noted, it is infrequently used, so I can't really justify the cost of supporting much beyond basic ALU ops and similar).
But, OTOH, a 2-wide core could not support operations that need 3x 128-bit inputs (such as parts of the 128-bit SIMD ISA, or the XMOV extension).
*: Also, GCC only allows RV in various known configurations, and annoyingly "RV64IFD or RV64ID" isn't valid as far as GCC is concerned (in effect, one is not allowed to have an FPU without integer multiply and divide, and for 64-bit, it needs to provide 64-bit multiply and divide; the Shift-Add unit was the cheapest way I could come up with to do so, but it still isn't particularly cheap).
Granted, the FPU is also expensive, but also "more necessary"...
For cheaper cases, almost better off focusing on RV64I or RV32IM.
Though, at present, I don't have any dedicated RV64I or RV32IM cores that work on the ringbus (existing options mostly using AHB/AXI/Wishbone/etc).
Main merit for RV32IM is that it would be easier to fit into an XC7S25 or similar (but, I haven't used the XC7S25 much as the main board I have with this FPGA also lacks any external RAM).
There are some boards with an XC7A35T and a 512K RAM module.
Not aware of any still available boards with this FPGA and a DDR RAM module. There used to be the Arty A35T, but IIRC this board got dropped (and the Basys3 board seems to lack external RAM).
Also, seemingly, at this point one is hard-pressed to find sub-$100 FPGA boards that also have a RAM module and don't require a dedicated JTAG programming cable. One generally needs a JTAG cable for the QMTECH boards, and there is seemingly no way to boot the FPGA without one. They did sell companion boards with a built-in RP2040, with the idea that one can stick the bitstream into the RP2040 and use a special ROM to initialize the FPGA, ..., but I have not had much luck making this part work (the Flash in the RP2040 is just barely big enough to fit a bitstream file).
But, I guess it lessens the need for smaller cores if the FPGA boards that would need the smaller cores are falling off the bottom (and most of the still remaining boards are big enough to run the BJX2 core in some form; and all the exceptions with smaller FPGAs lack RAM rendering it moot to try to do anything beyond a small microcontroller with them).
...
