Re: Tonights Tradeoff

Subject: Re: Tonights Tradeoff
From: robfi680 (at) *nospam* gmail.com (Robert Finch)
Newsgroups: comp.arch
Date: 11. Sep 2024, 15:54:53
Organization: A noiseless patient Spider
Message-ID: <vbs7ff$3koub$1@dont-email.me>
References: 1 2 3 4 5 6 7 8
User-Agent: Mozilla Thunderbird
On 2024-09-10 5:07 p.m., BGB wrote:
On 9/10/2024 9:58 AM, Robert Finch wrote:
On 2024-09-10 3:00 a.m., BGB wrote:
>
I haven't really understood how it could be implemented.
But, granted, my pipeline design is relatively simplistic, and my priority had usually been trying to make a "fast but cheap and simple" pipeline, rather than a "clever" pipeline.
>
Still not as cheap or simple as I would want.
>
>
>
Qupls has RISC-V style vector / SIMD registers. For Q+ every instruction can be a vector instruction, as there are bits in the instruction indicating which registers are vector registers. All the scalar instructions become vector instructions, which cuts down on some of the bloat in the ISA. There are only a handful of vector-specific instructions (about eight, I think). The drawback is that the ISA is 48 bits wide. However, the code bloat is less than 50%, as some instructions have dual operations. Branches can increment or decrement and loop. Bigfoot uses a postfix word to indicate the use of the vector form of an instruction. Bigfoot’s code density is a lot better, being variable length, but I suspect it will not run as fast. Bigfoot and Q+ share a lot of the same code; I am trying to make the guts of the cores generic.
>
>
In my case, the core ended up generic enough that it can support both BJX2 and RISC-V. Could almost make sense to lean more heavily into this (trying to consolidate more things and better optimize costs).
>
Did also recently get around to more-or-less implementing support for the 'C' extension, even if it is kinda dog-chewed and does not efficiently utilize the encoding space.
>
>
It burns a lot of encoding space on 6 and 8 bit immediate fields (with 11 bit branch displacements), more 5-bit register fields than ideal, ... so, has relatively few unique instructions, but:
Many of the instructions it does have are left with 3 bit register fields;
Has a bit too many immediate-field layouts, as it just sort of shoehorns immediate fields into whatever bits are left.
>
Though, it turns out I could skip a few things due to them being N/E in RV64 (RV32, RV64, and RV128 get slightly different selections of ops in the C extension).
>
Like, many things in RV land make "annoying and kinda poor" design choices.
>
Then again, if one assumes that the role of 'C' is mostly:
   Does SP-relative loads/stores and MOV-RR.
>
Well, it does do this at least...
>
Nevermind if you want to use any of the ALU ops (besides ADD), or non-stack-relative Load/Store; well then, enjoy the 3 bit register fields.
>
And, still way too many immediate-field encodings for what is effectively load/store and a few ALU ops.
>
>
>
>
I am not as much a fan of RISC-V's 'V' extension mostly in that it would require essentially doubling the size of the register file.
>
The register file in Q+ is huge; one of the drawbacks of supporting vectors. There were 1024 physical registers; I reduced that to 512, and it still may be too many. The mapping RAM was 4kb wide, resulting in a synthesis warning message. I may have to split components into multiple copies to get the desired size to work.
>
 I am dealing with 64 registers.
In RV Mode, it is split between the GPRs and FPRs, in BJX2 a unified GPR space;
V would mean either extending the register set to 128, or adding a separate 32*128 bit register file, which is AFAICT the effective minimum.
 Neither option would be good for resource cost.
   Cheaper would have been a SIMD system based on paired FPRs or similar.
 
>
And, if I were to do something like 'V' I would likely do some things differently:
Rather than having an instruction to load vector control state into CSR's, it would make more sense IMO to use bigger 64-bit instructions and encode the vector state directly into these instructions.
>
While this would be worse for code density, it would avoid needing to burn instructions setting up vector state, and would have less penalty (in terms of clock-cycles) if working with heterogeneous vectors.
>
>
Say, one possibility could be a combo-SIMD op with a control field:
   2b vector size
     64 / 128 / resv / resv
   2b element size
     8 / 16 / 32 / 64
   2b category
     wrap / modulo
     float
     signed saturate
     unsigned saturate
   6b operator
     add, sub, mul, mac, mulhi, ...
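The control-field layout above could be sketched as a small decoder. This is purely illustrative; the field order and packing (12 bits, high-to-low as listed) are my own assumptions, not a defined encoding:

```c
#include <stdint.h>

/* Hypothetical packing of the 12-bit SIMD control field described above:
 * [11:10] vector size, [9:8] element size, [7:6] category, [5:0] operator.
 * Field positions are assumptions for illustration only. */
typedef struct {
    unsigned vec_size;  /* 0=64-bit, 1=128-bit, 2-3 reserved   */
    unsigned elem_size; /* 0=8, 1=16, 2=32, 3=64 bits          */
    unsigned category;  /* 0=wrap, 1=float, 2=s.sat, 3=u.sat   */
    unsigned op;        /* add, sub, mul, mac, mulhi, ...      */
} simd_ctl;

static simd_ctl decode_simd_ctl(uint16_t f)
{
    simd_ctl c;
    c.vec_size  = (f >> 10) & 3;
    c.elem_size = (f >> 8)  & 3;
    c.category  = (f >> 6)  & 3;
    c.op        =  f        & 63;
    return c;
}

/* Lane count implied by the field: vector bits / element bits. */
static unsigned simd_lanes(simd_ctl c)
{
    unsigned vec_bits  = 64u << c.vec_size;  /* 64 or 128 */
    unsigned elem_bits = 8u  << c.elem_size; /* 8/16/32/64 */
    return vec_bits / elem_bits;
}
```

A 128-bit vector of 32-bit floats would then decode to 4 lanes, category "float", with the operator selecting among the ~64 possible ops.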
>
>
Q+ is set up almost that way. It uses 48b instructions. There is a 2b precision field in instructions that determines the lane/sub-element size, 8/16/32/64. The precision field also applies to scalar registers. The category is wrapped up in the opcode, which is seven bits. One can do a float add on a vector register, then a bitwise operation on the same register. The vector registers work the same way as the scalar ones. There is no type state associated with them, unlike RISC-V. To control the length (which lanes are active) there is a global mask register instead of a vector length register.
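The precision-field-plus-mask scheme can be modeled in software. Below is a minimal sketch of a lane-masked add over one 64-bit element, where a 2-bit precision field selects the lane width and a mask register gates which lanes are written; the merge behavior for inactive lanes (keep the first operand) is an assumption:

```c
#include <stdint.h>

/* Lane-masked add: prec selects 8/16/32/64-bit lanes within a 64-bit
 * element; bit i of mask enables lane i.  Inactive lanes keep the value
 * from 'a' (merge semantics assumed for illustration). */
static uint64_t masked_vadd64(uint64_t a, uint64_t b,
                              unsigned prec, uint64_t mask)
{
    unsigned lane_bits = 8u << prec;      /* 8/16/32/64 */
    unsigned lanes     = 64u / lane_bits;
    uint64_t lane_mask = (lane_bits == 64) ? ~0ull
                       : ((1ull << lane_bits) - 1);
    uint64_t r = 0;
    for (unsigned i = 0; i < lanes; i++) {
        uint64_t la = (a >> (i * lane_bits)) & lane_mask;
        uint64_t lb = (b >> (i * lane_bits)) & lane_mask;
        uint64_t lr = ((mask >> i) & 1) ? (la + lb) & lane_mask : la;
        r |= lr << (i * lane_bits);
    }
    return r;
}
```

With prec=1 (16-bit lanes) and mask 0b0101, only lanes 0 and 2 are added; lanes 1 and 3 pass through unchanged.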
>
Sign control plus a vector indicator for each register spec results in a seven-bit spec, and there are four registers encoded in an instruction, which uses 28 bits; combined with a seven-bit opcode that is 35 bits. There was just no way the instruction set was fitting in 32b. For a while the ISA was 40-bit, but I figured it was better to go 48-bit and then add some additional functionality to make up for the wider ISA.
>
 My leaning for 64-bit was mostly so that it does not break 32-bit alignment for the 32-bit instructions.
 In this case:
WEX currently needs 32-bit alignment;
If I added superscalar to BJX2, it would likely also require 32-bit alignment.
 
>
>
Though, with not every combination necessarily being allowed.
Say, for example, if the implementation limits FP-SIMD to 4 or 8 vector elements.
>
Though, it may make sense to be asymmetric as well:
   2-wide vectors can support Binary64
   4-wide can support Binary32
   8-wide can support Binary16 ( + 4x FP16 units)
   16 can support FP8 ( + 8x FP8 units)
>
Whereas, say, 16x Binary32 capable units would be infeasible.
>
Well, as opposed to defining encodings one-at-a-time in the 32-bit encoding space.
>
>
It could be tempting to consider using pipelining and multi-stage decoding to allow some ops as well. Say, possibly handling 8-wide vectors internally as 2x 4-wide operations, or maybe allowing 256-bit vector ops in the absence of 256-bit vector hardware.
>
...
>
Q+ has two ALUs, which may, at some point, be expanded by two more ALUs with reduced functionality.
>
 I have 3x 64-bit ALUs.
The first 2 may combine for 128-bit operations.
 
It sounds great, but I cannot seem to get Q+ to synthesize correctly. It reports the size as 45k LUTs, but I know the size is about double that, based on previous synthesis runs. A bunch of the components are showing up as zero-sized in the synth report. Figuring out why stuff is being stripped out is a challenge. It runs in simulation, at least for a few instructions. If components are being stripped out, why does it work in SIM? Scratches head. It does break the magical IPC of 1.0.
>
 That lower estimate is still bigger than the BJX2 Core ATM...
   I was left to ponder ideas for a possible newer ISA design, intended mostly for "simplification and clean up".
 Possible high-level summary:
   32/64/96 bit instructions (like XG2);
   64 GPRs (same);
   May design CPU to more natively co-execute RISC-V;
     Would be tempting to allow RISC-V as the administrative ISA.
   Would likely go over primarily to superscalar.
 Encoding:
   Try to make the encodings cleaner and more consistent.
 Would likely change the register space slightly, say:
   R0=ZZR / PC
   R1=LR
   R2/R3: GPRs
   R4..R13: GPRs
   R14=GBR (GP)
   R15=SP
   R16..R63: GPRs
 In RV mode:
   R2 and R3 move to X14 and X15;
   SP moves to X2, GBR moves to X3 (GP);
   Other registers are identity mapped.
 The elimination of a few registers would allow task switch code from RV Mode to work with the hypothetical new ISA. Limiting the scope of changes would also limit the amount of work needed in BGBCC (existing ABI would carry over with minimal changes, apart from the loss of R14 as a usable GPR).
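The register renumbering described above (R2/R3 appearing as X14/X15 in RV mode, with SP and GBR moving to the RISC-V conventional X2/X3 slots) amounts to a small permutation. A sketch, with the mapping taken from the description above and the function name my own:

```c
/* Map a register number in the hypothetical new ISA to its RV-mode
 * architectural number.  Per the description: R2->X14, R3->X15,
 * SP (R15)->X2, GBR (R14)->X3 (GP); everything else identity-mapped.
 * Together these four form a permutation, so no two registers collide. */
static unsigned isa_to_rv(unsigned r)
{
    switch (r) {
    case 2:  return 14;  /* R2  -> X14      */
    case 3:  return 15;  /* R3  -> X15      */
    case 15: return 2;   /* SP  -> X2       */
    case 14: return 3;   /* GBR -> X3 (GP)  */
    default: return r;   /* identity        */
    }
}
```

Because the mapping is a bijection over 0..63, task-switch code can save/restore one 64-entry file and have it be valid in either mode.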
  It is also possible that such a CPU could be made to boot natively in RISC-V Mode.
 Why then bother with a custom ISA? ...
   Mostly because RISC-V is still slower than ideal.
 Though, a possibility could be to do a similar core just using a tweaked version of my existing XG2 encoding. This would break some of my existing ASM code (though that could be lessened by quietly remapping R0/R1/R14 to R48/R49/R62 (F16/F17/F30), at which point the registers would be essentially aliased).
 Can note: despite claims of the RV 'C' extension's cheapness to decode, it still has an annoyingly high LUT impact. It was worse, but I was able to reduce it slightly in the immediate decoding (as-is, it initially decodes immediate values into 12-bit forms, and then does a final extension to 33 bits, with a special case for the LUI immediate).
 Despite the encoding's apparent attempt to limit how much the bits move around, they still move around enough that a lot of LUTs get burned on MUXing.
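To make the scattering concrete, here is what the bit shuffling looks like for two RVC immediates (bit positions per the RISC-V 'C' extension spec; the helper functions themselves are just for illustration). Each `>>`/`&`/`<<` pair below corresponds to a MUX path in hardware:

```c
#include <stdint.h>

/* CI-format immediate (e.g. C.ADDI): imm[5]=inst[12], imm[4:0]=inst[6:2],
 * sign-extended from 6 bits. */
static int32_t rvc_ci_imm(uint16_t inst)
{
    uint32_t imm = (((inst >> 12) & 1) << 5) | ((inst >> 2) & 0x1f);
    return (int32_t)(imm << 26) >> 26;   /* sign-extend 6 -> 32 bits */
}

/* CL-format C.LW offset (zero-extended): imm[5:3]=inst[12:10],
 * imm[2]=inst[6], imm[6]=inst[5].  Note imm[6] comes from a LOWER
 * instruction bit than imm[2] -- this is the kind of movement that
 * costs LUTs. */
static uint32_t rvc_cl_lw_off(uint16_t inst)
{
    return (((inst >> 10) & 7) << 3)
         | (((inst >> 6)  & 1) << 2)
         | (((inst >> 5)  & 1) << 6);
}
```

Every compressed format permutes the immediate bits differently, so a combined decoder ends up MUXing nearly every output bit from several possible source bits.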
 
I have found that there can be a lot of registers available if they are implemented in BRAMs. BRAMs have lots of depth compared to LUT RAMs. BRAMs have a one-cycle latency, but that is just part of the pipeline. In Q+ about 40k LUTs are being used just to keep track of registers (rename mappings and checkpoints).
Given a lot of available registers, I keep considering trying a VLIW design similar to the Itanium, rotating registers and all. But I have a lot invested in OoO.
Q+ has seven in-order pipeline stages before things get to the re-order buffer: fetch (get instruction from I$), mux (for inserting interrupts and micro-code), vec (vector instruction expand), pac (pack NOPs resulting from expand), dec (decode), ren (rename registers), then que (queue to reorder buffer). After the in-order stages come issue, execute, and commit.
I wanted to go for an out-of-order design to hide memory latency, even if the more complex design ran at a lower clock rate. For instance, I have seen the CPU execute/complete up to about eight instructions in SIM after a store begins; stores take about four clock cycles, and loads take longer. The CPU really eats up the stores in function prolog code, starting multiple store operations before any complete. With a large enough store queue, it can begin executing the instructions after the prolog code.
