On 4/20/2024 12:07 PM, MitchAlsup1 wrote:
John Savard wrote:
On Sat, 20 Apr 2024 01:09:53 -0600, John Savard
<quadibloc@servername.invalid> wrote:
And, hey, I'm not the first guy to get sunk because of forgetting what
lies under the tip of the iceberg that's above the water.
That also happened to the captain of the _Titanic_.
Concer-tina-tanic !?!
Seems about right.
Seems like a whole lot of flailing, with designs that end up needlessly complicated...
Meanwhile, I have looked around and noted:
In some ways, RISC-V is sort of like MIPS with the field order reversed, and (ironically) actually smaller immediate fields (MIPS was using a lot of Imm16 fields, whereas RISC-V mostly used Imm12).
But, it seemed to have more wonk:
A mode with 32x 32-bit GPRs;
A mode with 32x 64-bit GPRs;
Apparently a mode with 32x 32-bit GPRs that can be paired to 16x 64-bits as needed for 64-bit operations?...
Integer operations (on 64-bit registers) that give UB or trap if values are outside of signed Int32 range;
Other operations that sign-extend the values but are ironically called "unsigned" (apparently similar wonk to RISC-V, with its sign-extended Unsigned Int);
Branch operations are bit-sliced;
...
I had preferred a different strategy in some areas:
Assume non-trapping operations by default;
Sign-extend signed values, zero-extend unsigned values.
Though, this is partly why some operations in my case assume 33-bit sign-extended values: a 33-bit signed value can represent both the signed and unsigned 32-bit ranges.
One could argue that sign-extending both could save 1 bit in some cases. But this creates wonk in other cases, such as requiring an explicit zero extension for "unsigned int" to "long long" casts, and more cases where separate instructions are needed for the Int32 and Int64 variants (say, for example, RISC-V needed around 4x as many Int<->Float conversion operators due to its design choices in this area).
Say:
RV64:
Int32<->Binary32, UInt32<->Binary32
Int64<->Binary32, UInt64<->Binary32
Int32<->Binary64, UInt32<->Binary64
Int64<->Binary64, UInt64<->Binary64
BJX2:
Int64<->Binary64, UInt64<->Binary64
With the UInt64 case mostly added because otherwise a wonky edge case is needed to deal with it (though this is rare in practice).
The separate 32-bit cases were avoided by tending to normalize everything to Binary64 in registers (with Binary32 only existing in SIMD form or in memory).
Annoyingly, I did end up needing to add logic for all of these cases to deal with RV64G.
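As a rough illustration of the cast-wonk in question, a minimal C sketch (helper names are hypothetical; it assumes 32-bit values live extended to 64 bits in registers as described above):

  #include <stdint.h>

  /* If "unsigned int" is kept sign-extended in a 64-bit register,
     casting it to a 64-bit type needs an explicit zero extension
     (an extra instruction): */
  int64_t u32_to_i64_signext_conv(int64_t reg)
      { return reg & 0xFFFFFFFFLL; }

  /* If it is kept zero-extended instead, the widening cast is a no-op: */
  int64_t u32_to_i64_zeroext_conv(int64_t reg)
      { return reg; }

A zero-extended UInt32 is also a valid 33-bit sign-extended value, so the same register contents work under either interpretation.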
Currently there are no plans to implement RISC-V's Privileged ISA stuff, mostly because it would likely be unreasonably expensive. It is in theory possible to write an OS to run in RISC-V mode, but it would need to deal with the different OS-level and hardware-level interfaces (in much the same way as I needed to use a custom linker script for GCC, since my stuff uses a different memory map from the one GCC had assumed; namely, RAM starting at the 64K mark rather than at the 16MB mark).
In my case, there are in some places distinctions between 32-bit and 64-bit compare-and-branch ops. I am left thinking this distinction may be unnecessary, and that one may only need the 64-bit compare-and-branch.
In the emulator, the current difference is mostly that the 32-bit version checks whether the 32-bit and 64-bit versions would give different results, and faults if so, since this generally means there is a bug elsewhere (such as other code producing out-of-range values).
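A minimal sketch of that emulator-side check (hypothetical helper names; the real emulator's structure differs):

  #include <stdint.h>

  extern void emu_fault(const char *why);   /* hypothetical fault hook */

  /* 32-bit compare-and-branch: compute both the 32-bit and 64-bit
     results and fault on a mismatch, since a mismatch generally means
     some other code produced an out-of-range (non-sign-extended) value. */
  int cmp_branch_gt32(int64_t rs, int64_t rt)
  {
      int r64 = (rs > rt);
      int r32 = ((int32_t)rs > (int32_t)rt);
      if (r32 != r64)
          emu_fault("32/64-bit compare mismatch");
      return r32;
  }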
For a few newer cases (such as the 3R compare ops, which produce a 1-bit output in a register), I had only defined 64-bit versions.
One could just ignore the distinction between 32- and 64-bit compare in hardware, but the encoding space has still been burnt on it. In a new ISA design, I would likely drop the existence of the 32-bit compares and use exclusively 64-bit compares.
In many cases, the distinction between 32-bit and 64-bit operations, or between 2R and 3R cases, had ended up less significant than originally thought (and I have now ended up gradually deprecating and disabling some of the 32-bit 2R encodings, mostly due to "lack of relevance").
Though, admittedly, part of the reason a lot of separate 2R cases existed was that I had initially been under the impression that there might be a performance-cost difference between 2R and 3R instructions. This ended up not really being the case, as the various units ended up typically using 3R internally anyway.
So, say, one needs an ALU with:
2 inputs, one output;
Ability to bit-invert the second input (along with inverting carry-in, ...);
Ability to sign- or zero-extend the output.
So, say, operations:
ADD / SUB (Add/Subtract, 64-bit)
ADDSL / SUBSL (Add/Subtract, 32-bit, sign-extended)
ADDUL / SUBUL (Add/Subtract, 32-bit, zero-extended)
AND
OR
XOR
CMPEQ
CMPNE
CMPGT (CMPLT implicit)
CMPGE (CMPLE implicit)
CMPHI (unsigned GT)
CMPHS (unsigned GE)
...
Where, internally, compare works by performing a subtract and then producing a result based on some status bits (Z,C,S,O). As I see it, these bits ideally should not be exposed at the ISA level though (much pain and hair results from the existence of architecturally visible ALU status-flag bits).
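Roughly, as a sketch (plain C stand-in for the internal logic, not the actual RTL; the flag derivations are the usual textbook ones):

  #include <stdint.h>

  typedef struct { int z, c, s, o; } alu_flags;  /* Zero, Carry(borrow), Sign, Overflow */

  /* Subtract = add with the second input bit-inverted and carry-in set. */
  static alu_flags sub_flags(uint64_t a, uint64_t b)
  {
      uint64_t r = a + (~b) + 1;
      alu_flags f;
      f.z = (r == 0);
      f.c = (a < b);                                  /* borrow out */
      f.s = ((int64_t)r < 0);
      f.o = (int)((((a ^ b) & (a ^ r)) >> 63) & 1);   /* signed overflow */
      return f;
  }

  /* Compare results are then functions of the internal flags, without
     the flags themselves being architecturally visible: */
  static int cmp_eq(alu_flags f) { return  f.z; }
  static int cmp_ne(alu_flags f) { return !f.z; }
  static int cmp_gt(alu_flags f) { return !f.z && (f.s == f.o); } /* signed >   */
  static int cmp_ge(alu_flags f) { return  f.s == f.o;         } /* signed >=  */
  static int cmp_hi(alu_flags f) { return !f.z && !f.c;        } /* unsigned > */
  static int cmp_hs(alu_flags f) { return !f.c;                } /* unsigned >= */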
Some other features could still be debated though, along with how much simplification could be possible.
If I did a new design, I would probably still keep predication and jumbo prefixes.
Explicit bundling vs superscalar could be argued either way, as superscalar isn't as expensive as initially thought, but in a simpler form is comparably weak. The compiler has the advantage that it can invest more expensive analysis into this, reorder instructions, etc.; but this only goes as far as the compiler's understanding of the CPU's pipeline, it ties the code to a specific pipeline structure, and it becomes effectively moot with OoO CPU designs.
So, a case could be made that a "general use" ISA should be designed without explicit bundling. In my case, using the bundle flags also requires the code to use an instruction to signal to the CPU what pipeline configuration it expects to run on, with the CPU able to fall back to scalar (or superscalar) execution if this does not match.
For the most part, thus far nearly everything has ended up as "Mode 2", namely:
3 lanes;
Lane 1 does everything;
Lane 2 does Basic ALU ops, Shift, Convert (CONV), ...
Lane 3 only does Basic ALU ops and a few CONV ops and similar.
Lane 3 originally also did Shift, dropped to reduce cost.
Mem ops may eat Lane 3, ...
Where, say:
Mode 0 (Default):
Only scalar code is allowed, CPU may use superscalar (if available).
Mode 1:
2 lanes:
Lane 1 does everything;
Lane 2 does ALU, Shift, and CONV.
Mem ops take up both lanes.
Effectively scalar for Load/Store.
It was later defined that 128-bit MOV.X is allowed in a Mode 1 core.
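As a sketch of how the Mode 2 lane rules above might be expressed (hypothetical encoding, not the actual decoder; the op classes are coarser than the real ISA's, and I am assuming the mem op sits in Lane 1):

  #include <stdbool.h>

  typedef enum {
      OPC_NONE,  /* empty slot */
      OPC_ALU, OPC_SHIFT, OPC_CONV, OPC_MEM, OPC_FPU, OPC_OTHER
  } op_class;

  static bool lane2_ok(op_class c)
      { return c == OPC_NONE || c == OPC_ALU || c == OPC_SHIFT || c == OPC_CONV; }
  static bool lane3_ok(op_class c)
      { return c == OPC_NONE || c == OPC_ALU || c == OPC_CONV; }

  /* Mode 2: Lane 1 may hold anything; a memory op also "eats" Lane 3. */
  static bool mode2_bundle_ok(op_class l1, op_class l2, op_class l3)
  {
      if (l1 == OPC_MEM && l3 != OPC_NONE)
          return false;
      return lane2_ok(l2) && lane3_ok(l3);
  }

If the CPU does not implement the mode the code was built for, it can simply ignore the bundle flags and issue the same instructions one at a time.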
I had defined wider modes, and ones that allow dual-lane IO and FPU instructions, but these haven't seen use (too expensive to support in hardware).
I had ended up with an ambiguous "extension" to the Mode 2 rules: allowing an FPU instruction to be executed from Lane 2 if there is not an FPU instruction in Lane 1, or allowing certain FPU instructions to be co-issued if they effectively combine into a corresponding SIMD op.
In my current configurations, there is only a single memory access port. A second memory access port would help with performance, but it is a comparatively expensive feature (and doesn't help enough to justify its fairly steep cost).
For lower-end cores, a case could be made for assuming a 1-wide CPU with a 2R1W register file, but designing the whole ISA around this restriction and not allowing for anything more is limiting (and mildly detrimental to performance). If we can assume cores with an FPU, we can probably also assume cores with more than two register read ports available.
...