BGB wrote:
> Yeah.
>
> On 4/20/2024 12:07 PM, MitchAlsup1 wrote:
>> John Savard wrote:
>>> On Sat, 20 Apr 2024 01:09:53 -0600, John Savard
>>> <quadibloc@servername.invalid> wrote:
>>>> And, hey, I'm not the first guy to get sunk because of forgetting
>>>> what lies under the tip of the iceberg that's above the water.
>>> That also happened to the captain of the _Titanic_.
>> Concer-tina-tanic !?!
> Seems about right.
> Seems like a whole lot of flailing with designs that seem needlessly
> complicated...
>
> Meanwhile, I have looked around and noted: in some ways, RISC-V is
> sort of like MIPS with the field order reversed.

They, in effect, Little-Endian-ed the fields.

> I had gone further and used mostly 9/10-bit fields (mostly expanded
> to 10/12 in XG2), and (ironically) actually smaller immediate fields
> (MIPS was using a lot of Imm16 fields, whereas RISC-V mostly used
> Imm12).

Yes, RISC-V took a step back with the 12-bit immediates. My 66000, on
the other hand, only has 12-bit immediates for shift instructions--
allowing all shifts to reside in one Major OpCode; the rest (inst[31]=1)
have 16-bit immediates (universally sign-extended).
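A minimal C model of that universal 16-bit sign extension (the helper
name is mine, not from any ISA spec):

```c
#include <stdint.h>

/* Hypothetical helper: sign-extend a 16-bit immediate field to 64 bits,
   modeling "universally sign extended" immediates. */
static inline int64_t sx16(uint32_t imm16)
{
    return (int64_t)(int16_t)(imm16 & 0xFFFFu);
}
```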
> I had seen a video talking about the Nintendo 64, and it was saying
> that the 2x paired 32-bit register mode was used more often than the
> native 64-bit mode. The native 64-bit mode was slower, as apparently
> the chip couldn't fully pipeline the 64-bit ops, so using it came at
> a performance hit (vs using it to run glorified 32-bit code).
>
> But, MIPS seemed to have more wonk:
>   A mode with 32x 32-bit GPRs;        // unnecessary
>   A mode with 32x 64-bit GPRs;
>   Apparently a mode with 32x 32-bit GPRs that can be paired to
>   16x 64-bit as needed for 64-bit operations?...

Repeating the mistake I made on Mc 88100....
> No direct equivalent in my case, nor any desire to add these.
>
>   Integer operations (on 64-bit registers) that give UB or trap if
>   values are outside of signed Int32 range;

Isn't it just wonderful ??

>   Other operations that sign-extend the values but are ironically
>   called "unsigned" (apparently similar wonk to RISC-V, having a
>   sign-extended Unsigned Int);
>   Branch operations are bit-sliced;
>   ....
>
> Most are defined in ways that I feel are sensible.
>
> I had preferred a different strategy in some areas:
>   Assume non-trapping operations by default;
>   Sign-extend signed values, zero-extend unsigned values.

Assume trap / "do the expected thing" under a user-accessible flag.

> In my case, for the Baseline encoding, Ld/St displacements were
> unsigned only.

Another mistake I made on Mc 88100.
Do you sign-extend the 16-bit displacement on an unsigned LD ??
> It is a tradeoff. Though, this is partly the source of some operations
> in my case assuming 33-bit sign-extended values: these can represent
> both the signed and unsigned 32-bit ranges.

These are some of the reasons My 66000 is 64-bit register/calculation only.
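As a concrete check of that claim (the helper name is mine): the 33-bit
sign-extended range [-2^32, 2^32-1] contains both the Int32 and UInt32
ranges.

```c
#include <stdint.h>

/* Hypothetical helper: does v fit in a 33-bit sign-extended value? */
static inline int fits_s33(int64_t v)
{
    return v >= -(INT64_C(1) << 32) && v < (INT64_C(1) << 32);
}
```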
> Dunno; neither RISC-V nor BJX2 has this...
>
> One could argue that sign-extending both could save 1 bit in some
> cases. But this creates wonk in other cases, such as requiring an
> explicit zero extension for "unsigned int" to "long long" casts; and
> more cases where separate instructions are needed for Int32 and Int64
> (say, for example, RISC-V needed around 4x as many Int<->Float
> conversion operators due to its design choices in this area).

It also gets difficult when you consider EADD Rd,Rdouble,Rexponent ??
Is it an FP calculation or an integer calculation ?? If Rdouble is a
constant, is the constant FP or int; if Rexponent is a constant, is it
double or int, ..... Does it raise FP overflow or integer overflow ??
> I originally just had two instructions (FLDCI and FSTCI), but gave in
> and added more, because, say:
>
> RV64:
>   Int32<->Binary32, UInt32<->Binary32
>   Int64<->Binary32, UInt64<->Binary32
>   Int32<->Binary64, UInt32<->Binary64
>   Int64<->Binary64, UInt64<->Binary64
>
> BJX2:
>   Int64<->Binary64, UInt64<->Binary64

My 66000:
  int64_t  -> { uint64_t, float, double }
  uint64_t -> { int64_t, float, double }
  float    -> { uint64_t, int64_t, double }
  double   -> { uint64_t, int64_t, float }
> I had originally gone with just using a 32-bit load/store, along with
> a (Binary32<->Binary64) conversion instruction. The UInt64 case was
> mostly added because otherwise one needs a wonky edge case to deal
> with it (though it is rare in practice). The separate 32-bit cases
> were avoided by tending to normalize everything to Binary64 in
> registers (with Binary32 only existing in SIMD form or in memory).

I saved LD and ST instructions by leaving floats 32 bits in the registers.
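The "wonky edge case" for UInt64->Binary64 without a dedicated
instruction looks roughly like this in C (a common software fallback,
not any particular ISA's sequence):

```c
#include <stdint.h>

/* Values above INT64_MAX can't go through the signed convert directly;
   the usual fix halves the value with round-to-odd (keeping the low
   bit as a sticky bit), converts, then doubles. */
static double u64_to_f64(uint64_t x)
{
    if (x <= (uint64_t)INT64_MAX)
        return (double)(int64_t)x;      /* signed convert suffices */
    return 2.0 * (double)(int64_t)((x >> 1) | (x & 1));
}
```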
> It is a bit wonky, as I dealt with the scalar Binary32 ops for RV
> mostly by routing them through the logic for the SIMD ops. As far as
> most code should be concerned, it is basically the same (even if it
> does technically deviate from the RV64 spec, which defines the high
> bits of the register as encoding a NaN).
>
> Annoyingly, I did end up needing to add logic for all of these cases
> to deal with RV64G.

No rest for the wicked.....
> That, and the need for 3+ copies of the register file (for each
> operating mode), and the need for a hardware page-table walker, ...
>
> Currently no plans to implement RISC-V's Privileged ISA stuff, mostly
> because it would likely be unreasonably expensive.

The sea of control registers, or the sequencing model applied thereon ??

My 66000 allows access to all control registers via memory-mapped I/O space.
> For the Verilog version, the option is more like: it is in theory
> possible to write an OS to run in RISC-V mode, but it would need to
> deal with the different OS-level and hardware-level interfaces (in
> much the same way as I needed to use a custom linker script for GCC,
> since my stuff uses a different memory map from the one GCC had
> assumed; namely, RAM starting at the 64K mark rather than at the
> 16MB mark).
>
> In some cases, there are distinctions in my case between 32-bit and
> 64-bit compare-and-branch ops. I am left thinking this distinction
> may be unnecessary, and one may only need 64-bit compare-and-branch.

No 32-bit stuff, thereby no 32-bit distinctions needed.
> In the emulator, the current difference ended up being mostly that the
> 32-bit version sees whether the 32-bit and 64-bit versions would give
> different results, faulting if so, since this generally means there is
> a bug elsewhere (such as other code producing out-of-range values).

Saving vast amounts of power {{{not}}}
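A sketch of that emulator cross-check in C (names are mine): evaluate
the compare both ways and fault on mismatch, which catches inputs that
were not kept properly sign-extended.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical emulator helper: 32-bit signed compare, cross-checked
   against the 64-bit result; abort() stands in for an emulator fault. */
static int cmpgt32_checked(int64_t a, int64_t b)
{
    int r32 = (int32_t)a > (int32_t)b;
    int r64 = a > b;
    if (r32 != r64)
        abort();    /* out-of-range value: bug elsewhere */
    return r32;
}
```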
In "other ISA", these would be given different names:

> For a few newer cases (such as the 3R compare ops, which produce a
> 1-bit output in a register), I had only defined 64-bit versions.

Oh what a tangled web we.......

> The vast majority of the 2R ops are things like "Convert A into B" or
> similar.
>
> One could just ignore the distinction between 32- and 64-bit compare
> in hardware, but I had still burnt the encoding space on this. In a
> new ISA design, I would likely drop the existence of 32-bit compare
> and use exclusively 64-bit compare.
>
> In many cases, the distinction between 32-bit and 64-bit operations,
> or between 2R and 3R cases, had ended up less significant than
> originally thought (and I have now ended up gradually deprecating and
> disabling some of the 32-bit 2R encodings, mostly due to "lack of
> relevance").

I deprecated all of them.
> The ALU design in my case does not support inverting arbitrary inputs,
> only doing ADD/SUB in various forms.
>
> Though, admittedly, part of the reason for a lot of separate 2R cases
> existing was that I initially had the impression that there might be
> a performance-cost difference between 2R and 3R instructions. This
> ended up not really being the case, as the various units ended up
> typically using 3R internally anyways.
>
> So, say, one needs an ALU with, say:
>   2 inputs, one output;
>   Ability to bit-invert the second input, along with inverting
>   carry-in, ...
>   Ability to sign- or zero-extend the output.

You forgot carry, and inversion to perform subtraction.

So, My 66000's integer adder has 3 carry inputs, and I discovered a way
to perform these that takes no more gates of delay than the typical
1-carry-in 64-bit integer adder. This gives me a = -b - c; for free.
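The "a = -b - c for free" result follows from a two's-complement
identity: -b - c == ~b + ~c + 2, i.e. invert both inputs and inject two
extra carries. A C model of the identity (not the actual adder logic):

```c
#include <stdint.h>

/* Model of an adder with both inputs inverted and two injected
   carry-ins: computes -b - c in two's complement. */
static uint64_t neg_b_minus_c(uint64_t b, uint64_t c)
{
    return ~b + ~c + 2;
}
```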
> Yeah...
>
> So, say, operations:
>   ADD / SUB     (Add, 64-bit)
>   ADDSL / SUBSL (Add, 32-bit, sign extend)   // nope
>   ADDUL / SUBUL (Add, 32-bit, zero extend)   // nope
>   AND
>   OR
>   XOR
>   CMPEQ                                      // 1 ICMP inst
>   CMPNE
>   CMPGT (CMPLT implicit)
>   CMPGE (CMPLE implicit)
>   CMPHI (unsigned GT)
>   CMPHS (unsigned GE)
>   ....
>
> Where, internally, compare works by performing a subtract and then
> producing a result based on some status bits (Z,C,S,O). As I see it,
> ideally these bits should not be exposed at the ISA level though
> (much pain and hair results from the existence of architecturally
> visible ALU status-flag bits).

I agree that these flags should not be exposed through ISA; and I did not.
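A C sketch of compares derived from internal subtract status bits (the
flag formulas are the standard ones; the names are mine):

```c
#include <stdint.h>

/* Internal-only status bits from r = a - b. */
typedef struct { int z, c, s, o; } flags_t;

static flags_t sub_flags(uint64_t a, uint64_t b)
{
    uint64_t r = a - b;
    flags_t f;
    f.z = (r == 0);                              /* zero            */
    f.c = (a < b);                               /* borrow          */
    f.s = (int64_t)r < 0;                        /* sign of result  */
    f.o = (int64_t)((a ^ b) & (a ^ r)) < 0;      /* signed overflow */
    return f;
}

/* CMPGT (signed >): !Z and S == O.  CMPHI (unsigned >): !C and !Z. */
static int cmp_gt(uint64_t a, uint64_t b)
{ flags_t f = sub_flags(a, b); return !f.z && (f.s == f.o); }

static int cmp_hi(uint64_t a, uint64_t b)
{ flags_t f = sub_flags(a, b); return !f.c && !f.z; }
```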
On the other hand multi-precision arithmetic demands at least carry {or
some other means which is even more powerful--such as CARRY.....}
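What "demands at least carry" means in practice: each limb of a
multi-precision add needs the carry out of the limb below. In portable
C that carry is recovered from an unsigned overflow test:

```c
#include <stdint.h>

/* 128-bit add from two 64-bit limbs; (r.lo < a.lo) recovers the carry
   that a carry instruction (or CARRY prefix) would provide directly. */
typedef struct { uint64_t lo, hi; } u128;

static u128 add128(u128 a, u128 b)
{
    u128 r;
    r.lo = a.lo + b.lo;
    r.hi = a.hi + b.hi + (r.lo < a.lo);   /* carry from the low half */
    return r;
}
```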
> Probably. Some other features could still be debated though, along
> with how much simplification could be possible. If I did a new design,
> I would probably still keep predication and jumbo prefixes.

I kept predication, but not the way most predication works.
My work on Mc 88120 and K9 taught me the futility of things in the
instruction stream that provide artificial boundaries. I have a
suspicion that if you had an FPGA capable of allowing you to build an
8-wide machine, you would do the jumbo stuff differently, too.
> Errm, assuming the compiler is capable of things like general-case
> inlining and loop-unrolling.
>
> Explicit bundling vs superscalar could be argued either way, as
> superscalar isn't as expensive as initially thought, but in a simpler
> form is comparably weak (the compiler has the advantage that it can
> invest more expensive analysis into this, reorder instructions, etc.;
> but this only goes as far as the compiler understands the CPU's
> pipeline, ties the code to a specific pipeline structure, and becomes
> effectively moot with OoO CPU designs).

Compilers are notoriously unable to outguess a good branch predictor.

> Granted, but OoO isn't cheap.

OoO exists, in a practical sense, to abstract the pipeline out of the
compiler; or conversely, to allow multiple implementations to run the
same compiled code optimally on each implementation.

For the sake of possible fancier OoO stuff, I upheld a basic
requirement for the instruction stream:

> So, a case could be made that a "general use" ISA be designed without
> the use of explicit bundling. In my case, using the bundle flags also
> requires the code to use an instruction to signal to the CPU what
> configuration of pipeline it expects to run on, with the CPU able to
> fall back to scalar (or superscalar) execution if it does not match.

Sounds like a bridge too far for your 8-wide GBOoO machine.
> As can be noted, my thing is more a "LIW" rather than a "true VLIW".
> For the most part, thus far nearly everything has ended up as
> "Mode 2", namely:
>
>   3 lanes;
>     Lane 1 does everything;
>     Lane 2 does basic ALU ops, Shift, Convert (CONV), ...
>     Lane 3 only does basic ALU ops and a few CONV ops and similar.
>       Lane 3 originally also did Shift, dropped to reduce cost.
>     Mem ops may eat Lane 3, ...

Try 6 lanes:
  1,2,3: Memory ops + integer ADD and Shifts
  4:     FADD ops + integer ADD and FMisc
  5:     FMAC ops + integer ADD
  6:     CMP-BR ops + integer ADD
> Where, say:
>
>   Mode 0 (Default):
>     Only scalar code is allowed; the CPU may use superscalar
>     (if available).
>   Mode 1:
>     2 lanes:
>       Lane 1 does everything;
>       Lane 2 does ALU, Shift, and CONV.
>     Mem ops take up both lanes.
>       Effectively scalar for Load/Store.
>     Later defined that 128-bit MOV.X is allowed on a Mode 1 core.

Modeless.
> If you mean that execution is mostly running end-to-end memory
> operations, yeah, this is basically true.
>
> I had defined wider modes, and ones that allow dual-lane I/O and FPU
> instructions, but these haven't seen use (too expensive to support in
> hardware).
>
> I had ended up with an ambiguous "extension" to the Mode 2 rules,
> allowing an FPU instruction to be executed from Lane 2 if there was
> not an FPU instruction in Lane 1, or allowing co-issuing certain FPU
> instructions if they effectively combine into a corresponding SIMD op.
>
> In my current configurations, there is only a single memory access
> port.

This should imply that your 3-wide pipeline is running at 90%-95%
memory/cache saturation.
> Possibly. A second memory access port would help with performance,
> but it is a comparably expensive feature (and doesn't help enough to
> justify its fairly steep cost).
>
> For lower-end cores, a case could be made for assuming a 1-wide CPU
> with a 2R1W register file, but designing the whole ISA around this
> limitation and not allowing for anything more is limiting (and mildly
> detrimental to performance). If we can assume cores with an FPU, we
> can probably also assume cores with more than two register read ports
> available.

If you design around the notion of a 3R1W register file, FMAC and INSERT
fall out of the encoding easily. Done right, one can switch it into a 4R
or 4W register file for ENTER and EXIT--lessening the overhead of
call/ret.
....