On 10/22/2024 4:13 PM, MitchAlsup1 wrote:
On Tue, 22 Oct 2024 18:43:40 +0000, BGB wrote:
On 10/22/2024 10:26 AM, Anton Ertl wrote:
>
Several things in this paragraph make no sense.
>
In particular, x86S is a proposal for a reduced version of the stuff
that current Intel and AMD CPUs support: There is full 64-bit support,
and 32-bit user-level support. x86S eliminates a part of the
compatibility path from systems of yesteryear, but not that many
people use these parts nowadays anyway. It's unclear to me what
benefits these changes are supposed to buy (unlike the elimination of
A32/T32 from some ARM chips, which obviously eliminates the whole
A32/T32 decoding path). It seems to me that most of the complexity of
current CPUs would still be there.
>
And I certainly prefer a CPU that has more capabilities to one that
has less capabilities. Sometimes I want to run old binaries.
>
So what would be my incentive as a user to buy an x86S CPU? Will they
sell them for less? I doubt it.
>
>
Yeah, basically my thoughts as well.
Business as usual...
>
The main effect it achieves is breaking legacy boot; it doesn't seem like it
would either save all that much or "solve" x86's longstanding issues.
Intel needs a better way to exit reset--and that means the MMU/TLBs
are already up and working at the time reset is exited. This cannot
be made backwards compatible.
-------------------------------
I am not sure how this would have much effect on cost either way.
A physical address mode could just be some edge case logic in the MMU (say, whenever there is a TLB miss with MMU disabled, it merely loads an identity mapped address into the TLB).
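A minimal sketch of the identity-mapping idea above, as a software model rather than hardware. The 4 KiB page size, the `Tlb` class, and the dict standing in for a page-table walk are all assumptions made for illustration:

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages (not stated in the post)


class Tlb:
    """Toy TLB model: one refill path, with an identity-map edge case."""

    def __init__(self, mmu_enabled, page_table=None):
        self.mmu_enabled = mmu_enabled
        self.page_table = page_table or {}  # vpn -> ppn, stand-in for a real walk
        self.entries = {}                   # vpn -> ppn, cached translations

    def translate(self, vaddr):
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        if vpn not in self.entries:         # TLB miss
            if self.mmu_enabled:
                # Normal refill from the (hypothetical) page table.
                self.entries[vpn] = self.page_table[vpn]
            else:
                # MMU disabled: load an identity-mapped entry instead.
                self.entries[vpn] = vpn
        return (self.entries[vpn] << PAGE_SHIFT) | offset
```

With the MMU off, `Tlb(False).translate(addr)` returns `addr` unchanged, and the rest of the lookup path does not need to know which mode it is in.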
>
*1: Probably, say (if I were designing the encoding):
[Rb+Disp10s] //32-bit encoding
[Rb+Ri*FixSc] //32-bit encoding
[Rb+Ri*Sc] //64-bit encoding
[Rb+Disp33s] //64-bit encoding
[Rb+Ri*Sc+Disp11s] //64-bit encoding
[Rb+Ri*Sc+Disp33s] //96-bit encoding
[Rb+DISP16] // 32-bit 16 > 10
[Rb+Ri<<sc] // 32-bit
[Rb+Ri<<sc+DISP32] // 64-bit 32 > 11
[Rb+Ri<<sc+DISP64] // 96-bit 64 > 33
One doesn't want to burn too much encoding space...
If the goal is to redesign x86 as a RISC-like ISA, one is likely going to need a lot of space for opcode bits.
This is partly why I was thinking 32 registers rather than 64, along with the smaller immediate fields.
Say, one possible encoding scheme would be to use a similar base format to RISC-V:
ZZZZZZZ-ttttt-mmmmm-ZZZ-nnnnn-YY-YYYY1 //32-bit op
ZZZZZZZ-ttttt-mmmmm-ZZZ-nnnnn-YY-YYYY0 //64/96-bit op
Then, say:
1/2 the 32-bit encoding space is 3R ops:
1/4 the 32-bit encoding space is 3RI ops:
Remaining 1/4 for Imm16 and JMP/JCC and similar.
Say, could burn a 24/25-bit chunk of encoding space on JMP/CALL/JCC
iiiiiii-iiiii-iiiii-iii-Zcccc-YY-YYYY1
Where:
cccc is like x86 Jcc condition code,
but maybe reuse P and NP for JMP and CALL.
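A rough sketch of how the branch layout above could be picked apart. It assumes the pattern is written MSB-first (so the 20 `i` bits are bits 31..12, `Z` is bit 11, `cccc` is bits 10..7), that the displacement is signed, and that P maps to JMP and NP to CALL in that order; none of these details are pinned down in the post:

```python
# x86 Jcc condition codes 0..15, with P (0xA) and NP (0xB) reused
# for JMP and CALL (the P->JMP / NP->CALL order is an assumption).
COND = ["O", "NO", "B", "AE", "E", "NE", "BE", "A",
        "S", "NS", "JMP", "CALL", "L", "GE", "LE", "G"]


def decode_branch(word):
    """Split iiiiiii-iiiii-iiiii-iii-Zcccc-YY-YYYY1 into its fields."""
    if word & 1 != 1:
        raise ValueError("bit 0 clear: not a 32-bit op in this scheme")
    disp = (word >> 12) & 0xFFFFF
    if disp & 0x80000:              # sign-extend the 20-bit field (assumed signed)
        disp -= 1 << 20
    return {
        "disp": disp,
        "z": (word >> 11) & 1,
        "cond": COND[(word >> 7) & 0xF],
        "op": word & 0x7F,          # YY-YYYY1 opcode bits
    }
```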
Though, might make sense to do CALL/RET using a link-register rather than the stack, even if x86 traditionally used the stack.
For 64-bit:
LD/ST/OPLD/OPST: [Rb+Disp10] expands to [Rb+Disp33s]
LD/ST/OPLD/OPST: [Rb+Ri*Sc] expands to [Rb+Ri*Sc+Disp11s] or Disp17s.
Remaining bits go to opcode.
Say:
ZZZZZZZ-ttttt-mmmmm-dss-nnnnn-YY-YYYY1 //MEM [Rm+Rt*Sc]
And:
iiiiiii-iiiii-iiiii-xxx-xxxxx-xx-xxxx0 -
ZZZZZZZ-ttttt-mmmmm-dss-nnnnn-YY-YYYY1 //MEM [Rm+Rt*Sc+Disp17s]
And:
iiiiiii-iiiii-iiiii-iii-iiiii-ii-iiii0 -
kkkkkkk-kkkkk-kkkkk-xxx-xxxxx-ii-xxxx0 -
ZZZZZZZ-ttttt-mmmmm-dss-nnnnn-YY-YYYY1 //MEM [Rm+Rt*Sc+Disp33s]
Could maybe use some of the extra bits encoding things like:
ADD.Q [Rb+Ri*Sc+Disp33s], Imm17s.
Or:
ADD.Q [Rb+Ri*Sc+Disp17s], Imm33s.
Say, by having a Rn/Imm bit, and a bit to specify which immediate is used as the constant and the other as the displacement.
But, with Disp10 base-forms, might expand to Disp33:
iiiiiii-iiiii-iiiii-xxx-iiiii-xx-xxxx0 -
iiiiiZZ-iiiii-mmmmm-dZi-nnnnn-YY-YYYY1 //MEM [Rm+Disp33s]
Where the 'd' flag could select between, say:
"ADD Rn, [Rm+Disp]" or "ADD [Rm+Disp], Rn"
32-bit encodings only allowing a register, whereas 64-bit encodings could allow an immediate.
But, not really sure...
In other news, went and wrote up a spec and threw together Verilog code for a reworked BSR4K/XG3 ISA design:
https://pastebin.com/yfrh50bk

There are still some holes (the spec is missing pretty much all the 2R ops for now), but alas. A few parts I have decided would not necessarily be carried over, as some newer instructions and the addition of a Zero Register made some amount of the former 2R and 2RI instructions no longer necessary (though, some could still be useful for efficiency; or have other useful roles like format conversion).
To make implementation cheaper/easier for me, it is essentially XG2RV with the bits shuffled around, a few inverted, and some special case changes (changes branch mechanics and some edge cases involving decoding immediate values).
Initially I tried putting the repacking logic at the front end of the ID stage, but (unsurprisingly), synthesis and timing weren't too happy about this...
Ended up instead putting the repack logic at the end of the IF stage.
There was another possible idea that I could call BSR4J:
Would have done a simpler repacking scheme:
First 16 bits are repacked:
NMOP-YwYY-nnnn-mmmm => NMOY-mmmm-nnnn-YYPw
High 16 bits copied unmodified.
So, overall instruction format, seen as 32-bits, could have been:
ZZZZ-qnmo-oooo-XXXX-NMOY-mmmm-nnnn-YYPw
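The BSR4J repack above is just a fixed bit permutation on the first halfword. A minimal sketch, assuming the patterns are written MSB-first, that the three Y bits keep their relative order, and that the first halfword is the low 16 bits of the 32-bit view (the function names are made up for illustration):

```python
def repack_first_halfword(h):
    """NMOP-YwYY-nnnn-mmmm => NMOY-mmmm-nnnn-YYPw (16-bit value)."""
    nmo = h & 0xE000            # N, M, O stay in bits 15..13
    p = (h >> 12) & 1
    y_hi = (h >> 11) & 1        # first Y bit
    w = (h >> 10) & 1
    y_lo = (h >> 8) & 3         # remaining two Y bits
    n = (h >> 4) & 0xF
    m = h & 0xF
    return nmo | (y_hi << 12) | (m << 8) | (n << 4) | (y_lo << 2) | (p << 1) | w


def unpack_first_halfword(h):
    """Inverse permutation: NMOY-mmmm-nnnn-YYPw => NMOP-YwYY-nnnn-mmmm."""
    nmo = h & 0xE000
    y_hi = (h >> 12) & 1
    m = (h >> 8) & 0xF
    n = (h >> 4) & 0xF
    y_lo = (h >> 2) & 3
    p = (h >> 1) & 1
    w = h & 1
    return nmo | (p << 12) | (y_hi << 11) | (w << 10) | (y_lo << 8) | (n << 4) | m


def repack_insn(word):
    """Repack the low halfword; the high 16 bits are copied unmodified."""
    return (word & 0xFFFF0000) | repack_first_halfword(word & 0xFFFF)
```

Since it is a pure permutation, it costs only wiring in hardware; the nibble-aligned fields are what keep it cheap to eyeball in a hexdump.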
But, it was admittedly more tempting, if I am going to be repacking anyways, to make an attempt to "un-dog-chew" the instruction format (in an attempt to make it look nicer).
It is not fully settled yet, could jump over to the BSR4J strategy instead if the more aggressive repacking scheme is in fact a bad idea. One arguable merit it does have is that all of the original 4-bit fields remain 4-bit aligned (and converting between XG2 and BSR4J would be significantly less bit-twiddling vs BSR4K; while still achieving the goal of being able to fit it into the same encoding space as RISC-V).
I have yet to decide on some specifics for the mapping of 2R instructions:
Simpler/cheaper: Use the same repacking as 3R ops for 2R ops;
Possible: Modify packing rules such that the 3rd part of the opcode field is also 4-bit aligned.
Say:
As-is : XXXX-kkWWWW-mmmmmm-ZZZZ-nnnnnn-QY-YYPw
Possible: XXXX-WWWWkk-mmmmmm-ZZZZ-nnnnnn-QY-YYPw
Would be slightly more logic complexity, but could make it easier to visually decode 2R instructions in a hexdump (but likely not worth the additional cost).
Expressing it as bits though makes it more obvious that I actually have less total encoding space than RISC-V, as the 6-bit register fields take their cut.
Say:
ZZZZZZZ-ttttt-mmmmm-ZZZ-nnnnn-YY-YYY11 (RV)
ZZZZ-oooooo-mmmmmm-ZZZZ-nnnnnn-QY-YYPw (XG3)
Each RV Y block has 10 bits of opcode.
Whereas, each XG3 Y block has 9 bits.
XG3 currently has 3 Y blocks reserved for 3R:
0/3/5 (~ 10.585 bits)
RV had 2 blocks for the core ISA (11 bits).
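The "bits" figures above are just log2 of the number of available opcode points. A quick sketch of the arithmetic (the helper name is made up):

```python
import math


def opcode_bits(blocks, bits_per_block):
    """Effective opcode bits for `blocks` Y blocks of 2**bits_per_block points each."""
    return math.log2(blocks * (1 << bits_per_block))


xg3_3r = opcode_bits(3, 9)    # XG3: 3 Y blocks x 9 bits  -> log2(1536)
rv_core = opcode_bits(2, 10)  # RV:  2 Y blocks x 10 bits -> log2(2048)
print(f"XG3 3R: {xg3_3r:.3f} bits, RV core: {rv_core:.3f} bits")
```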
Though, the B extension squanders a few big chunks of it by defining some 4R instructions (such as Funnel Shift).
Contrast, BJX2 doesn't define any 32-bit 4R instructions.
Also, B extension further weakens the case for not having a register indexed addressing mode: Any core implementing the B extension's FSR instruction is going to need a 3R capable register port on the GPRs ...
Ironically, it seems that just the 'V' and 'P' extensions both end up eating more opcode space than the total 3R opcode space in BJX2...
XG3 effectively ends up spending 1/4 of the total encoding space on Jumbo prefixes.
Where
Baseline: FE/FF, 25 bits total
Imm64 only possible via an Imm16 base op (24+24+16=64).
XG2: xxx1-111x, 28 bits total
XG3: x1-1zyy, ~ 29.585
Though, in XG2 the 28-bit jumbo prefixes do allow 27+27+10=64, for 3RI Imm64 ops (with the remaining bit for immediate-extension vs more general instruction extension).
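To make the 24+24+16 and 27+27+10 sums concrete, here is a small sketch that glues immediate fragments together. The ordering (first prefix holds the most significant bits) and the helper name are assumptions; the real field placement within each word is not modeled:

```python
def compose_imm(fields):
    """Concatenate (value, width) fragments, most significant first.

    Returns (immediate, total_width).
    """
    imm = 0
    total = 0
    for value, width in fields:
        if value >> width:
            raise ValueError(f"value {value:#x} does not fit in {width} bits")
        imm = (imm << width) | value
        total += width
    return imm, total


# Baseline: two jumbo prefixes (24 bits each) + an Imm16 base op.
baseline = compose_imm([(0xABCDEF, 24), (0x123456, 24), (0x789A, 16)])
# XG2: two jumbo prefixes (27 bits each) + a 10-bit field in a 3RI op.
xg2 = compose_imm([(0x7FFFFFF, 27), (0x7FFFFFF, 27), (0x3FF, 10)])
```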
Reason it expands in XG3 is that the Jumbo prefixes effectively also eat the former PrWEX spaces (XG3 loses WEX and PrWEX, would need to use superscalar instead).
The would-be PrWEX spaces could maybe be used for something else, but unclear what at the moment (likewise the FA/FB blocks are unused in XG2 and effectively N/A in XG2RV; as the role they served in Baseline has effectively "fallen out of the scope of the ISA"; and being N/E in XG3).
I guess a case could be made to reclaim these blocks in XG2 as a range of "Non-predicated Scalar-Only" instructions.
Could almost relocate branches here (and work towards reclaiming the space used by branches in the F0 block, to eventually be reassigned to 3R space, roughly worth 64 3R ops).
Say:
CCC1-101Z: FA/FB Imm25
ccc (inverse):
000: MOV Imm25, DLR //As-is, Original Role
001: BSR Disp25s
010: BT Disp25s
011: BF Disp25s
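A small sketch of how the table above could be decoded from the leading byte, reading `CCC1-101Z` MSB-first and taking "inverse" to mean the selector is the bitwise complement of CCC (so FA/FB, with CCC=111, keep their original role). The function name and the handling of the other four selector values are assumptions:

```python
# Selector (inverted CCC) -> operation, per the list above.
OPS = {
    0b000: "MOV Imm25, DLR",
    0b001: "BSR Disp25s",
    0b010: "BT Disp25s",
    0b011: "BF Disp25s",
}


def decode_fafb_byte(b):
    """Decode a CCC1-101Z leading byte; returns (op, z_bit) or None."""
    if (b & 0x1E) != 0x1A:          # bits 4..1 must match the fixed 1-101 pattern
        return None
    ccc = (~(b >> 5)) & 0b111       # inverted CCC field
    return OPS.get(ccc, "unassigned"), b & 1
```

For example, 0xFA and 0xFB both decode to the original MOV role, while 0xDA would land on BSR under these assumptions.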
Doesn't solve the issue for Baseline, as these would be N/E in Baseline.
Having the encoding scheme fragmenting into a tree of encoding sub-variants is getting kind of annoying though, may eventually need to prune the tree.
...