On 9/27/2024 10:52 AM, MitchAlsup1 wrote:
On Fri, 27 Sep 2024 9:46:01 +0000, BGB wrote:
Had recently been working on getting BGBCC to target RV64G.
>
Array Load/Store:
M66: 1 instruction
XG2: 1 instruction
RV64: 3 instructions
>
Yeah.
Array load/store not being 1 instruction is one of my major annoyances with RV64.
The mess of dealing with constants is another big annoyance.
Global Variable:
M66: 1 instruction (anywhere in 64-bit memory)
XG2: 1 instruction (if within 2K of GBR)
RV64: 1 or 4 instructions
>
Screwed this up slightly:
1-instruction, 2K of GP, is for RV64 (not XG2).
For BJX2:
Baseline is 4K or 8K (depending on operand size).
XG2 is 16K or 32K (depending on operand size).
For a simple 32-bit encoding.
With a jumbo prefix, it is still 1 instruction...
Technically, there is a ~ 2GB limit for the size of ".data"+".bss", but this is also a limit with PE/COFF; and isn't likely to be a big issue in practice.
Would need either to modify PE further or jump over to an ELF variant to support 64-bit RVAs.
But, if one wants to support large global uninitialized arrays (the main use case that is likely to exceed such a limit), could have the compiler silently turn them into statically-initialized "calloc()" calls.
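As a rough sketch of what that lowering could look like at the C level (an illustration only, not something BGBCC does at present): the big zero-initialized global becomes a pointer that gets filled in via calloc() before first use. The names `big_table`, `BIG_COUNT`, and `big_table_init` are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical element count, small here so the sketch stays cheap to run. */
#define BIG_COUNT ((size_t)1 << 20)

/* Source-level view would be:  int big_table[BIG_COUNT];
 * Lowered view: only a pointer lives in ".bss"; the storage is on the heap. */
int *big_table;

/* Assumed to be called by startup code before main(); returns 1 on success. */
int big_table_init(void)
{
    if (big_table == NULL)
        big_table = calloc(BIG_COUNT, sizeof *big_table);
    return big_table != NULL;
}
```

The contents still start out as all zeroes (same as ".bss"), but the array no longer counts against the section size limit.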
Well, never mind that BGBCC will currently break if the size of a section exceeds 8MB (due mostly to an issue with how it internally represents its base relocs). Fixing this is on an eventual TODO list.
Mostly it is due to say:
(31:28): Base Reloc Type
(27:23): Section Number
(22: 0): Section Offset
Which is then converted to the PE/COFF format:
Txxx:
T = Base Reloc Type
xxx = Offset within logical 4K page.
With an extension:
0000: NOP
0001..07FF: Advance current position by 1..2047 pages (8MB).
0800..0FFF: Reverse current position by -1..-2048 pages.
Though, the negative case isn't generally used, as the relocs are sorted by address. Doing it this way (vs individual sub-blocks for each page) can further compact the base reloc table.
Either way, already significantly more compact than ELF symbol and reloc tables.
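A minimal sketch of the two formats described above, mostly to show where the 8MB limit comes from (23 offset bits) and how the page-advance entries work. All function names are hypothetical; it assumes the current position starts at page 0, that real relocs use a non-zero type, that input is sorted by address, and that 0800..0FFF is read as a 12-bit two's complement page delta.

```c
#include <stddef.h>
#include <stdint.h>

/* Internal 32-bit form: (31:28) type, (27:23) section, (22:0) offset.
 * 23 offset bits is what limits a section to 8MB. */
uint32_t reloc_pack(unsigned type, unsigned sec, uint32_t ofs)
{
    return ((uint32_t)(type & 15) << 28) |
           ((uint32_t)(sec & 31) << 23) | (ofs & 0x7FFFFF);
}
unsigned reloc_section(uint32_t w) { return (w >> 23) & 31; }
uint32_t reloc_offset(uint32_t w)  { return w & 0x7FFFFF; }

/* Emit 16-bit Txxx entries for relocs sorted by RVA.
 * T=0 entries with xxx=001..7FF advance the current page. */
size_t reloc_emit(const uint32_t *rva, const uint8_t *type,
                  size_t n, uint16_t *out)
{
    uint32_t page = 0;
    size_t i, k = 0;
    for (i = 0; i < n; i++) {
        uint32_t pg = rva[i] >> 12;
        while (pg > page) {               /* may need several advances */
            uint32_t step = pg - page;
            if (step > 2047) step = 2047;
            out[k++] = (uint16_t)step;
            page += step;
        }
        out[k++] = (uint16_t)((type[i] << 12) | (rva[i] & 0xFFF));
    }
    return k;
}

/* Walk the entries back into RVAs; returns the number of relocs found. */
size_t reloc_decode(const uint16_t *in, size_t n, uint32_t *rva_out)
{
    uint32_t page = 0;
    size_t i, k = 0;
    for (i = 0; i < n; i++) {
        uint32_t t = in[i] >> 12, x = in[i] & 0xFFF;
        if (t != 0)
            rva_out[k++] = (page << 12) | x;
        else if (x >= 0x800)
            page -= 0x1000 - x;           /* assumed: 0FFF = -1, 0800 = -2048 */
        else
            page += x;                    /* x == 0 is the NOP case */
    }
    return k;
}
```

With relocs sorted by address, each page change costs one extra 16-bit entry (or a few, for gaps over 2047 pages), rather than a full sub-block header.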
Constant Load into register (not R5):
M66: 0 instructions
XG2: 1 instruction
RV64: ~ 1-6
>
Operator with 32-bit immediate:
M66: 1 instruction
BJX2: 1 instruction;
RV64: 3 instructions.
>
Operator with 64-bit immediate:
M66: 1 instruction
BJX2: 1 instruction;
RV64: 4-9 instructions.
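For reference, a sketch of the usual hi20/lo12 split that a 32-bit constant needs on RV64, which is presumably where the 3-instruction count comes from (LUI, ADDI/ADDIW, then the operator itself). Since the 12-bit immediate is sign-extended, the upper part has to be rounded up when bit 11 is set. The names `rv_hilo`, `rv_split_imm32`, and `rv_join_imm32` are made up for this example; the join models ADDIW (32-bit wrap, then sign-extend).

```c
#include <stdint.h>

typedef struct {
    uint32_t hi20;  /* immediate field for LUI */
    int32_t  lo12;  /* sign-extended immediate for ADDIW, -2048..2047 */
} rv_hilo;

/* Split a 32-bit constant into a LUI/ADDIW immediate pair. */
rv_hilo rv_split_imm32(int32_t x)
{
    rv_hilo r;
    r.lo12 = (int32_t)((uint32_t)x & 0xFFF);
    if (r.lo12 >= 0x800)
        r.lo12 -= 0x1000;               /* low part is sign-extended */
    /* Subtracting the (possibly negative) low part rounds the high part. */
    r.hi20 = (((uint32_t)x - (uint32_t)r.lo12) >> 12) & 0xFFFFF;
    return r;
}

/* Model of "LUI rd, hi20; ADDIW rd, rd, lo12" on RV64. */
int64_t rv_join_imm32(rv_hilo p)
{
    uint32_t w = (p.hi20 << 12) + (uint32_t)p.lo12;  /* wraps at 32 bits */
    return (int64_t)(int32_t)w;                      /* sign-extend to 64 */
}
```

Constants near 0x7FFFFFFF are the awkward case: the rounded-up LUI value goes negative, so the add needs to wrap at 32 bits to get back the intended result.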
>
>
>
Floating point is still a bit of a hack, as it is currently implemented
by shuffling values between GPRs and FPRs, but sorta works.
My 66000 has a common register file.
Same with BJX2 and XG2.
Not true with RISC-V though.
BGBCC currently assumes a common register file, and the original FPU code (from SH-4) has atrophied (and, more so, the RISC-V FPU is somewhat different from the SH-4 FPU; so not like stale SH-4 code would work effectively on RV64 anyways).
So, for now, BGBCC is assuming that the FPU works like the one in BJX2-Baseline (with 32 GPRs, and all of the FPU values in GPRs).
But, this is crappy on RV64:
FMV.D.X F0, Xs
FMV.D.X F1, Xt
FADD.D F3, F0, F1
FMV.X.D Xn, F3
Though, it works for the time-being, and is mostly N/A to Doom, which is nearly entirely integer code.
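In C terms, the sequence above amounts to something like the following sketch, where the GPRs only ever hold the raw Binary64 bit patterns and the FMV ops are pure bit moves. The function names are just for illustration.

```c
#include <stdint.h>
#include <string.h>

/* FMV.D.X: reinterpret 64 GPR bits as a Binary64 value (no conversion). */
double fmv_d_x(uint64_t bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* FMV.X.D: reinterpret a Binary64 value as 64 GPR bits. */
uint64_t fmv_x_d(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* One FADD with both operands and the result living in GPRs. */
uint64_t fadd_d_in_gprs(uint64_t xs, uint64_t xt)
{
    double f0 = fmv_d_x(xs);   /* FMV.D.X F0, Xs  */
    double f1 = fmv_d_x(xt);   /* FMV.D.X F1, Xt  */
    double f3 = f0 + f1;       /* FADD.D  F3, F0, F1 */
    return fmv_x_d(f3);        /* FMV.X.D Xn, F3  */
}
```

So, 3 of the 4 instructions are just moving bits between register files.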
Current thinking is to possibly have it as a funky sub-mode where logical R32..R63 is allowed but only if the value is a floating-point type.
For now, BGBCC also assumes that (like for BJX2) all scalar floating-point values are represented in registers in Binary64 form.
Meanwhile, the assembler (in RV64 mode) assumes that:
R0..R31 means X0..X31
R32..R63 means F0..F31
Or, basically the same idea as XG2RV Mode.
For now, this means loading a Binary32 from memory looks kinda like:
LW Xn, ...
// fake "FLDCF Rn, Rn"
FMV.D.X F0, Xn
FCVT.D.S F1, F0
FMV.X.D Xn, F1
But, could fake "FMOV.S" as:
FLW F0, ...
FCVT.D.S F1, F0
FMV.X.D Xn, F1
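What either sequence is meant to compute, as a C sketch (the function name is made up): read the 32 bits from memory as a Binary32, widen to Binary64, and hand back the 64-bit pattern as it would sit in a GPR.

```c
#include <stdint.h>
#include <string.h>

/* Load a Binary32 from memory and return it as Binary64 bits in a "GPR". */
uint64_t load_f32_as_f64_bits(const void *mem)
{
    float f;
    double d;
    uint64_t bits;
    memcpy(&f, mem, sizeof f);        /* FLW      F0, ...  */
    d = (double)f;                    /* FCVT.D.S F1, F0 (always exact) */
    memcpy(&bits, &d, sizeof bits);   /* FMV.X.D  Xn, F1   */
    return bits;
}
```

One caveat, as far as I understand the RISC-V spec: a Binary32 held in a 64-bit FPR is expected to be NaN-boxed (upper 32 bits all ones), so the LW + FMV.D.X variant would likely need FMV.W.X instead for FCVT.D.S to see the intended value.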
But, yeah...
Trying to target RV64 by having BGBCC pretend it is a crappier version of BJX2 probably isn't ideal, granted...
Did already run into a few cases where stuff was breaking because BGBCC dealt with BJX2 by pretending it was still BJX1 or SH-4; in ways that entirely broke for RISC-V (there being a few fundamental differences between them).
However, a few cases were more efficient with the "common case of BJX2 and RV" than the "stale code pretending it was still SH-4" cases...
Because, say, both SH-4 and RISC-V are bad in their own ways...
The most common issue here was that RV64G lacks any direct equivalent of the SuperH SR.T bit, and currently BGBCC makes no effort to fake it.
The XG3 idea was to demote it to optional, but in CoEx mode (likely the primary use-case that makes it worth the hassle), there would be no SR.T bit, ...
Though, an alternative would be to either figure out a scheme to shove all the predicated instructions into half the encoding space, or to only allow '?T' predication.
Well, or try to partly rework the encoding scheme to free up 1 bit of entropy.
One possibility could be that CoEx would deal with predication like:
* ZZoo-oooo-ZZmm-mmmm-ZZnn-nnnn-WZZZ-YT10
Where, YT:
* 00: OP?T YY=00|10 depending on W bit, W forced to 0.
* 01: OP?F (Same as above)
* 10: OP?T YY=11
* 11: OP?F YY=11
This isn't really a great scheme though, as it basically cuts off half of both the 3R and 3RI spaces.
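A quick decode sketch of the YT table above, purely to make the idea concrete. It assumes the layout string is written MSB first (so W is bit 7, Y is bit 3, T is bit 2, and the low two bits are 10), that W=0 selects YY=00 and W=1 selects YY=10 in the first two cases, and that W is passed through when YY=11. None of this is settled; the names are hypothetical.

```c
#include <stdint.h>

typedef struct {
    int      valid;       /* low two bits matched the "10" pattern */
    int      pred_false;  /* 0 = OP?T, 1 = OP?F */
    unsigned yy;          /* effective YY field */
    unsigned w;           /* effective W bit */
} coex_pred;

coex_pred coex_decode_pred(uint32_t insn)
{
    coex_pred r = { 0, 0, 0, 0 };
    unsigned yt, wbit;

    if ((insn & 3) != 2)
        return r;                   /* not in this encoding block */

    yt   = (insn >> 2) & 3;         /* Y = bit 3, T = bit 2 */
    wbit = (insn >> 7) & 1;

    r.valid = 1;
    r.pred_false = (int)(yt & 1);   /* 01 and 11 are the OP?F cases */
    if (yt & 2) {
        r.yy = 3;                   /* 10, 11: YY=11 */
        r.w  = wbit;                /* assumed: W passes through */
    } else {
        r.yy = wbit ? 2 : 0;        /* 00, 01: W picks YY=00 or YY=10 */
        r.w  = 0;                   /* W forced to 0 */
    }
    return r;
}
```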
Otherwise, the only other option is "short forward branches".
>
RV's selection of 3R compare ops is more limited:
RV: SLT, SLTU
BJX2: CMPEQ, CMPNE, CMPGT, CMPGE, CMPHI, CMPHS, TST, NTST
A lot of these cases require a multi-op sequence to implement with just
SLT and SLTU.
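To illustrate, here is a sketch of how several of these can be built from just SLT / SLTU (plus XOR / XORI), with C functions standing in for the instructions. The function names borrow the BJX2 mnemonics, but the operand order "a OP b" is an assumption for the example and TST / NTST are left out.

```c
#include <stdint.h>

/* The two RV64 3R compare primitives. */
uint64_t slt(int64_t a, int64_t b)    { return a < b; }
uint64_t sltu(uint64_t a, uint64_t b) { return a < b; }

/* Signed: a > b is just SLT with the operands swapped (1 op). */
uint64_t cmpgt(int64_t a, int64_t b)  { return slt(b, a); }

/* Signed: a >= b is !(a < b), so SLT + XORI 1 (2 ops). */
uint64_t cmpge(int64_t a, int64_t b)  { return slt(a, b) ^ 1; }

/* Unsigned: a > b, SLTU with the operands swapped (1 op). */
uint64_t cmphi(uint64_t a, uint64_t b) { return sltu(b, a); }

/* Unsigned: a >= b, SLTU + XORI 1 (2 ops). */
uint64_t cmphs(uint64_t a, uint64_t b) { return sltu(a, b) ^ 1; }

/* a == b: XOR, then "SLTIU rd, rs, 1" (true only for zero) (2 ops). */
uint64_t cmpeq(uint64_t a, uint64_t b) { return sltu(a ^ b, 1); }

/* a != b: XOR, then "SLTU rd, x0, rs" (true for non-zero) (2 ops). */
uint64_t cmpne(uint64_t a, uint64_t b) { return sltu(0, a ^ b); }
```

Only the "swap the arguments" cases stay at 1 instruction; anything needing an inverted result or an equality test costs an extra op.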
My 66000 can do: 1 < i && i <= MAX in 1 instruction
BJX2:
CMPQGT R4, 1, R16
CMPQLT R4, (MAX+1), R17 //*1
AND R16, R17, R5
So, more than 1 instruction, but less than faking it with SLT / SLTI ...
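For comparison, a sketch of the same range check written as the 3-op BJX2 sequence and as one possible SLTI-based RV64 sequence, with each C statement standing in for one instruction. It assumes CMPQGT / CMPQLT compare the first source against the second, that MAX+1 does not overflow, and that it fits the immediate field; the function names are made up.

```c
#include <stdint.h>

/* 1 < i && i <= max, following the BJX2 sequence (3 ops). */
int in_range_bjx2(int64_t i, int64_t max)
{
    int64_t r16 = (i > 1);          /* CMPQGT R4, 1, R16       */
    int64_t r17 = (i < max + 1);    /* CMPQLT R4, (MAX+1), R17 */
    return (int)(r16 & r17);        /* AND    R16, R17, R5     */
}

/* Same check using only SLTI plus fixups (4 ops). */
int in_range_rv64(int64_t i, int64_t max)
{
    int64_t t0 = (i < 2);           /* SLTI t0, i, 2                 */
    int64_t t1;
    t0 ^= 1;                        /* XORI t0, t0, 1  => (1 < i)    */
    t1 = (i < max + 1);             /* SLTI t1, i, MAX+1             */
    return (int)(t0 & t1);          /* AND  t0, t0, t1               */
}
```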
But, one already needs the logic to support the full range of comparisons for other cases, so it doesn't cost much on the logic side to also have them in the ISA (in effect, one is taking the output from the existing comparison ops and routing it to a register rather than updating the SR.T bit; and the various cases can be encoded by flipping the arguments or logically inverting the output bit).
The current idea for XG3 was to encode this based on whether or not the output is the zero register, which isn't too far from how it already works in the pipeline.
It is better for performance though to be able to flip the output bit in the pipeline than to need to use an XOR instruction or similar.
*1: The 32-bit encoding only encodes immediates up to 31; up to 256M can be encoded with a jumbo prefix. There is no Imm33 encoding at present for this (but, in theory, could twiddle the Imm29s encoding to Imm33s in XG2; using a similar trick to that used to get Imm57s to a full Imm64).
Though, one thing I have noted is that a lot of "RISC-V people" respond rather negatively if one says anything other than praise for the design of RISC-V, even if it is "not great" if the priority is performance...
>
....