On 9/27/2024 7:50 AM, Robert Finch wrote:
On 2024-09-27 5:46 a.m., BGB wrote:
Had recently been working on getting BGBCC to target RV64G.
>
>
So, for Doom, ".text" sizes at the moment:
BGBCC+XG2 : 292K (seems to have shrunk in all this)
BGBCC+RV64: 438K
GCC +RV64: 445K (PIE)
>
Doom Framerates:
BGBCC+XG2 : ~ 25-30
BGBCC+RV64: ~ 8-14
GCC +RV64: ~ 15-20
>
Start of E1M1 (framerate):
BGBCC+XG2 : ~ 25
BGBCC+RV64: ~ 12
GCC +RV64: ~ 16
>
>
How does RV64 compare to BGBCC+XG2? Is it trying to execute more than one op at a time? I assume XG2 is.
For XG2, it is potentially 3-wide (but, on-average bundle size is around 1.2 to 1.4).
For RV64, it is 2-wide in-order superscalar.
However, in the current BGBCC output, there is no instruction-shuffling, so the generated code has fairly low ILP (bundle ~ 1.05).
This is vs ~ 1.10 to 1.25 for GCC output.
Seemingly, RV64 also results in a higher amount of register RAW dependencies. It is comparably also more sensitive to ALU and memory-load latency (RV64 is affected much more than BJX2 by 1 vs 2 cycle latency on ADD instructions).
But, BJX2 does not spam the ADD instruction quite so hard, so it is more forgiving of latency. In this case, an optimization that reduces the common-case ADD to 1 cycle was being used (it only works in the CPU core if both operands are in signed 32-bit range and no overflow occurs; IIRC it optionally uses a sign-extended AGU output as a stopgap ALU output before the result arrives from the main ALU the next cycle).
Comparably, it appears BGBCC leans more heavily into ADD and SLLI than GCC does, with a fair chunk of the total instructions executed being these two (more cycles are spent adding and shifting than doing memory load or store...).
That seems to be a bit off. Mem ops are usually around 1/4 of instructions. Spending more than 25% on adds and shifts seems like a lot. Is it address calcs? Register loads of immediates?
It is both...
In BJX2, the dominant instruction tends to be memory Load.
Typical output from BGBCC for Doom is (at runtime):
~ 70% fixed-displacement;
~ 30% register-indexed.
Static output differs slightly:
~ 84% fixed-displacement;
~ 16% register-indexed.
RV64G lacks register-indexed addressing, only having fixed displacement.
If you need to do a register-indexed load in RV64:
SLLI X5, Xo, 2 //shift by size of index
ADD X5, Xm, X5 //add base and index
LW Xn, X5, 0 //do the load
This case is bad...
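As a sanity check, the effective address that 3-op sequence computes can be modeled in a few lines (a toy sketch of my own; `lower_indexed_lw` and the wraparound model are made-up names, not BGBCC code):

```python
MASK64 = (1 << 64) - 1  # model registers as 64-bit wrapping values

def lower_indexed_lw(base, index):
    # SLLI X5, Xo, 2   -- scale the index by the element size (4 bytes)
    x5 = (index << 2) & MASK64
    # ADD  X5, Xm, X5  -- add the array base
    x5 = (base + x5) & MASK64
    # LW   Xn, X5, 0   -- the load then uses this effective address
    return x5

print(hex(lower_indexed_lw(0x1000, 3)))  # same as base + 4*index
```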
Also global variables outside the 2kB window:
LUI X5, DispHi
ADDI X5, X5, DispLo
ADD X5, GP, X5
LW Xn, X5, 0
Where, sorting global variables by usage priority gives:
~ 35%: in range
~ 65%: not in range
Comparably, XG2 has a 16K or 32K reach here (depending on immediate size), which hits most of the global variables. The fallback Jumbo encoding hits the rest.
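For a rough model of the cost difference on the RV64 side, a tiny sketch (the function name and the interpretation of the 12-bit reach as a signed ±2KB window are mine):

```python
def rv64_global_insns(disp):
    """Instructions to access a global at signed GP-relative offset `disp`."""
    if -2048 <= disp <= 2047:
        return 1          # LW Xn, GP, disp
    return 4              # LUI + ADDI + ADD + LW, as in the sequence above

def avg_insns(in_range=0.35):
    # Per the ~35%/65% split above, the weighted average cost per access
    return in_range * 1 + (1 - in_range) * 4
```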
Theoretically, could save 1 instruction here, but would need to add two more reloc types to allow for:
LUI, ADD, Lx
LUI, ADD, Sx
Because annoyingly Load and Store have different displacement encodings; and I still need the base form for other cases.
More compact way to load/store global variables would be to use absolute 32-bit or PC relative:
LUI + Lx/Sx : Abs32
AUIPC + Lx/Sx : PC-Rel32
BGBCC does not use these as they are incompatible with TestKern as it exists. Well, PC-Rel can be made to work, but requires loading a separate copy of each EXE or DLL for every process instance when using a single address space (which is inefficient).
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64 (there does seem to be some interest for ELF FDPIC but limited to 32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off from PBO (namely, using GP for a global section and then chaining the sections for each binary). Main difference being that FDPIC uses fat function pointers and does the GP reload on the caller, vs PBO where I use narrow function pointers and do the reload on the callee (with load-time fixups for the PBO Offset).
Similar for fixed displacement Load/Store greater than 2K (though, only ~ 0.05%):
LUI X5, DispHi
ADDI X5, Xm, DispLo
LW Xn, X5, 0
Constant loading:
32-bit (majority):
sssssxxx: ADDI Xn, X0, Imm
xxxxx000: LUI Xn, ImmHi
00000000..7FFFF7FF:
LUI Xn, ImmHi
ADDI Xn, Xn, ImmLo
7FFFF800..7FFFFFFF:
LUI Xn, ImmHi
XORI Xn, Xn, ImmLo
80000000..FFFFFFFF (sign extended):
LUI Xn, ImmHi
ADDI Xn, Xn, ImmLo
80000000..FFFFFFFF (zero extended):
LUI Xn, ImmHi
XORI Xn, Xn, ImmLo
But, only if ImmLo is 0x800..0xFFF ...
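The 32-bit cases above can be modeled directly. Below is a sketch (my own helper names, not BGBCC code) that picks LUI+ADDI where possible and falls back to LUI+XORI for values near the signed-overflow point:

```python
def sext(v, bits):
    """Sign-extend the low `bits` bits of v."""
    v &= (1 << bits) - 1
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

def lui(hi20):
    # RV64 LUI: imm20 << 12, then sign-extended from 32 to 64 bits
    return sext(hi20 << 12, 32)

def load_const32(target):
    """Materialize signed-64-bit `target` in two ops, if possible.
    Returns ('addi'|'xori', hi20, lo12), or None if a longer sequence
    would be needed."""
    lo = target & 0xFFF
    # LUI+ADDI: bump hi so it absorbs the borrow when lo sign-extends
    hi = ((target - sext(lo, 12)) >> 12) & 0xFFFFF
    if lui(hi) + sext(lo, 12) == target:
        return ('addi', hi, lo)
    # LUI+XORI: XOR with a sign-extended imm can flip all the upper bits,
    # which catches values near the signed-overflow point (e.g. 0x7FFFFFFF)
    lui_val = target ^ sext(lo, 12)
    if lui_val & 0xFFF == 0 and sext(lui_val, 32) == lui_val:
        return ('xori', (lui_val >> 12) & 0xFFFFF, lo)
    return None
```

Here `load_const32(0x7FFFFFFF)` fails the ADDI check (LUI 0x80000 sign-extends negative on RV64) and lands on the XORI form instead.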
64-bit (not R5):
xxxxx000_00000000:
LUI Xn, ImmHi
SLLI Xn, Xn, 32
xxxxxyyy_00000000:
LUI Xn, ImmX
ADDI Xn, Xn, ImmY
SLLI Xn, Xn, 32
xxxxxyyy_zzzzz000:
LUI X5, ImmX
ADDI X5, X5, ImmY
SLLI X5, X5, 32
LUI Xn, ImmZ
ADD Xn, Xn, X5
xxxxxyyy_zzzzzwww:
LUI X5, ImmX
ADDI X5, X5, ImmY
SLLI X5, X5, 32
LUI Xn, ImmZ
ADDI Xn, Xn, ImmW
ADD Xn, Xn, X5
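The general xxxxxyyy_zzzzzwww sequence can also be checked by simulation. A hedged sketch (invented names, not BGBCC's actual logic) that splits a 64-bit value into the two LUI+ADDI halves joined by SLLI+ADD:

```python
MASK64 = (1 << 64) - 1

def sext(v, bits):
    v &= (1 << bits) - 1
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

def split_hi_lo(v32s):
    """20/12-bit split for a LUI+ADDI pair (signed 32-bit input)."""
    lo = v32s & 0xFFF
    hi = ((v32s - sext(lo, 12)) >> 12) & 0xFFFFF
    return hi, lo

def load_const64(target):
    """Model the LUI/ADDI/SLLI/LUI/ADDI/ADD sequence above for a
    64-bit `target` given as an unsigned Python int."""
    z, w = split_hi_lo(sext(target, 32))
    # LUI Xn, ImmZ ; ADDI Xn, Xn, ImmW  -> low half, sign-extended
    xn = (sext(z << 12, 32) + sext(w, 12)) & MASK64
    xn_s = xn - (1 << 64) if xn >> 63 else xn
    hi32 = ((target - xn_s) >> 32) & 0xFFFFFFFF  # compensate for that sign
    x, y = split_hi_lo(sext(hi32, 32))
    # LUI X5, ImmX ; ADDI X5, X5, ImmY ; SLLI X5, X5, 32
    x5 = ((sext(x << 12, 32) + sext(y, 12)) << 32) & MASK64
    return (xn + x5) & MASK64                    # ADD Xn, Xn, X5
```

Note the high half has to compensate for the low half sign-extending, which is why the split isn't just a straight 32/32 cut.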
One also needs to use shifts to sign or zero-extend things.
EXTS.W maps to:
SLLI Xn, Xm, 48
SRAI Xn, Xn, 48
EXTU.W maps to:
SLLI Xn, Xm, 48
SRLI Xn, Xn, 48
EXTU.L maps to:
SLLI Xn, Xm, 32
SRLI Xn, Xn, 32
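The shift-pair extensions above are easy to sanity-check with a quick model of the RV64 shifts (helper names are my own):

```python
MASK64 = (1 << 64) - 1

def slli(v, n): return (v << n) & MASK64         # shift left logical
def srli(v, n): return (v & MASK64) >> n         # shift right logical
def srai(v, n):                                  # shift right arithmetic
    v &= MASK64
    return ((v - (1 << 64)) >> n) & MASK64 if v >> 63 else v >> n

def exts_w(v): return srai(slli(v, 48), 48)  # EXTS.W: sign-extend low 16
def extu_w(v): return srli(slli(v, 48), 48)  # EXTU.W: zero-extend low 16
def extu_l(v): return srli(slli(v, 32), 32)  # EXTU.L: zero-extend low 32
```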
...
The result of all this is a whole lot of Shifts and ADDs.
Seemingly, even more for BGBCC than for GCC, which already had a lot of shifts and adds.
BGBCC basically entirely dethrones the Load and Store ops ...
Possibly more so than GCC, which tended to turn most constant loads into memory loads. It would load a table of constants into a register and then pull constants from the table, rather than compose them inline.
Say, something like:
AUIPC X18, DispHi
ADDI X18, X18, DispLo
(X18 now holds a table of constants, pointing into .rodata)
And, when it needs a constant:
LW Xn, X18, Disp //offset of the constant it wants.
Or:
LD Xn, X18, Disp //64-bit constant
Currently, BGBCC does not use this strategy.
Though, for 64-bit constants it could be more compact and faster.
But, better still would be having Jumbo prefixes or similar, or even a SHORI instruction.
Say, 64-bit constant-load in SH-5 or similar:
xxxxyyyyzzzzwwww
MOV ImmX, Rn
SHORI ImmY, Rn
SHORI ImmZ, Rn
SHORI ImmW, Rn
Where, one loads the constant in 16-bit chunks.
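Modeled in a few lines (my own sketch; `shori` here is just `(r << 16) | imm16` with 64-bit wraparound, per the SH-5-style description above):

```python
MASK64 = (1 << 64) - 1

def shori(r, imm16):
    # SHORI: shift the register left 16 and OR in a 16-bit immediate
    return ((r << 16) | (imm16 & 0xFFFF)) & MASK64

def shori_load64(x, y, z, w):
    r = x - 0x10000 if x & 0x8000 else x   # MOV ImmX, Rn (sign-extends)
    r &= MASK64
    for chunk in (y, z, w):                # SHORI ImmY/Z/W, Rn
        r = shori(r, chunk)
    return r
```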
FWIW: BJX2 had used this strategy (prior to adding jumbo prefixes) but had called this instruction LDSH instead (but BGBCC also generally accepts SHORI, and for the XG3 idea decided to go over to calling it SHORI as well).
Comparably, this needs less encoding space vs LUI, and is also more versatile.
But, XG3RV (AKA: CoEx mode) would effectively have access to both:
Could pull LUI from RV64 or SHORI from XG3.
With jumbo prefixes, SHORI isn't needed as often, but still sometimes useful (or, if for whatever reason one doesn't want to use jumbo prefixes in a given situation; or the code just so happened to express something like "i=(i<<16)|0x5555;" or similar).
Though, had also considered possibly sneaking SHORI into the RV64 encoding space as well by reusing the would-be (but unused) ORIW and XORIW encodings for SHORI+FLDCH and a 17-bit constant load (essentially gluing the Rs1 field onto the immediate to give a 17-bit sign-extended immediate).
But, for now, BGBCC is trying to generate "proper" RV64...
Even if the situation is "not very good"...
>
Array Load/Store:
XG2: 1 instruction
RV64: 3 instructions
>
Global Variable:
XG2: 1 instruction (if within 2K of GBR)
RV64: 1 or 4 instructions
>
Constant Load into register (not R5):
XG2: 1 instruction
RV64: ~ 1-6
>
Operator with 32-bit immediate:
BJX2: 1 instruction;
RV64: 3 instructions.
>
Operator with 64-bit immediate:
BJX2: 1 instruction;
RV64: 4-9 instructions.
>
>
Observations (RV64):
LUI+ADD can't actually represent all possible 32-bit constants.
Those near the signed-overflow point can't be expressed directly.
LUI+XOR can get a lot of these cases.
0x80000000ULL .. 0xFFFFFFFFULL can be partly covered by LUI+XOR.
>
For full 64-bit constants, generally need:
LUI+ADD+LUI+ADD+SLLI+ADD
And, two registers.
>
There is currently an ugly edge case where BGBCC has to fall back to:
LUI X5, ImmHi
ADDI X5, X5, ImmMi
( SLLI X5, X5, 12; ADD X5, X5, ImmFrag )+
>
Namely when needing to load a 64-bit constant and R5 is the only register.
>
So, if the compiler tries to emit, say:
AND R18, 0x7F7F7F7F7F7F7F7F, R10
One may end up with, say:
LUI X5, 0x7F7F
ADDI X5, X5, 0x7F8
SLLI X5, X5, 12
ADDI X5, X5, 0xF7F
SLLI X5, X5, 12
ADDI X5, X5, 0x7F8
SLLI X5, X5, 12
ADDI X5, X5, 0xF7F
AND X10, X18, X5
>
Which, granted, kinda sucks...
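For what it's worth, that ugly sequence does compute the right value; a quick simulation under RV64 semantics (`sext` and the wraparound model are my own) confirms it:

```python
MASK64 = (1 << 64) - 1

def sext(v, bits):
    v &= (1 << bits) - 1
    return v - (1 << bits) if v & (1 << (bits - 1)) else v

x5 = sext(0x7F7F << 12, 32)              # LUI  X5, 0x7F7F
x5 = (x5 + sext(0x7F8, 12)) & MASK64     # ADDI X5, X5, 0x7F8
for imm in (0xF7F, 0x7F8, 0xF7F):
    x5 = (x5 << 12) & MASK64             # SLLI X5, X5, 12
    x5 = (x5 + sext(imm, 12)) & MASK64   # ADDI X5, X5, imm
print(hex(x5))                           # the AND mask from above
```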
>
This is partly because BGBCC's code generation currently assumes it can just emit whatever here and the assembler will sort it out.
>
But, this case comes up rarely.
In BJX2, 33 bit cases would be handled by Jumbo prefixes, and generally 64-bit cases by loading the value into R0.
>
In RV64, this is needed for anything that doesn't fit in 12 bits; with X5 taking on the role for scratch constants and similar.
>
...
>
Floating point is still a bit of a hack, as it is currently implemented by shuffling values between GPRs and FPRs, but sorta works.
>
>
RV's selection of 3R compare ops is more limited:
RV: SLT, SLTU
BJX2: CMPEQ, CMPNE, CMPGT, CMPGE, CMPHI, CMPHS, TST, NTST
A lot of these cases require a multi-op sequence to implement with just SLT and SLTU.
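As an illustration, several of those compares come out as SLT/SLTU plus one extra op; a sketch (function names invented, each result modeled as a 0/1 flag):

```python
M = (1 << 64) - 1   # 64-bit register mask

def slt(a, b):  return 1 if a < b else 0              # signed compare
def sltu(a, b): return 1 if (a & M) < (b & M) else 0  # unsigned compare

def cmpgt(a, b): return slt(b, a)             # SLT with operands swapped
def cmpge(a, b): return slt(a, b) ^ 1         # SLT   + XORI  t, t, 1
def cmpeq(a, b): return sltu((a - b) & M, 1)  # SUB   + SLTIU t, t, 1
def cmpne(a, b): return sltu(0, (a - b) & M)  # SUB   + SLTU  t, X0, t
```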
>
>
Doom isn't quite working correctly yet with BGBCC+RV64 (still has some significant bugs), but in general game logic and rendering now seems to be working.
>
>
But, yeah, generating code for RV is more of a pain as the compiler has to work harder to try to express what it wants to do in the instructions that are available.
>
>
But, yeah, it is what it is...
>
I sort of needed RV64 support for some possible later experiments (the hybrid XG3-CoEx ISA idea would depend on having working RV64 support as a prerequisite).
>
...
>