Had recently been working on getting BGBCC to target RV64G.
How does RV64 compare to BGBCC+XG2? Is it trying to execute more than one op at a time? I assume XG2 is.
So, for Doom, ".text" sizes at the moment:
BGBCC+XG2 : 292K (seems to have shrunk in all this)
BGBCC+RV64: 438K
GCC +RV64: 445K (PIE)
Doom Framerates:
BGBCC+XG2 : ~ 25-30
BGBCC+RV64: ~ 8-14
GCC +RV64: ~ 15-20
Start of E1M1 (framerate):
BGBCC+XG2 : ~ 25
BGBCC+RV64: ~ 12
GCC +RV64: ~ 16
By comparison, it appears BGBCC leans more heavily into ADD and SLLI than GCC does, with a fair chunk of the total instructions executed being these two (more cycles are spent adding and shifting than doing memory load or store...).
That seems to be a bit off. Mem ops are usually around 1/4 of instructions. Spending more than 25% on adds and shifts seems like a lot. Is it address calcs? Register loads of immediates?
Array Load/Store:
XG2: 1 instruction
RV64: 3 instructions
Global Variable:
XG2: 1 instruction (if within 2K of GBR)
RV64: 1 or 4 instructions
Constant Load into register (not R5):
XG2: 1 instruction
RV64: ~ 1-6
Operator with 32-bit immediate:
BJX2: 1 instruction
RV64: 3 instructions
Operator with 64-bit immediate:
BJX2: 1 instruction
RV64: 4-9 instructions
Observations (RV64):
LUI+ADD can't actually represent all possible 32-bit constants.
Those near the signed-overflow point can't be expressed directly.
LUI+XOR can get a lot of these cases.
0x80000000ULL .. 0xFFFFFFFFULL can be partly covered by LUI+XOR.
For full 64-bit constants, generally need:
LUI+ADD+LUI+ADD+SLLI+ADD
And, two registers.
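As a rough illustration of the 32-bit observations above, here is a minimal Python sketch that models RV64 LUI, ADDI and XORI on 64-bit registers and searches for a two-instruction encoding of a given constant. The helper names (lui_addi_for, lui_xori_for) are made up for this example and are not BGBCC functions. It only looks at the window just below the signed-overflow point (0x7FFFF800 .. 0x7FFFFFFF) and does not attempt the zero-extended 0x80000000ULL .. 0xFFFFFFFFULL range.

```python
MASK64 = (1 << 64) - 1

def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def lui(imm20):
    # RV64 LUI: the 32-bit result (imm20 << 12) is sign-extended to 64 bits.
    return sext((imm20 & 0xFFFFF) << 12, 32) & MASK64

def addi(rs, imm12):
    return (rs + sext(imm12, 12)) & MASK64

def xori(rs, imm12):
    return (rs ^ sext(imm12, 12)) & MASK64

def lui_addi_for(target):
    """Return (imm20, imm12) if LUI+ADDI yields target, else None (hypothetical helper)."""
    lo = sext(target, 12)
    hi = ((target - lo) >> 12) & 0xFFFFF
    if addi(lui(hi), lo) == target & MASK64:
        return hi, lo & 0xFFF
    return None

def lui_xori_for(target):
    """Return (imm20, imm12) if LUI+XORI yields target, else None (hypothetical helper)."""
    lo = sext(target, 12)
    # A negative XORI immediate flips bits 63:12, so pre-invert them in the LUI part.
    flip = -1 if lo < 0 else 0
    hi = ((target >> 12) ^ flip) & 0xFFFFF
    if xori(lui(hi), lo) == target & MASK64:
        return hi, lo & 0xFFF
    return None

if __name__ == "__main__":
    for value in (0x12345678, 0x7FFFF7FF, 0x7FFFF800, 0x7FFFFFFF, -1):
        print(hex(value & MASK64), lui_addi_for(value), lui_xori_for(value))
```

In this model, LUI+ADDI fails across that window because the rounded-up LUI part lands on 0x80000 and gets sign-extended, while LUI 0x80000 followed by XORI with a negative immediate clears the upper bits again.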
There is currently an ugly edge case where BGBCC has to fall back to:
LUI X5, ImmHi
ADDI X5, X5, ImmMi
( SLLI X5, X5, 12; ADDI X5, X5, ImmFrag )+
Namely when needing to load a 64-bit constant and R5 is the only register.
So, if the compiler tries to emit, say:
AND R18, 0x7F7F7F7F7F7F7F7F, R10
One may end up with, say:
LUI X5, 0x7F7F
ADDI X5, X5, 0x7F8
SLLI X5, X5, 12
ADDI X5, X5, 0xF7F
SLLI X5, X5, 12
ADDI X5, X5, 0x7F8
SLLI X5, X5, 12
ADDI X5, X5, 0xF7F
AND X10, X18, X5
Which, granted, kinda sucks...
This is partly because BGBCC's code generation currently assumes it can just emit whatever here and the assembler will sort it out.
But, this case comes up rarely.
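The single-register fallback above can be sketched in Python as follows. This is only a model of the idea, not BGBCC's actual code: split_const, emit_single_reg and simulate are hypothetical names, and the splitter always emits the full LUI, ADDI, (SLLI, ADDI) x3 pattern without trying to drop zero fragments. Since ADDI sign-extends its 12-bit immediate, each fragment is taken as a signed value and the borrow is pushed into the next-higher fragment, which is why 0x7F8 and 0xF7F show up instead of plain 0x7F7 / 0xF7F slices.

```python
MASK64 = (1 << 64) - 1

def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def split_const(value):
    """Split a 64-bit constant into (imm20, [four imm12]), highest fragment first."""
    value = sext(value, 64)
    frags = []
    for _ in range(4):
        lo = sext(value, 12)        # signed fragment, as ADDI will sign-extend it
        frags.append(lo & 0xFFF)
        value = (value - lo) >> 12  # strip the fragment; the borrow moves upward
    return value & 0xFFFFF, frags[::-1]

def emit_single_reg(value):
    """Build the op list LUI; ADDI; (SLLI 12; ADDI)+ for one scratch register."""
    hi, frags = split_const(value)
    ops = [("LUI", hi), ("ADDI", frags[0])]
    for frag in frags[1:]:
        ops += [("SLLI", 12), ("ADDI", frag)]
    return ops

def simulate(ops):
    """Run the op list on a single modeled 64-bit register and return its value."""
    x = 0
    for op, imm in ops:
        if op == "LUI":
            x = sext(imm << 12, 32)  # RV64 LUI sign-extends the 32-bit result
        elif op == "ADDI":
            x += sext(imm, 12)
        elif op == "SLLI":
            x <<= imm
        x &= MASK64
    return x

if __name__ == "__main__":
    ops = emit_single_reg(0x7F7F7F7F7F7F7F7F)
    for op, imm in ops:
        print(f"{op:5s} X5, {imm:#x}")
    print(hex(simulate(ops)))
```

For 0x7F7F7F7F7F7F7F7F this reproduces the same immediates as the listing above (0x7F7F, then 0x7F8, 0xF7F, 0x7F8, 0xF7F).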
In BJX2, 33-bit cases would be handled by Jumbo prefixes, and generally 64-bit cases by loading the value into R0.
In RV64, the same approach is needed for anything that doesn't fit in 12 bits, with X5 taking on the role of the scratch register for constants and similar.
...
Floating point is still a bit of a hack, as it is currently implemented by shuffling values between GPRs and FPRs, but sorta works.
RV's selection of 3R compare ops is more limited:
RV: SLT, SLTU
BJX2: CMPEQ, CMPNE, CMPGT, CMPGE, CMPHI, CMPHS, TST, NTST
A lot of these cases require a multi-op sequence to implement with just SLT and SLTU.
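To make the multi-op sequences concrete, here is a small Python model of how the missing compares can be built from SLT/SLTU plus XOR/XORI/SLTIU. It assumes the usual SH-style meanings for the BJX2 names (CMPGT/CMPGE signed, CMPHI/CMPHS unsigned), which is my reading and not something stated above. TST/NTST are left out since their exact semantics would be a guess. These are common RISC-V idioms, not necessarily the sequences BGBCC emits.

```python
MASK64 = (1 << 64) - 1

def to_signed(x):
    """Interpret a 64-bit register value as a signed integer."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x

# Modeled RV64 base ops, each returning a 64-bit register value.
def slt(a, b):    return int(to_signed(a) < to_signed(b))
def sltu(a, b):   return int((a & MASK64) < (b & MASK64))
def sltiu(a, i):  return int((a & MASK64) < i)   # only small non-negative i used here
def xor(a, b):    return (a ^ b) & MASK64
def xori(a, i):   return (a ^ i) & MASK64        # only small non-negative i used here

# Synthesized 3R compares; the comment gives the instruction sequence.
def cmpeq(a, b):  return sltiu(xor(a, b), 1)     # XOR t,a,b ; SLTIU rd,t,1   (2 ops)
def cmpne(a, b):  return sltu(0, xor(a, b))      # XOR t,a,b ; SLTU rd,x0,t   (2 ops)
def cmpgt(a, b):  return slt(b, a)               # SLT rd,b,a                 (1 op)
def cmpge(a, b):  return xori(slt(a, b), 1)      # SLT t,a,b ; XORI rd,t,1    (2 ops)
def cmphi(a, b):  return sltu(b, a)              # SLTU rd,b,a                (1 op)
def cmphs(a, b):  return xori(sltu(a, b), 1)     # SLTU t,a,b ; XORI rd,t,1   (2 ops)

if __name__ == "__main__":
    a, b = MASK64, 1   # a is -1 when read as signed
    print(cmpeq(a, b), cmpne(a, b), cmpgt(a, b), cmpge(a, b), cmphi(a, b), cmphs(a, b))
```

So only the strict greater-than forms map to a single instruction (by swapping operands); the equality and greater-or-equal forms cost two each in this model.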
Doom isn't quite working correctly yet with BGBCC+RV64 (still has some significant bugs), but in general game logic and rendering now seem to be working.
But, yeah, generating code for RV is more of a pain as the compiler has to work harder to try to express what it wants to do in the instructions that are available.
But, yeah, it is what it is...
I sort of needed RV64 support for some possible later experiments (the idea for the hybrid XG3-CoEx ISA would depend on having working RV64 support as a prerequisite).
...