On 9/27/2024 2:40 PM, MitchAlsup1 wrote:
On Fri, 27 Sep 2024 18:26:28 +0000, BGB wrote:
On 9/27/2024 7:50 AM, Robert Finch wrote:
On 2024-09-27 5:46 a.m., BGB wrote:
---------
>
But, BJX2 does not spam the ADD instruction quite so hard, so it is more
forgiving of latency. In this case, an optimization that reduces the
common-case ADD to 1 cycle was being used (it only works in the CPU core
if both operands are in signed 32-bit range and no overflow occurs; IIRC
it optionally uses a sign-extended AGU output as a stopgap ALU output
until the result arrives from the main ALU the next cycle).
>
RISC-V group opinion is that "we have done nothing to damage pipeline
operating frequency". {{Except the moving of register specifier fields
between 32-bit and 16-bit instructions; except for: AGEN-RAM-CMP-ALIGN
in 2 cycles, and several others...}}
It would be better for timing/frequency if the ISA tolerated 2-cycle ADD and Shift latency, but that requires not spamming ADD and SHIFT at every opportunity...
But, unless one has 1-cycle ADD and SHIFT, RV takes a bigger hit...
Ironically, the relative performance hit from BJX2 code was lower with 2-cycle ADD and Shift (as was for 3 vs 2 cycle Load).
Better still if one can have 4-6 instructions between producing a value and using it as an input to another instruction (mostly not true of RV64 in these cases, as it seems to suffer more from RAW dependencies).
Though, this situation might improve if I get around to implementing instruction-shuffling for RV64. For BJX2 code, the shuffler tries to push dependent instructions further apart so that they can be bundled; this would also help with how I had implemented superscalar, as well as generally reducing register RAW dependencies.
But, generally this does still impose limits:
Can't reorder instructions across a label;
Can't move instructions with an associated reloc;
Can't reorder memory instructions unless they can be proven to not alias (loads may be freely reordered, but the relative order of loads and stores may not unless provably non-aliasing);
...
The effectiveness of this does depend on how the C code is written though (works favorably with larger blocks of mostly-independent expressions).
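As a rough illustration (hypothetical C, not taken from Doom or my actual test code), a block of mostly-independent expressions gives the shuffler room to interleave work, whereas a single accumulator forms one long RAW chain with nothing to put between the ADDs:

// Hypothetical: four independent partial sums; the ADDs can be
// interleaved so no result is needed on the very next instruction.
int sum4(int *a, int *b, int *c, int *d, int n)
{
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i;
    for (i = 0; i < n; i++)
    {
        s0 += a[i];
        s1 += b[i];
        s2 += c[i];
        s3 += d[i];
    }
    return (s0 + s1) + (s2 + s3);
}

// By contrast, a single running sum is one long RAW chain:
// each ADD consumes the previous ADD's result immediately.
int sum1(int *a, int n)
{
    int s = 0, i;
    for (i = 0; i < n; i++)
        s += a[i];
    return s;
}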
Though, I suspect that the 2 vs 3 cycle load issue would apply more to GCC output than BGBCC output, since this was mostly affecting loading ALU constants from memory.
But, LUI+ADD isn't much better either; though, in its favor, it does not take a hit from a 3-cycle memory load.
>
>
Comparably, it appears BGBCC leans more heavily into ADD and SLLI than
GCC does, with a fair chunk of the total instructions executed being
these two (more cycles are spent adding and shifting than doing memory
load or store...).
>
That seems to be a bit off. Mem ops are usually around 1/4 of
Most agree it is closer to 30% than 25% {{Unless you clutter up the ISA
such that your typical memref needs a support instruction.
Cough, RV64...}}
Though, things may improve slightly if I add the 'Zba' instructions.
I had wanted to add these previously, but couldn't figure out a target string to configure GCC with Zba enabled.
It seems like the GCC build process only recognizes a finite list of known configurations, and "RV64GZba" did not seem to be in the list.
Though, "-march=..." seems to be more flexible (but, in past attempts, GCC doesn't seem to recognize the 'Zba' extension name).
Well, similar reasons to why I ended up with the 'F'/'D' FPU rather than Zfinx/Zdinx... Had initially wanted to use Zdinx, but not so much luck convincing GCC to use it.
Then again, RV64GC is kinda the default, and this seems to be what the RV64 Linux distros are going for.
There was whatever Qualcomm was doing, but I couldn't seem to find any documentation for it.
instructions. Spending more than 25% on adds and shifts seems like a
lot. Is it address calcs? Register loads of immediates?
>
>
It is both...
>
>
In BJX2, the dominant instruction tends to be memory Load.
Typical output from BGBCC for Doom is (at runtime):
~ 70% fixed-displacement;
~ 30% register-indexed.
Static output differs slightly:
~ 84% fixed-displacement;
~ 16% register-indexed.
>
RV64G lacks register-indexed addressing, only having fixed displacement.
>
If you need to do a register-indexed load in RV64:
SLLI X5, Xo, 2 //scale the index by the element size
ADD X5, Xm, X5 //add base and index
LW Xn, X5, 0 //do the load
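In C terms this is just an ordinary indexed access (hypothetical names); on RV64G the scaling and add have to be spelled out, where a register-indexed mode folds them into the load:

// Hypothetical: 'tab[i]' with 4-byte elements.
// RV64G needs SLLI+ADD+LW; a register-indexed load does it in one op.
int fetch(int *tab, long i)
{
    return tab[i];
}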
>
This case is bad...
Which makes that 16% (above) into 48% and renormalizing to:
~ 63% fixed-displacement;
~ 36% register-indexed and support instructions.
Yeah.
I think there are reasons here why I am generally getting lackluster performance out of RV64...
Whatever combination of factors has added up means that my present attempt at BGBCC+RV64 is going at half the speed of XG2, and also around 30% slower than "GCC -O3"...
But, there is still a bit more work needed to try to make the code generation "not awful".
The more immediate priority is to try to get everything working "more or less correctly".
And, at the moment, Doom has some obvious issues:
Starts and goes immediately into the first demo (normally, Doom waits at the title screen for around 5 seconds);
Just endlessly loops over the first demo;
Sound effects aren't working;
...
But, it took a while of debugging to get this far.
Once I got it working without crashing during start-up, and after fixing a bug that was breaking the ability to display anything, it was initially rendering without any walls (it was drawing unbounded floors, similar to what one sees when noclipping outside the map).
Turns out this was a screw-up in the register allocation. Reworking the ABI registers is a bit of a bees' nest here, with lots of code that assumed hard-coded registers and hard-coded register ranges.
>
>
Also global variables outside the 2kB window:
LUI X5, DispHi
ADDI X5, X5, DispLo
ADD X5, GP, X5
LW Xn, X5, 0
>
Where, sorting global variables by usage priority gives:
~ 35%: in range
~ 65%: not in range
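A minimal sketch of what that sorting amounts to (hypothetical structures and limits, not BGBCC's actual code): rank globals by use count and hand out short GP-relative slots until the window is full:

#include <stdlib.h>

typedef struct {
    const char *name;
    int size;       /* bytes                                  */
    int use_count;  /* static or profiled reference count     */
    int gp_off;     /* assigned offset, or -1 if out of range */
} GlobalVar;

static int cmp_use(const void *a, const void *b)
{
    const GlobalVar *ga = a, *gb = b;
    return (gb->use_count > ga->use_count) -
           (gb->use_count < ga->use_count);   /* hottest first */
}

void assign_gp_window(GlobalVar *vars, int n)
{
    int off = 0, i;
    qsort(vars, n, sizeof(GlobalVar), cmp_use);
    for (i = 0; i < n; i++)
    {
        if (off + vars[i].size <= 2048)
        {
            vars[i].gp_off = off;   /* short GP+disp form    */
            off += vars[i].size;
        }
        else
        {
            vars[i].gp_off = -1;    /* needs LUI+ADDI+ADD+Lx */
        }
    }
}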
Illustrating the fallacy of 12 bits of displacement.
12 bits:
Can hit nearly all typical stack-frame and struct sizes: Yes;
Good as an end-all: No.
The fallback cases in RV64 hurt somewhat worse.
Like, even vs the original 9-bit displacements in BJX2:
MOV.L (R4, Disp9u), R5 //0..2K
And:
MOV Disp24u, R0
MOV.L (R4, R0), R5
So, in the original ISA design (before jumbo prefixes), a 2-op sequence could still give a 16MB range relative to a base register (and if used in this way, with an explicit R0, would mimic the SuperH behavior of using an unscaled displacement).
In this particular case:
MOV.L (R4, R0), R5
MOV.L (R0, R4), R5
Were treated as equivalent (same with the '@' character, but I mostly dropped the use of the '@' prefix character; and see little point in prefixing register names with '%' either; just makes things uglier and harder to type).
...
Comparably, XG2 has a 16K or 32K reach here (depending on immediate
size), which hits most of the global variables. The fallback Jumbo
encoding hits the rest.
I get ±32K with 16-bit displacements
Baseline has special case 32-bit ops:
MOV.L (GBR, Disp10u), Rn //4K
MOV.Q (GBR, Disp10u), Rn //8K
But, in XG2, it gains 2 bits:
MOV.L (GBR, Disp12u), Rn //16K
MOV.Q (GBR, Disp12u), Rn //32K
Jumbo can encode +/- 4GB here (64-bit encoding).
MOV.L (GBR, Disp33s), Rn //+/- 4GB
MOV.Q (GBR, Disp33s), Rn //+/- 4GB
Mostly because GBR displacements are unscaled.
Plan for XG3 is that all Disp33s encodings would be unscaled.
>
Theoretically, could save 1 instruction here, but would need to add two
more reloc types to allow for:
LUI, ADD, Lx
LUI, ADD, Sx
Because, annoyingly, Load and Store have different displacement encodings;
and I still need the base form for other cases.
>
>
More compact way to load/store global variables would be to use absolute
32-bit or PC relative:
LUI + Lx/Sx : Abs32
AUIPC + Lx/Sx : PC-Rel32
MEM Rd,[IP,,DISP32/64] // IP-rel
-----
BJX2 can also do (PC, Disp33s) in a single logical instruction...
But, RISC-V can't...
>
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
(there does seem to be some interest for ELF FDPIC but limited to 32-bit
RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
from PBO (namely, using GP for a global section and then chaining the
sections for each binary).
How are you going to do dense PIC switch() {...} in RISC-V ??
Already implemented...
With pseudo-instructions:
SUB Rs, $(MIN), R10
MOV $(MAX-MIN), R11
BGTU R11, R10, Lbl_Dfl
MOV .L0, R6 //AUIPC+ADD
SHAD R10, 2, R10 //SLLI
ADD R6, R10, R6
JMP R6 //JALR X0, X6, 0
.L0:
BRA Lbl_Case0 //JAL X0, Lbl_Case0
BRA Lbl_Case1
...
This part is similar to how it worked before, if slightly longer; the older BJX2 sequence was:
SUB Rs, $(MIN), R4
CMPHI $(MAX-MIN), R4
BT Lbl_Dfl
ADD R4, R4
BRAF R4 //Branches to PC+(R4<<1)
BRA Lbl_Case0
BRA Lbl_Case1
...
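Both listings implement the same C-level shape, a dense switch (hypothetical case values and bodies): subtract MIN, bounds-check against MAX-MIN, then index a PC-relative table of branches:

int classify(int c)
{
    switch (c)   /* dense cases 10..13 -> branch table */
    {
        case 10: return 1;
        case 11: return 2;
        case 12: return 3;
        case 13: return 4;
        default: return 0;
    }
}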
Main difference being that FDPIC uses fat function pointers and does the
GP reload on the caller side, vs PBO where I use narrow function pointers
and do the reload on the callee side (with load-time fixups for the PBO
Offset).
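A rough C sketch of the distinction (hypothetical layout, not either ABI's actual definition): the FDPIC-style caller carries the callee's global base in the pointer, while the PBO-style callee re-derives its own base in the prologue:

/* FDPIC-style fat pointer: the caller loads both fields and sets GP
   before the indirect call. */
typedef struct {
    void (*entry)(void);   /* code address         */
    void  *gp;             /* callee's global base */
} FatFuncPtr;

void call_fat(FatFuncPtr f)
{
    /* pseudo: GP = f.gp; */
    f.entry();
}

/* PBO-style: plain code pointer; the callee's prologue reloads its
   own global base (the offset is patched at load time). */
static char global_section[4096];   /* stand-in for this binary's globals */

void callee_pbo(void)
{
    char *gp = global_section;      /* stands in for the prologue reload */
    (void)gp;
    /* ... function body ... */
}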
>
>
The result of all this is a whole lot of unnecessary shifts and ADDs.
Seemingly even more so for BGBCC than for GCC, which already had a lot of
shifts and adds.
>
BGBCC basically entirely dethrones the Load and Store ops ...
>
>
Possibly more so than GCC, which tended to turn most constant loads into
memory loads. It would load the address of a table of constants into a
register and then pull constants from the table, rather than composing
them inline.
>
Say, something like:
AUIPC X18, DispHi
ADDI X18, X18, DispLo
(X18 now holds the address of a table of constants in .rodata)
>
And, when it needs a constant:
LW Xn, X18, Disp //offset of the constant it wants.
Or:
LD Xn, X18, Disp //64-bit constant
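In C-like terms, that strategy is roughly equivalent to the sketch below (hypothetical names): the constants sit in a .rodata table, one AUIPC/ADDI pair materializes the base, and each use is then a single load:

#include <stdint.h>

static const uint64_t const_pool[] = {
    0x123456789ABCDEF0ull,
    0x0F0F0F0F0F0F0F0Full,
};

uint64_t use_constants(uint64_t x)
{
    const uint64_t *pool = const_pool;   /* AUIPC + ADDI       */
    x ^= pool[0];                        /* LD Xn, X18, Disp   */
    x &= pool[1];                        /* LD Xn, X18, Disp+8 */
    return x;
}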
>
>
Currently, BGBCC does not use this strategy.
Though, for 64-bit constants it could be more compact and faster.
>
But, better still would be having Jumbo prefixes or similar, or even a
SHORI instruction.
Better Still Still is having 32-bit and 64-bit constants available
from the instruction stream and positioned in either operand position.
Granted...
Say, 64-bit constant-load in SH-5 or similar:
xxxxyyyyzzzzwwww
MOV ImmX, Rn
SHORI ImmY, Rn
SHORI ImmZ, Rn
SHORI ImmW, Rn
Where, one loads the constant in 16-bit chunks.
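In C terms each SHORI step is "shift the partial value left 16 and OR in the next chunk"; a minimal sketch of rebuilding xxxxyyyyzzzzwwww that way (ignoring the sign-extension of the initial MOV):

#include <stdint.h>

uint64_t build_const(uint16_t x, uint16_t y, uint16_t z, uint16_t w)
{
    uint64_t r = x;        /* MOV   ImmX, Rn */
    r = (r << 16) | y;     /* SHORI ImmY, Rn */
    r = (r << 16) | z;     /* SHORI ImmZ, Rn */
    r = (r << 16) | w;     /* SHORI ImmW, Rn */
    return r;
}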
Yech
But, 4 is still less than 6.
ARM64 had instead used the strategy of being able to load a 16-bit value into each position within a register (MOVZ/MOVK).
Both strategies will still need 4 instructions to load a full 64 bits.
As noted, for BJX2 I had mostly gone over to a jumbo-prefixed encoding, which allows doing the whole thing in 3 instruction words and a single clock-cycle.
>
>
Don't you ever snip anything ??
Sometimes...