Re: Misc: BGBCC targeting RV64G, initial results...

Subject : Re: Misc: BGBCC targeting RV64G, initial results...
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.arch
Date : 27. Sep 2024, 19:26:28
Organization : A noiseless patient Spider
Message-ID : <vd6tf8$r27h$1@dont-email.me>
References : 1 2
User-Agent : Mozilla Thunderbird
On 9/27/2024 7:50 AM, Robert Finch wrote:
On 2024-09-27 5:46 a.m., BGB wrote:
Had recently been working on getting BGBCC to target RV64G.
>
>
So, for Doom, ".text" sizes at the moment:
   BGBCC+XG2 : 292K (seems to have shrunk in all this)
   BGBCC+RV64: 438K
   GCC  +RV64: 445K (PIE)
>
Doom Framerates:
   BGBCC+XG2 : ~ 25-30
   BGBCC+RV64: ~  8-14
   GCC  +RV64: ~ 15-20
>
Start of E1M1 (framerate):
   BGBCC+XG2 : ~ 25
   BGBCC+RV64: ~ 12
   GCC  +RV64: ~ 16
>
>
How does RV64 compare to BGBCC+XG2? Is it trying to execute more than one op at a time? I assume XG2 is.
 
For XG2, it is potentially 3-wide (but, on-average bundle size is around 1.2 to 1.4).
For RV64, it is 2-wide in-order superscalar.
However, in the current BGBCC output, there is no instruction-shuffling, so the generated code has fairly low ILP (bundle ~ 1.05).
This is vs ~ 1.10 to 1.25 for GCC output.
RV64 also seems to result in more register RAW dependencies. It is comparatively more sensitive to ALU and memory-load latency as well (RV64 is affected much more than BJX2 by 1 vs 2 cycle latency on ADD instructions).
But, BJX2 does not spam the ADD instruction quite so hard, so is more forgiving of latency. In this case, an optimization that reduces common-case ADD to 1 cycle was being used (it only works though in the CPU core if the operands are both in signed 32-bit range and no overflow occurs; IIRC optionally using a sign-extended AGU output as a stopgap ALU output before the output arrives from the main ALU the next cycle).

 
Comparably, it appears BGBCC leans more heavily into ADD and SLLI than GCC does, with a fair chunk of the total instructions executed being these two (more cycles are spent adding and shifting than doing memory load or store...).
 That seems to be a bit off. Mem ops are usually around 1/4 of instructions. Spending more than 25% on adds and shifts seems like a lot. Is it address calcs? Register loads of immediates?
 
It is both...
In BJX2, the dominant instruction tends to be memory Load.
   Typical output from BGBCC for Doom is (at runtime):
     ~ 70% fixed-displacement;
     ~ 30% register-indexed.
   Static output differs slightly:
     ~ 84% fixed-displacement;
     ~ 16% register-indexed.
RV64G lacks register-indexed addressing, only having fixed displacement.
If you need to do a register-indexed load in RV64:
   SLLI  X5, Xo, 2  //shift by size of index
   ADD X5, Xm, X5  //add base and index
   LW  Xn, X5, 0   //do the load
This case is bad...
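As a rough sketch of the sequence above (Python purely for illustration; `emit_indexed_load` is a made-up helper name, not anything in BGBCC, and X5 is assumed to be free as scratch):

```python
def emit_indexed_load(dst, base, index, elem_size, tmp="X5"):
    """Return the RV64G sequence for dst = base[index] (hypothetical helper)."""
    # Load mnemonic per element size (sign-extending forms; LD for 8 bytes).
    loads = {1: "LB", 2: "LH", 4: "LW", 8: "LD"}
    shift = elem_size.bit_length() - 1            # log2 of the element size
    seq = []
    if shift:
        seq.append(f"SLLI {tmp}, {index}, {shift}")   # shift by size of index
        seq.append(f"ADD {tmp}, {base}, {tmp}")       # add base and index
    else:
        seq.append(f"ADD {tmp}, {base}, {index}")     # byte access: no shift needed
    seq.append(f"{loads[elem_size]} {dst}, {tmp}, 0")  # do the load
    return seq

for line in emit_indexed_load("Xn", "Xm", "Xo", 4):
    print(line)
```

Only the byte case gets away with 2 instructions; everything else pays for the shift.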
Also global variables outside the 2kB window:
   LUI   X5, DispHi
   ADDI  X5, X5, DispLo
   ADD   X5, GP, X5
   LW    Xn, X5, 0
Where, sorting global variables by usage priority gives:
   ~ 35%: in range
   ~ 65%: not in range
Comparably, XG2 has a 16K or 32K reach here (depending on immediate size), which hits most of the global variables. The fallback Jumbo encoding hits the rest.
Theoretically, could save 1 instruction here, but would need to add two more reloc types to allow for:
   LUI, ADD, Lx
   LUI, ADD, Sx
Because annoyingly Load and Store have different displacement encodings; and I still need the base form for other cases.
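A minimal sketch of the displacement split behind both forms, assuming the usual rounding trick (add 0x800 before taking the high part so the low part lands in the signed 12-bit range). The names `split_disp` and `emit_global_load` and the `fold_lo` flag are made up for illustration; the folded form is the one that would need the extra reloc types:

```python
def split_disp(disp):
    """Split a displacement into (hi20, lo12) with lo12 in -2048..2047.

    Assumes disp stays well inside the signed 32-bit range, so the
    sign extension of LUI on RV64 does not come into play.
    """
    hi = (disp + 0x800) >> 12          # round so that lo can be negative
    lo = disp - (hi << 12)
    return hi & 0xFFFFF, lo

def emit_global_load(dst, disp, fold_lo=False, tmp="X5"):
    """GP-relative 32-bit load; 1, 3, or 4 instructions."""
    if -2048 <= disp <= 2047:
        return [f"LW {dst}, GP, {disp}"]            # inside the 2kB window
    hi, lo = split_disp(disp)
    if fold_lo:                                     # LUI, ADD, Lx
        return [f"LUI {tmp}, {hi:#x}",
                f"ADD {tmp}, GP, {tmp}",
                f"LW {dst}, {tmp}, {lo}"]
    return [f"LUI {tmp}, {hi:#x}",                  # current 4-instruction form
            f"ADDI {tmp}, {tmp}, {lo}",
            f"ADD {tmp}, GP, {tmp}",
            f"LW {dst}, {tmp}, 0"]

print(emit_global_load("Xn", 0x12345))
print(emit_global_load("Xn", 0x12FFF, fold_lo=True))
```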
More compact way to load/store global variables would be to use absolute 32-bit or PC relative:
   LUI + Lx/Sx : Abs32
   AUIPC + Lx/Sx : PC-Rel32
BGBCC does not use these as they are incompatible with TestKern as it exists. Well, PC-Rel can be made to work, but requires loading a separate copy of each EXE or DLL for every process instance when using a single address space (which is inefficient).
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64 (there does seem to be some interest for ELF FDPIC but limited to 32-bit RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off from PBO (namely, using GP for a global section and then chaining the sections for each binary). Main difference being that FDPIC uses fat function pointers and does the GP reload on the caller, vs PBO where I use narrow function pointers and do the reload on the callee (with load-time fixups for the PBO Offset).
Similar for fixed displacement Load/Store greater than 2K (though, only ~ 0.05%):
   LUI  X5, DispHi
   ADD  X5, X5, Xm
   LW   Xn, X5, DispLo
Constant loading:
   32-bit (majority):
     sssssxxx: ADDI Xn, X0, Imm
     xxxxx000: LUI Xn, ImmHi
     00000000..7FFFF7FF:
       LUI Xn, ImmHi
       ADDI Xn, Xn, ImmLo
     7FFFF800..7FFFFFFF:
       LUI Xn, ImmHi
       XORI Xn, Xn, ImmLo
     80000000..FFFFFFFF (sign extended):
       LUI Xn, ImmHi
       ADDI Xn, Xn, ImmLo
     80000000..FFFFFFFF (zero extended):
       LUI Xn, ImmHi
       XORI Xn, Xn, ImmLo
       But, only if ImmLo is 0x800..0xFFF ...
   64-bit (not R5):
     xxxxx000_00000000:
       LUI   Xn, ImmHi
       SLLI  Xn, Xn, 32
     xxxxxyyy_00000000:
       LUI   Xn, ImmX
       ADDI  Xn, Xn, ImmY
       SLLI  Xn, Xn, 32
     xxxxxyyy_zzzzz000:
       LUI   X5, ImmX
       ADDI  X5, X5, ImmY
       SLLI  X5, X5, 32
       LUI   Xn, ImmZ
       ADD   Xn, Xn, X5
     xxxxxyyy_zzzzzwww:
       LUI   X5, ImmX
       ADDI  X5, X5, ImmY
       SLLI  X5, X5, 32
       LUI   Xn, ImmZ
       ADDI  Xn, Xn, ImmW
       ADD   Xn, Xn, X5
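A small Python model of the sign-extended cases in the table above, mostly to sanity-check the arithmetic. This is a sketch, not BGBCC's logic: `load_const32`, `load_const64`, and `run` are made-up names, the zero-extended XORI row is not modelled, and the 64-bit helper does not special-case all-zero halves (so it can emit more instructions than the table lists for those patterns). The way the upper half is adjusted for a negative lower half is an assumption about how to make the final ADD come out right.

```python
MASK64 = (1 << 64) - 1

def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def load_const32(value, rd="Xn"):
    """Sequence for a sign-extended 32-bit constant, as (op, rd, rs1, arg)."""
    value = sext(value, 32)
    if -2048 <= value <= 2047:
        return [("ADDI", rd, "X0", value)]
    lo = sext(value, 12)
    if value - lo < 1 << 31:                 # LUI+ADDI does not overflow
        hi = ((value - lo) >> 12) & 0xFFFFF
        return [("LUI", rd, None, hi)] + ([("ADDI", rd, rd, lo)] if lo else [])
    # 0x7FFFF800..0x7FFFFFFF: LUI gives 0xFFFFFFFF80000000, XORI flips it back
    return [("LUI", rd, None, 0x80000), ("XORI", rd, rd, lo)]

def load_const64(value, rd="Xn", tmp="X5"):
    """General two-register form: build the upper half, shift, add the lower."""
    value &= MASK64
    low = sext(value, 32)                    # what the lower sequence leaves in rd
    upper = ((value - low) & MASK64) >> 32   # compensates for low's sign extension
    return (load_const32(upper, tmp) + [("SLLI", tmp, tmp, 32)]
            + load_const32(low, rd) + [("ADD", rd, rd, tmp)])

def run(seq, rd="Xn"):
    """Execute a sequence with RV64 semantics and return rd as unsigned."""
    regs = {"X0": 0}
    for op, dst, src, arg in seq:
        if op == "LUI":
            res = sext(arg << 12, 32)        # LUI sign-extends from bit 31
        elif op == "ADDI":
            res = regs[src] + arg
        elif op == "XORI":
            res = regs[src] ^ arg
        elif op == "SLLI":
            res = regs[src] << arg
        else:                                # ADD: arg names the second source
            res = regs[src] + regs[arg]
        regs[dst] = sext(res, 64)
    return regs[rd] & MASK64

print(hex(run(load_const32(0x7FFFFFFF))))
print(hex(run(load_const64(0x123456789ABCDEF0))))
```

In this model the 64-bit form never needs more than 6 instructions, which matches the worst case in the table.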
One also needs to use shifts to sign or zero-extend things.
   EXTS.W maps to:
     SLLI  Xn, Xm, 48
     SRAI  Xn, Xn, 48
   EXTU.W maps to:
     SLLI  Xn, Xm, 48
     SRLI  Xn, Xn, 48
   EXTU.L maps to:
     SLLI  Xn, Xm, 32
     SRLI  Xn, Xn, 32
...
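For reference, a quick Python model of what those shift pairs compute on a 64-bit register (the function names just mirror the mnemonics; nothing here is BGBCC code):

```python
MASK64 = (1 << 64) - 1

def slli(x, n):
    """Shift left logical, truncated to 64 bits."""
    return (x << n) & MASK64

def srli(x, n):
    """Shift right logical (zero fill)."""
    return (x & MASK64) >> n

def srai(x, n):
    """Shift right arithmetic (sign fill), result as an unsigned 64-bit pattern."""
    x &= MASK64
    if x >> 63:
        x -= 1 << 64
    return (x >> n) & MASK64

def exts_w(x):   # sign-extend 16 bits: SLLI 48, SRAI 48
    return srai(slli(x, 48), 48)

def extu_w(x):   # zero-extend 16 bits: SLLI 48, SRLI 48
    return srli(slli(x, 48), 48)

def extu_l(x):   # zero-extend 32 bits: SLLI 32, SRLI 32
    return srli(slli(x, 32), 32)

print(hex(exts_w(0x12348000)), hex(extu_w(0x12348000)), hex(extu_l(MASK64)))
```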
The result of all this is a whole lot of Shifts and ADDs.
Seemingly, even more for BGBCC than for GCC, which already had a lot of shifts and adds.
BGBCC basically entirely dethrones the Load and Store ops ...
Possibly more so than GCC, which tended to turn most constant loads into memory loads. It would load a table of constants into a register and then pull constants from the table, rather than compose them inline.
Say, something like:
   AUIPC  X18, DispHi
   ADDI   X18, X18, DispLo
   (X18 now holds a table of constants, pointing into .rodata)
And, when it needs a constant:
   LW  Xn, X18, Disp  //offset of the constant it wants.
Or:
   LD  Xn, X18, Disp  //64-bit constant
Currently, BGBCC does not use this strategy.
Though, for 64-bit constants it could be more compact and faster.
But, better still would be having Jumbo prefixes or similar, or even a SHORI instruction.
Say, 64-bit constant-load in SH-5 or similar:
   xxxxyyyyzzzzwwww
   MOV   ImmX, Rn
   SHORI ImmY, Rn
   SHORI ImmZ, Rn
   SHORI ImmW, Rn
Where, one loads the constant in 16-bit chunks.
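The chunked load is easy to model. A sketch in Python, assuming SHORI does "shift left by 16, then OR in a 16-bit immediate" (which is what the `i=(i<<16)|0x5555` reading implies); the exact immediate semantics of the initial MOV are not modelled, since any sign extension there is shifted out after three SHORI steps:

```python
MASK64 = (1 << 64) - 1

def shori(rn, imm16):
    """SHORI: Rn = (Rn << 16) | Imm16, truncated to 64 bits."""
    return ((rn << 16) | (imm16 & 0xFFFF)) & MASK64

def load_const64_shori(value):
    """Build a 64-bit constant from four 16-bit chunks, high chunk first."""
    chunks = [(value >> s) & 0xFFFF for s in (48, 32, 16, 0)]
    rn = chunks[0]                 # MOV   ImmX, Rn
    for imm in chunks[1:]:         # SHORI ImmY / ImmZ / ImmW, Rn
        rn = shori(rn, imm)
    return rn

print(hex(load_const64_shori(0x123456789ABCDEF0)))
```

That is 4 instructions and 1 register for any 64-bit value, with no sign-extension corner cases to work around.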
FWIW: BJX2 had used this strategy (prior to adding jumbo prefixes) but had called this instruction LDSH instead (though BGBCC generally also accepts SHORI, and for the XG3 idea I decided to go over to calling it SHORI as well).
Comparably, this needs less encoding space vs LUI, and is also more versatile.
But, XG3RV (AKA: CoEx mode) would effectively have access to both:
   Could pull LUI from RV64 or SHORI from XG3.
With jumbo prefixes, SHORI isn't needed as often, but still sometimes useful (or, if for whatever reason one doesn't want to use jumbo prefixes in a given situation; or the code just so happened to express something like "i=(i<<16)|0x5555;" or similar).
Though, had also considered possibly sneaking SHORI into the RV64 encoding space as well by reusing the would-be (but unused) ORIW and XORIW encodings for SHORI+FLDCH and a 17-bit constant load (essentially gluing the Rs1 field onto the immediate to give a 17-bit sign-extended immediate).
But, for now, BGBCC is trying to generate "proper" RV64...
   Even if the situation is "not very good"...

>
Array Load/Store:
   XG2: 1 instruction
   RV64: 3 instructions
>
Global Variable:
   XG2: 1 instruction (if within 2K of GBR)
   RV64: 1 or 4 instructions
>
Constant Load into register (not R5):
   XG2: 1 instruction
   RV64: ~ 1-6
>
Operator with 32-bit immediate:
   BJX2: 1 instruction;
   RV64: 3 instructions.
>
Operator with 64-bit immediate:
   BJX2: 1 instruction;
   RV64: 4-9 instructions.
>
>
Observations (RV64):
   LUI+ADD can't actually represent all possible 32-bit constants.
     Those near the signed-overflow point can't be expressed directly.
   LUI+XOR can get a lot of these cases.
     0x80000000ULL .. 0xFFFFFFFFULL can be partly covered by LUI+XOR.
>
For full 64-bit constants, generally need:
   LUI+ADD+LUI+ADD+SLLI+ADD
And, two registers.
>
There is currently an ugly edge case where BGBCC has to fall back to:
   LUI X5, ImmHi
   ADDI X5, X5, ImmMi
   ( SLLI X5, X5, 12; ADDI X5, X5, ImmFrag )+
>
Namely when needing to load a 64-bit constant and R5 is the only register.
>
So, if the compiler tries to emit, say:
   AND R18, 0x7F7F7F7F7F7F7F7F, R10
One may end up with, say:
   LUI X5, 0x7F7F
   ADDI X5, X5, 0x7F8
   SLLI X5, X5, 12
   ADDI X5, X5, 0xF7F
   SLLI X5, X5, 12
   ADDI X5, X5, 0x7F8
   SLLI X5, X5, 12
   ADDI X5, X5, 0xF7F
   AND X10, X18, X5
>
Which, granted, kinda sucks...
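One way to derive that chain, as a Python sketch (not necessarily how BGBCC picks the fragments): peel three sign-extended 12-bit fragments off the bottom, which leaves at most 28 signed bits for the leading LUI+ADDI. `chain_load64` and `run` are made-up names; immediates are shown signed, so -0x81 is the 0xF7F in the listing.

```python
MASK64 = (1 << 64) - 1

def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def chain_load64(value):
    """Single-register 64-bit load: LUI, ADDI, then 3x (SLLI 12, ADDI)."""
    v = sext(value, 64)
    frags = []
    for _ in range(3):                 # low fragments, least significant first
        lo = sext(v, 12)
        frags.append(lo)
        v = (v - lo) >> 12             # borrow/carry moves into the next chunk
    mid = sext(v, 12)                  # what remains fits in 28 signed bits
    hi = ((v - mid) >> 12) & 0xFFFFF
    seq = [("LUI", hi), ("ADDI", mid)]
    for frag in reversed(frags):
        seq += [("SLLI", 12), ("ADDI", frag)]
    return seq

def run(seq):
    """Execute the chain on one 64-bit register and return it as unsigned."""
    reg = 0
    for op, imm in seq:
        if op == "LUI":
            reg = sext(imm << 12, 32)
        elif op == "ADDI":
            reg = sext(reg + imm, 64)
        else:                          # SLLI
            reg = sext(reg << imm, 64)
    return reg & MASK64

for op, imm in chain_load64(0x7F7F7F7F7F7F7F7F):
    print(op, hex(imm & 0xFFFFF if op == "LUI" else imm & 0xFFF))
```

For 0x7F7F7F7F7F7F7F7F this reproduces the 8 constant-building instructions shown above, ahead of the AND.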
 
>
This is partly because BGBCC's code generation currently assumes it can just emit whatever here and the assembler will sort it out.
>
But, this case comes up rarely.
In BJX2, 33 bit cases would be handled by Jumbo prefixes, and generally 64-bit cases by loading the value into R0.
>
In RV64, this is needed for anything that doesn't fit in 12-bits; with X5 taking on the role for scratch constants and similar.
>
...
>
Floating point is still a bit of a hack, as it is currently implemented by shuffling values between GPRs and FPRs, but sorta works.
>
>
RV's selection of 3R compare ops is more limited:
   RV: SLT, SLTU
   BJX2: CMPEQ, CMPNE, CMPGT, CMPGE, CMPHI, CMPHS, TST, NTST
A lot of these cases require a multi-op sequence to implement with just SLT and SLTU.
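For illustration, the usual ways to build these from SLT/SLTU plus XOR/AND, modelled in Python. This assumes the compare ops produce 0 or 1 in a register and that the names follow the SuperH-style meanings (GT/GE signed, HI/HS unsigned, TST as "AND is zero"); the comments give the RV64 instructions each line stands for.

```python
MASK64 = (1 << 64) - 1

def signed(x):
    """Interpret a 64-bit pattern as a signed value."""
    x &= MASK64
    return x - (1 << 64) if x >> 63 else x

def slt(a, b):
    return int(signed(a) < signed(b))

def sltu(a, b):
    return int((a & MASK64) < (b & MASK64))

def cmpgt(a, b): return slt(b, a)          # SLT  d, b, a
def cmphi(a, b): return sltu(b, a)         # SLTU d, b, a
def cmpge(a, b): return slt(a, b) ^ 1      # SLT  d, a, b ; XORI d, d, 1
def cmphs(a, b): return sltu(a, b) ^ 1     # SLTU d, a, b ; XORI d, d, 1
def cmpeq(a, b): return sltu(a ^ b, 1)     # XOR  t, a, b ; SLTIU d, t, 1
def cmpne(a, b): return sltu(0, a ^ b)     # XOR  t, a, b ; SLTU  d, X0, t
def tst(a, b):   return sltu(a & b, 1)     # AND  t, a, b ; SLTIU d, t, 1
def ntst(a, b):  return sltu(0, a & b)     # AND  t, a, b ; SLTU  d, X0, t

minus_one = MASK64                         # -1 as a 64-bit pattern
print(cmpgt(minus_one, 1), cmphi(minus_one, 1))
```

So only the strict greater-than forms stay at 1 instruction; the rest take 2.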
>
>
Doom isn't quite working correctly yet with BGBCC+RV64 (still has some significant bugs), but in general game logic and rendering now seems to be working.
>
>
But, yeah, generating code for RV is more of a pain as the compiler has to work harder to try to express what it wants to do in the instructions that are available.
>
>
But, yeah, it is what it is...
>
I sort of needed RV64 support for some possible later experiments (the hybrid XG3-CoEx ISA idea would depend on having working RV64 support as a prerequisite).
>
...
>
 
