On 9/27/2024 7:50 AM, Robert Finch wrote:
On 2024-09-27 5:46 a.m., BGB wrote:
>RISC-V group opinion is that "we have done nothing to damage pipeline
>
But, BJX2 does not spam the ADD instruction quite so hard, so is more
forgiving of latency. In this case, an optimization that reduces
common-case ADD to 1 cycle was being used (it only works though in the
CPU core if the operands are both in signed 32-bit range and no overflow
occurs; IIRC optionally using a sign-extended AGU output as a stopgap
ALU output before the output arrives from the main ALU the next cycle).
>
>Most agree it is closer to 30% than 25% {{Unless you clutter up the ISA
>
Comparably, it appears BGBCC leans more heavily into ADD and SLLI than
GCC does, with a fair chunk of the total instructions executed being
these two (more cycles are spent adding and shifting than doing memory
load or store...).
That seems to be a bit off. Mem ops are usually around 1/4 of
instructions. Spending more than 25% on adds and shifts seems like a
lot. Is it address calcs? Register loads of immediates?
>Which makes that 16% (above) into 48% and renormalizing to::
>
It is both...
>
>
In BJX2, the dominant instruction tends to be memory Load.
Typical output from BGBCC for Doom is (at runtime):
~ 70% fixed-displacement;
~ 30% register-indexed.
Static output differs slightly:
~ 84% fixed-displacement;
~ 16% register-indexed.
>
RV64G lacks register-indexed addressing, only having fixed displacement.
>
If you need to do a register-indexed load in RV64:
SLLI X5, Xo, 2 //shift by size of index
ADD X5, Xm, X5 //add base and index
LW Xn, X5, 0 //do the load
>
This case is bad...
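As a rough sketch of why this hurts, the following Python models the address computed by the SLLI/ADD pair and the average number of extra ALU instructions per load. It assumes the 3-instruction sequence shown above is always used, with no reuse of the computed address; the function names are made up for illustration.

```python
MASK64 = (1 << 64) - 1

def effective_address(base, index, scale_shift=2):
    """Address produced by SLLI X5,Xo,2 ; ADD X5,Xm,X5 (64-bit wraparound)."""
    return (base + (index << scale_shift)) & MASK64

def extra_alu_ops_per_load(indexed_fraction, extra_per_indexed=2):
    """Average extra SLLI/ADD instructions per load, assuming every
    register-indexed load expands to SLLI + ADD + LW (illustrative only)."""
    return indexed_fraction * extra_per_indexed

# With the ~30% register-indexed share quoted above for runtime loads:
print(extra_alu_ops_per_load(0.30))
print(hex(effective_address(0x1000, 3)))
```

Under those assumptions, the register-indexed share alone adds a bit over half an extra ALU instruction per load, before counting anything else.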
>Illustrating the fallacy of 12 bits of displacement.
>
Also global variables outside the 2kB window:
LUI X5, DispHi
ADDI X5, X5, DispLo
ADD X5, GP, X5
LW Xn, X5, 0
>
Where, sorting global variables by usage priority gives:
~ 35%: in range
~ 65%: not in range
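For reference, a minimal Python sketch of the range check and of how an out-of-range displacement gets split between LUI and ADDI. Because ADDI sign-extends its 12-bit immediate, the upper part has to be rounded to compensate. The helper names are hypothetical, and this only models the arithmetic, not any particular assembler or linker.

```python
def fits_imm12(disp):
    """True if disp fits the signed 12-bit immediate of a load/store."""
    return -2048 <= disp <= 2047

def split_hi_lo(disp):
    """Split a displacement into (hi20, lo12) so that
    sign_extend32(hi20 << 12) + lo12 == disp."""
    lo = ((disp & 0xFFF) ^ 0x800) - 0x800      # sign-extend the low 12 bits
    hi = ((disp - lo) >> 12) & 0xFFFFF         # upper 20 bits, adjusted for lo's sign
    return hi, lo

def recombine(hi, lo):
    """What LUI hi ; ADDI lo would produce (LUI result sign-extended from 32 bits)."""
    upper = ((hi << 12) ^ 0x80000000) - 0x80000000
    return upper + lo

for d in (0x12345, 0x12FFF, -4096, 2048, -2049):
    hi, lo = split_hi_lo(d)
    assert recombine(hi, lo) == d
    print(d, fits_imm12(d), hex(hi), lo)
```

Displacements within about 2K of the top of the positive 32-bit range are left out here, since the rounded upper part would wrap negative on RV64.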
Comparably, XG2 has a 16K or 32K reach here (depending on immediate
size), which hits most of the global variables. The fallback Jumbo
encoding hits the rest.
>I get ±32K with 16-bit displacements
>MEM Rd,[IP,,DISP32/64] // IP-rel
Theoretically, could save 1 instruction here, but would need to add two
more reloc types to allow for:
LUI, ADD, Lx
LUI, ADD, Sx
Because annoyingly Load and Store have different displacement encodings;
and I still need the base form for other cases.
>
>
More compact way to load/store global variables would be to use absolute
32-bit or PC relative:
LUI + Lx/Sx : Abs32
AUIPC + Lx/Sx : PC-Rel32
>How are you going to do dense PIC switch() {...} in RISC-V ??
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
(there does seem to be some interest for ELF FDPIC but limited to 32-bit
RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
from PBO (namely, using GP for a global section and then chaining the
sections for each binary).
Main difference being that FDPIC uses fat
function pointers and does the GP reload on the caller, vs PBO where I
use narrow function pointers and do the reload on the callee (with
load-time fixups for the PBO Offset).
>unnecessary
>
>
The result of all this is a whole lot of Shifts and ADDs.
Seemingly, even more for BGBCC than for GCC, which already had a lot of
shifts and adds.
>Better Still Still is having 32-bit and 64-bit constants available
>
BGBCC basically entirely dethrones the Load and Store ops ...
>
>
Possibly more so than GCC, which tended to turn most constant loads into
memory loads. It would load a table of constants into a register and
then pull constants from the table, rather than compose them inline.
>
Say, something like:
AUIPC X18, DispHi
ADDI X18, X18, DispLo
(X18 now holds a table of constants, pointing into .rodata)
>
And, when it needs a constant:
LW Xn, X18, Disp //offset of the constant it wants.
Or:
LD Xn, X18, Disp //64-bit constant
>
>
Currently, BGBCC does not use this strategy.
Though, for 64-bit constants it could be more compact and faster.
>
But, better still would be having Jumbo prefixes or similar, or even a
SHORI instruction.
>Yech
Say, 64-bit constant-load in SH-5 or similar:
xxxxyyyyzzzzwwww
MOV ImmX, Rn
SHORI ImmY, Rn
SHORI ImmZ, Rn
SHORI ImmW, Rn
Where, one loads the constant in 16-bit chunks.
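A small Python model of that sequence, assuming SHORI shifts the register left by 16 and ORs in an unsigned 16-bit immediate. Whether the initial MOV sign- or zero-extends does not matter for a full 64-bit constant, since those upper bits get shifted out. The function names are just for illustration.

```python
MASK64 = (1 << 64) - 1

def shori(rn, imm16):
    """Rn = (Rn << 16) | imm16, truncated to 64 bits."""
    return ((rn << 16) | (imm16 & 0xFFFF)) & MASK64

def load_const64(value):
    """Build a 64-bit constant as MOV ImmX ; SHORI ImmY ; SHORI ImmZ ; SHORI ImmW."""
    x, y, z, w = [(value >> s) & 0xFFFF for s in (48, 32, 16, 0)]
    rn = x                      # MOV ImmX, Rn
    for imm in (y, z, w):       # three SHORI steps
        rn = shori(rn, imm)
    return rn

assert load_const64(0x0123456789ABCDEF) == 0x0123456789ABCDEF
```

So four instructions for any 64-bit constant, and fewer when the upper chunks are just sign or zero fill.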
>Don't you ever snip anything ??
>