On 9/27/2024 7:43 PM, MitchAlsup1 wrote:
On Fri, 27 Sep 2024 23:53:22 +0000, BGB wrote:
On 9/27/2024 2:40 PM, MitchAlsup1 wrote:
On Fri, 27 Sep 2024 18:26:28 +0000, BGB wrote:
>
But, generally this does still impose limits:
Can't reorder instructions across a label;
Can't move instructions with an associated reloc;
I always did code motion prior to assembler. Code motion only has to
consider:: 1-operand, 2-operand, 3-operand, branch, label, LD, ST.
Could make sense.
Would need to add an intermediate queue to the code generation process. This was planned at one point, but ended up not being implemented.
I had ended up adding it instead at the level of machine code, after the machine-code was emitted but before relocs were applied.
However, at this level, the mechanism can only deal with a single ISA.
So, the code written for BJX2 will not work as-is for RISC-V.
Can't reorder memory instructions unless they can be proven to not alias
(loads may be freely reordered, but the relative order of loads and
stores may not unless provably non-aliasing);
Same base register different displacement.
This is one of the heuristics.
So, Common base but different displacement;
SP and GP are assumed to never alias;
...
The effectiveness of this does depend on how the C code is written
though (works favorably with larger blocks of mostly-independent
expressions).
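The heuristics above can be sketched roughly as follows. This is a hedged sketch, not BGBCC's actual representation: the `MemRef` struct, register numbers, and function name are all hypothetical placeholders.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical memory-reference descriptor: base register,
 * displacement, and access size in bytes. */
typedef struct { int base; long disp; int size; } MemRef;

#define REG_SP 15  /* hypothetical register numbers */
#define REG_GP 14

/* Conservative may-alias test: a load may only be reordered past a
 * store (or a store past a store) when this returns false. */
static bool may_alias(MemRef a, MemRef b) {
    /* SP- and GP-relative accesses are assumed to never alias. */
    if ((a.base == REG_SP && b.base == REG_GP) ||
        (a.base == REG_GP && b.base == REG_SP))
        return false;
    /* Common base register: disjoint displacement ranges can't alias. */
    if (a.base == b.base)
        return a.disp < b.disp + b.size && b.disp < a.disp + a.size;
    /* Different bases (other than SP/GP): assume they may alias. */
    return true;
}
```

Anything not provably disjoint falls through to "may alias", which is what makes this safe but also what makes its effectiveness depend on how the C code is written.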
One of the reasons reservation stations became in vogue.
Possibly, but is a CPU feature rather than a compiler feature...
Inlining and modulo-scheduled loop unrolling could help, but (even if implemented) could also be a foot gun:
As much as one could unroll and modulo-schedule something to help performance, it could also make things worse: say, a loop that only ever executes with small N might end up slower once unrolled than before, or might not be in the hot path and thus just lead to code bloat, ...
Might be better if C had loop unrolling hints, but alas.
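Standard C indeed has no portable unroll hint, though GCC and Clang do provide non-standard pragmas for it (GCC spelling shown below; Clang's is `#pragma clang loop unroll_count(4)`). Other compilers will typically just warn about the unknown pragma and ignore it:

```c
#include <assert.h>

/* Hint the compiler to unroll this loop 4x; the pragma must
 * immediately precede the loop (GCC 8+). */
static int sum16(const int *a) {
    int s = 0;
#pragma GCC unroll 4
    for (int i = 0; i < 16; i++)
        s += a[i];
    return s;
}
```

Since the hint lives at the loop rather than in a compiler flag, it at least lets the programmer opt individual hot loops in, rather than unrolling everything and eating the I$ misses.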
Saw a video not too long ago where someone was making code faster by undoing a lot of loop unrolling, as the code was apparently losing more to I$ misses than it was gaining by being unrolled.
Though, the video was talking about code running on an old MIPS chip with, apparently:
Fairly small direct-mapped L1 I$;
No L2 cache.
Which made the CPU fairly sensitive to L1 I$ misses, with a performance hit whenever a loop body became too large to fit into the L1 I$ (or in other cases where code alignment was unfavorable).
Most agree it is closer to 30% than 25% {{Unless you clutter up the ISA
such that your typical memref needs a support instruction.
>
>
Cough, RV64...
-----
Which makes that 16% (above) into 48% and renormalizing to::
~ 63% fixed-displacement;
~ 36% register-indexed and support instructions.
>
Yeah.
>
I think there are reasons here why I am generally getting lackluster
performance out of RV64...
-----
Comparably, XG2 has a 16K or 32K reach here (depending on immediate
size), which hits most of the global variables. The fallback Jumbo
encoding hits the rest.
>
I get ±32K with 16-bit displacements
>
>
Baseline has special case 32-bit ops:
MOV.L (GBR, Disp10u), Rn //4K
MOV.Q (GBR, Disp10u), Rn //8K
>
But, in XG2, it gains 2 bits:
MOV.L (GBR, Disp12u), Rn //16K
MOV.Q (GBR, Disp12u), Rn //32K
>
Jumbo can encode +/- 4GB here (64-bit encoding).
MOV.L (GBR, Disp33s), Rn //+/- 4GB
MOV.Q (GBR, Disp33s), Rn //+/- 4GB
>
Mostly because GBR displacements are unscaled.
Plan for XG3 is that all Disp33s encodings would be unscaled.
The assembler gets to choose based on the memory model::
MEM Rd,[Rb,Ri<<s,DISP]
Assembler (or even linker) can choose 32-bit or 64 bit based on a
variety
of things {flags, memory model, size of linked module,...}
There is a thing where my compiler will dry-run the codegen over the program to figure out how big everything may be (say, to figure out if ".text" is big enough that the plain 32-bit branch encodings may not be sufficient, etc).
BJX2 can also do (PC, Disp33s) in a single logical instruction...
>
But, RISC-V can't...
>
What is your definition of "single logical instruction". In my parlance,
a single logical instruction can be::
ST #64-bit-const,[Rb,Ri<<s,DISP64]
is 1 instruction occupying 5 words.
Generally:
Decoded all at the same time;
Is seen to execute as a single atomic operation (non interruptible);
If pipelined, generally executes in a single clock cycle;
Is small enough to fit into a window being fetched from the L1 I$;
...
Fusion wouldn't count as the instruction sequence would otherwise seem to exist as multiple sequential operations.
A normal instruction bundle wouldn't count as each instruction in the bundle is logically independent of the others.
I don't count pseudo-instructions, as these may appear as one instruction in assembler, but are decomposed into smaller parts for the CPU.
So:
AUIPC Xn, DispHi
LW Xn, Xn, DispLo
Is not a single instruction, even if fused, as the AUIPC still makes sense on its own, and you could put an interrupt between them and the execution would still make sense.
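The AUIPC+LW pair works by splitting a 32-bit PC-relative offset into an upper 20-bit part and a sign-extended 12-bit lower part; the standard RISC-V split looks like:

```c
#include <stdint.h>

/* Split a 32-bit PC-relative offset into the AUIPC-hi and LW-lo parts.
 * The +0x800 rounds so that the lo part, which gets sign-extended,
 * always lands in the signed 12-bit range. */
static void split_hi_lo(int32_t off, int32_t *hi, int32_t *lo) {
    *hi = (int32_t)((off + 0x800) & ~0xFFF);  /* upper 20 bits, for AUIPC */
    *lo = off - *hi;                          /* signed 12 bits, for the LW disp */
}
```

Each half is meaningful on its own (hi lands a valid address in Xn), which is exactly why the pair doesn't qualify as a single logical instruction by the criteria above.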
Similarly, ability to write:
LI Xn, Imm32
Would not confer status to "LI" as a real instruction, since it decomposes into a pair (LUI + ADDI).
In contrast, a jumbo prefix by itself does not make sense; its meaning depends on the instruction being prefixed. Also, the decoder will decode a jumbo prefix and its suffix instruction at the same time.
I had before considered the possibility of a small 1-wide core decoding jumbo-prefixes sequentially; but this hasn't generally made sense vs other options:
Making the prefixes optional, and not using if targeting a small core;
Transposing the words in the decoder, which still works well for decoding 96-bit ops with a 2-wide decoder (though, at 1-wide, things would still be a little wonky, but more for pipeline reasons than decoder reasons at this point).
Though, another conceptual way to approach fetch might be to use "right justified fetch" rather than transpose:
Say, rather than fetching the 12 bytes starting at PC;
One is fetching the 12 bytes preceding NextPC.
But, potentially, right-justified fetch is even more convoluted than transposed decoding;
Right justified decode could make sense, but is practically little different than word-transposed decoding (they would differ primarily in terms of the conceptual order of the instruction decoders).
Say:
DecodeC DecodeB DecodeA
------- ------- aaaaaaa //1-wide
------- aaaaaaa bbbbbbb //2-wide
aaaaaaa bbbbbbb ccccccc //3-wide
In either case, for a 2-wide core, one might have a DecodeC whose functionality is reduced to simply being stub logic for dealing with jumbo prefixes.
Similarly, J+OP may change the logical interpretation of OP, leading to a different instruction being decoded than had OP been encountered by itself.
Say, for example, if encountered together, one might see a SIMD operation; but if split in half, one sees a jumbo prefix (as an ill-defined blob of bits) and a SHAD instruction or similar (say, because "SHAD with big immediate" doesn't really make sense otherwise, so it is effectively "free real estate" for things like SIMD ops...).
On the other hand, a jumbo prefix is always a jumbo prefix:
The understanding of whether or not it is a jumbo prefix does not depend on the ability of the following instruction to accept a jumbo prefix.
This differs, say, from recognizing LUI+ADDI and fusing them: one needs to recognize both the LUI and the ADDI as able to fuse with each other, and also to invoke special-case behavior if they use the same register.
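A fusion check of that kind might look like the following, using the standard RV32I/RV64I encodings (LUI opcode 0x37; ADDI is opcode 0x13 with funct3 0):

```c
#include <stdbool.h>
#include <stdint.h>

/* Return true if two adjacent 32-bit words form a fusible
 * LUI+ADDI pair building one constant in one register. */
static bool can_fuse_lui_addi(uint32_t w0, uint32_t w1) {
    if ((w0 & 0x7F) != 0x37) return false;       /* w0 must be LUI */
    if ((w1 & 0x7F) != 0x13) return false;       /* w1 must be OP-IMM */
    if (((w1 >> 12) & 0x7) != 0) return false;   /* funct3 0 == ADDI */
    uint32_t rd0 = (w0 >> 7)  & 0x1F;
    uint32_t rd1 = (w1 >> 7)  & 0x1F;
    uint32_t rs1 = (w1 >> 15) & 0x1F;
    return rd0 == rd1 && rd0 == rs1;             /* same register chain */
}
```

Note how much more the fuser has to know: both opcodes, the funct3, and the three register fields, versus the prefix case where one bit-pattern test suffices.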
For the jumbo prefix:
Recognize that it is a jumbo prefix;
Inform the decoder for the following instruction of this fact
(via internal flag bits);
Provide the prefix's data bits to the corresponding decoder.
Unlike a "real" instruction, a jumbo prefix does not need to provide behavior of its own, merely be able to be identified as such and to provide payload data bits.
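That recognize/flag/forward flow can be sketched as below. The encoding test and field widths here are hypothetical placeholders, not the actual BJX2 encoding:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: treat a top nibble of 0xF as marking a jumbo prefix. */
static bool is_jumbo(uint32_t w) { return (w >> 28) == 0xF; }

/* State forwarded from one decode lane to the next. */
typedef struct {
    bool     from_jumbo;  /* previous word was a jumbo prefix */
    uint32_t payload;     /* prefix's data bits (hypothetically low 24) */
} DecodeCtx;

/* Decode one word; returns true if a real instruction was produced. */
static bool decode_word(uint32_t w, DecodeCtx *ctx, uint64_t *imm_out) {
    if (is_jumbo(w)) {
        /* A prefix has no behavior of its own: flag it and forward bits. */
        ctx->from_jumbo = true;
        ctx->payload = w & 0xFFFFFF;
        return false;
    }
    /* Suffix decoder widens its (hypothetical) 16-bit immediate. */
    uint64_t imm = w & 0xFFFF;
    if (ctx->from_jumbo)
        imm |= (uint64_t)ctx->payload << 16;  /* 24+16 = 40-bit immediate */
    ctx->from_jumbo = false;
    *imm_out = imm;
    return true;
}
```

The point of the sketch is what is absent: the prefix path has no execute semantics at all, only an identification test and a payload hand-off.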
For now, there are not any encodings larger than 96 bits.
Partly this is because 128 bit fetch would likely add more cost and complexity than it is worth at the moment.
>
>
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
(there does seem to be some interest for ELF FDPIC but limited to 32-bit
RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
from PBO (namely, using GP for a global section and then chaining the
sections for each binary).
>
How are you going to do dense PIC switch() {...} in RISC-V ??
>
Already implemented...
>
With pseudo-instructions:
SUB Rs, $(MIN), R10
MOV $(MAX-MIN), R11
BGTU R11, R10, Lbl_Dfl
>
MOV .L0, R6 //AUIPC+ADD
SHAD R10, 2, R10 //SLLI
ADD R6, R10, R6
JMP R6 //JALR X0, X6, 0
>
.L0:
BRA Lbl_Case0 //JAL X0, Lbl_Case0
BRA Lbl_Case1
...
Compared to::
// ADD Rt,Rswitch,#-min
JTT Rt,#max
.jttable min, ... , max, default
adder:
The ADD is not necessary if min == 0
The JTT instruction compares Rt with 0 on the low side and max
on the high side. If Rt is out of bounds, default is selected.
The table displacements come in {B,H,W,D} selected in the JTT
(jump through table) instruction. Rt indexes the table, its
signed value is <<2 and added to address which happens to be
address of JTT instruction + #(max+1)<<entry. {{The table is
fetched through the ICache with execute permission}}
Thus, the table is PIC; and generally 1/4 the size of typical
switch tables.
-----
Potentially it could be more compact.
In premise, I could further reduce the "JMPTAB" operation, at least for BJX2:
SUB Rs, $(MIN), R4
CMPHI $(MAX-MIN), R4
BT Lbl_Dfl
BRA.L R4
BRA Lbl_Case0
BRA Lbl_Case1
...
Where I could use BRA.L and save 1 instruction over the use of BRAF, but OTOH:
BRA.L uses a 4-byte scale vs the 2-byte scale used by every other branch op, and if added would only ever see use in JMPTAB.
In this case, BRAF was basically grandfathered in from SuperH (where its roles were typically also in implementing jump tables and instruction slides). But, as noted, also has the slight annoyance of assuming a 16-bit instruction size (whereas jump-tables generally use 32-bit branch encodings, and 16-bit ops are not a thing in XG2 Mode).
Granted, a JTT-like instruction could make sense if behaviorally similar to a BRAF or BRA.L:
JTTL Imm10, Rn
If Rn < Imm10u:
Branch to PC+4+Rn*4
Else:
Branch to PC+4+Imm10*4
Where the else case is understood as holding a branch to the "default" label.
SUB Rs, $(MIN), R4
JTTL $(MAX-MIN), R4
BRA Lbl_Case0
BRA Lbl_Case1
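The JTTL semantics given above reduce to picking a slot index and scaling by the 4-byte branch size; a minimal sketch:

```c
#include <stdint.h>

/* JTTL as specified above: an in-range Rn selects its case slot,
 * anything out of range falls through to the slot at Imm10 (which
 * is understood to hold a branch to the default label). */
static uint64_t jttl_target(uint64_t pc, uint32_t imm10, uint64_t rn) {
    uint64_t slot = (rn < imm10) ? rn : imm10;
    return pc + 4 + slot * 4;
}
```

Since every slot is a normal 32-bit branch, the table stays PIC for free and no separate bounds-check branch is needed.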
Could poke at it, but saving a few clock cycles on "switch()" dispatch isn't an immediate priority.
Currently, BGBCC does not use this strategy.
Though, for 64-bit constants it could be more compact and faster.
>
But, better still would be having Jumbo prefixes or similar, or even a
SHORI instruction.
>
Better Still Still is having 32-bit and 64-bit constants available
from the instruction stream and positioned in either operand position.
>
>
Granted...
>
>
Say, 64-bit constant-load in SH-5 or similar:
xxxxyyyyzzzzwwww
MOV ImmX, Rn
SHORI ImmY, Rn
SHORI ImmZ, Rn
SHORI ImmW, Rn
Where, one loads the constant in 16-bit chunks.
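The four-instruction sequence builds the constant 16 bits at a time: the MOV sign-extends the top chunk, and each SHORI shifts the register left 16 and ORs in the next chunk:

```c
#include <stdint.h>

/* MOV Imm16: sign-extend a 16-bit immediate into the register. */
static uint64_t mov_imm16(int16_t imm) { return (uint64_t)(int64_t)imm; }

/* SHORI: shift the register left 16, OR in a 16-bit immediate. */
static uint64_t shori(uint64_t r, uint16_t imm) { return (r << 16) | imm; }
```

The sign extension on the initial MOV is what lets negative constants start the chain without an extra instruction.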
>
Yech
>
>
But, 4 is still less than 6.
1 is less than 4, too.
Granted...
As noted, I also have what is effectively:
J_IMM Imm24Hi; J_IMM Imm24Mi; MOV Imm16Lo, Rn
Being understood as a single 64-bit load.
Fetched all at once;
Decoded all at once;
Executes all at once in the pipeline.
Generally as a pattern:
LaneA Decoder sees that it is flagged as having 2 jumbo prefixes;
Decodes according to the 2-jumbo-prefix rules;
Fills its lane's 33 bit output.
LaneB decoder sees that it is a jumbo prefix;
Fills its lane's 33 bit immediate.
A later stage sees we are doing a 64-bit constant load, and glues the lane A and B immediate fields together to produce the output value.
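The glue step amounts to concatenating the two 24-bit jumbo payloads with the 16-bit base immediate (24 + 24 + 16 = 64):

```c
#include <stdint.h>

/* Concatenate the two 24-bit jumbo payloads and the 16-bit base
 * immediate into a full 64-bit constant. */
static uint64_t glue_imm64(uint32_t imm24_hi, uint32_t imm24_mi,
                           uint16_t imm16_lo) {
    return ((uint64_t)(imm24_hi & 0xFFFFFF) << 40)
         | ((uint64_t)(imm24_mi & 0xFFFFFF) << 16)
         | imm16_lo;
}
```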
By itself, each jumbo prefix decodes as a NOP.