On 10/1/2024 5:00 AM, Robert Finch wrote:

Yes, that would work. But I still think postfixes are a little easier to work with. One can assume no register fetches are needed for the postfix, so the last decoder slot does not need to mux registers. If there were a prefix, there could be an extra set of register ports required, unless one gets into muxing the ports for only the instructions that need them.

On 2024-09-29 10:19 p.m., BGB wrote:
> Interrupts also don't apply to prefixes either, if one assumes that the prefix and following instruction are always decoded at the same time (forming a 64-bit instruction), which also makes them faster.
>
> On 9/29/2024 2:11 PM, MitchAlsup1 wrote:
>> One reason I prefer postfix immediates: they are much easier to work with. Interrupts do not cause issues. The instruction plus postfix can be faked to be treated as one giant instruction. The bits following the instruction are often already present on the cache line; it is just a matter of checking for a postfix when decoding the immediate constants.
>>
>> On Sat, 28 Sep 2024 4:30:12 +0000, BGB wrote:
> On 9/27/2024 7:43 PM, MitchAlsup1 wrote:
> On Fri, 27 Sep 2024 23:53:22 +0000, BGB wrote:
>
One of the reasons reservation stations became in vogue.
>
Possibly, but is a CPU feature rather than a compiler feature...
A good compiler should be able to make use of 98% of the instruction
set.
Yes, but a reservation station is not part of the ISA proper...
>
>>>------------>>
Saw a video not too long ago where the author was making code faster by undoing
a lot of loop unrolling, as the code was apparently losing more to I$
misses than it was gaining by being unrolled.
I noticed this in 1991 when we got the Mc88120 simulator up and running.
GBOoO chips are <nearly> best served when there is the smallest number
of instructions.
>
Looking it up, seems the CPU in question (MIPS R4300) was:
16K L1 I$ cache;
8K L1 D$ cache;
No L2 cache (but could be supported off-die);
1-wide scalar, 32 or 64 bit
Non-pipelined FPU and multiplier;
...
>
>
Oddly, a number of these older CPUs seem to have a larger I$ than D$, whereas IME the D$ tends to have the higher miss rate (so is easier to justify making bigger).
>
>>------------>
>
In contrast, a jumbo prefix by itself does not make sense; its meaning
depends on the thing that is being prefixed. Also, the decoder will
decode a jumbo prefix and the suffixed instruction at the same time.
How many bits does one of these jumbo prefixes consume ?
The prefix itself is 32 bits.
In the context of XG3, it supplies 23 or 27 bits.
>
>
For RISC-V ops, they could supply 21 or 26 bits.
>
23+10 = 33 (XG3)
21+12 = 33 (RV op)
27+27+10 = 64 (XG3)
26+26+12 = 64 (RV op)
>
J27 could synthesize an immediate for non-immediate ops:
27+6 = 33 (XG3)
27+5 = 32 (RV)
>
>
For BJX2, the prefixes supply 24 bits (can be stretched to 27 bits in XG2).
24+ 9/10=33 (Base)
24+24+16=64 (Base)
27+27+10=64 (XG2)
>
>
>
But, yeah, perhaps unsurprisingly, the RISC-V people are not so optimistic about the idea of jumbo prefixes...
>
>
Also apparently it seems "yeah, here is a prefix whose primary purpose is just to make the immediate field bigger for the following instruction" is not such an obvious or intuitive idea as I had thought.
>
>
Well, and people obsessing on what happens if an interrupt somehow occurs "between" the prefix and prefixed instruction.
>
Q+ had postfixes that could override a register spec. as well as supply additional constant bits. If an interrupt occurs between the instruction and the postfix, the postfix can be treated as a NOP at the return point.
>
Some of the RISC-V people were assuming prefixes that worked like:
Set some magic internal state in a hidden register (1 cycle);
Decoder for next instruction grabs this state, clearing the register.
Which can work for a narrow implementation, but lacks much performance advantage over a multi-op sequence.
I switched the idea over to using:
* 0iiiiii-iiiii-jjjjj-100-kkkkk-00-11011 J21I
* 1iiiiii-iiiii-zzzzz-100-omnpp-00-11011 J21O / J12
* tiiiiii-iiiii-jjjjj-100-kkkkk-00-11011 J22
As a 21-bit prefix for RISC-V ops.
It is incapable of encoding 64-bit immediate values, but can at least encode 33-bit values.
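The 21+12 fusion can be sketched as plain bit manipulation. This is a minimal model of how a 21-bit prefix payload might combine with a 12-bit instruction immediate to form a 33-bit constant; the exact field placement is my assumption for illustration, not the actual decoder logic.

```c
#include <stdint.h>

/* Sketch: fuse a 21-bit jumbo-prefix payload (high bits) with a
   12-bit immediate field (low bits) into a 33-bit constant,
   sign-extended to 64 bits. Field layout is assumed. */
int64_t fuse_imm33(uint32_t pfx_imm21, uint32_t imm12)
{
    /* prefix supplies bits [32:12], the base op supplies bits [11:0] */
    uint64_t raw = ((uint64_t)(pfx_imm21 & 0x1FFFFF) << 12)
                 | (imm12 & 0xFFF);
    /* sign-extend from bit 32 */
    return ((int64_t)(raw << 31)) >> 31;
}
```

With all 33 payload bits set, this yields -1, matching what a sign-extended 33-bit immediate should decode to.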
At least, thus far, a combination of jumbo prefixes, indexed load/store, the instructions from Zba, etc., can gain roughly a 16% code-size reduction and a 40% speed increase.
The biggest single deltas were still seen with indexed load/store and a load/store pair instruction.
Though, there is still a fairly large delta between modified RV and XG2 regarding performance.
Some may be due to the difference between 32 and 64 GPRs, but 64 GPRs via a prefix would be undesirable for code-density (there is no way to expand RV to 64 GPRs for 32-bit encodings short of adding a modal encoding).
Register interlock penalties (stalls due to register RAW) are a little high, but not unreasonable; could be helped by adding instruction-shuffling to the compiler.
Doom performance is suffering some for RV because it lacks a 32x32=>64 widening multiply instruction, and doing "FixedMul" with a 64-bit multiply isn't fast in this case.
Note: Q+ was switched to not using postfixes. Almost everything is single 64-bit instructions now. 64-bit constants can be encoded with just two instructions.

One possibility, though it wouldn't likely be so good for code density.
>
In my own extension experiments to RISC-V, I have also added:
LDI Rn, Imm17s
SHORI Rn, Imm16u // Rn=(Rn<<16)|Imm
J21+LDI is not a thing, so instead 33-bit constant loads can be encoded as J21+ADDI.
J21+SHORI causes it to revert to an Imm12 encoding, so:
SHORI Rn, Rs, Imm32u // Rn=(Rn<<32)|Imm
This can load a 64-bit value in two logical instructions (2 clock-cycles with fused decode).
One other possibility is interpreting the prefix as "J22", where:
J22+J22+LUI => LI Rn, Imm64
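The SHORI-style constant builders above can be modeled directly. This is a sketch of the stated semantics (Rn=(Rn<<16)|Imm and the widened Rn=(Rn<<32)|Imm form), showing a 64-bit constant assembled in two logical steps; the starting value stands in for whatever the first instruction loads.

```c
#include <stdint.h>

/* SHORI semantics as described: shift the register left and OR in
   an immediate chunk. */
uint64_t shori16(uint64_t rn, uint16_t imm) { return (rn << 16) | imm; }
uint64_t shori32(uint64_t rn, uint32_t imm) { return (rn << 32) | imm; }

/* Build 0x123456789ABCDEF0 in two logical steps, as in the fused
   case: load the high 32 bits, then append the low 32 bits with
   the widened SHORI. */
uint64_t build_example(void)
{
    uint64_t r = 0x12345678;        /* stands in for the first load */
    return shori32(r, 0x9ABCDEF0);  /* Rn = (Rn<<32) | Imm32 */
}
```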
For double-branch tables, it is mostly constant-time.

Which, as I have tended to implement them, is simply not possible, since everything is fetched and decoded at the same time.

How well does JTT work with large tables? What if there are several hundred table entries?
>
>
Granted, yes, it does add the drawback of needing to have tag-bits to remember the mode, and maybe the CPU hiding mode bits in the high order bits of the link register and similar is not such an elegant idea.
>
>
But, as I see it, still preferable to:
Hey, why not just define a bunch of 48-bit encodings for ALU operations with 32-bit immediate fields?...
>
>
But, like, blarg, this is what I did originally.
And, I dropped all this in favor of jumbo prefixes, because jumbo prefixes did this job better.
>
>
>
Might still experiment with an "Extended RISC-V" and see if, in fact, adding things like jumbo prefixes will make as much of a difference as I expect.
>
Well, probably along with indexed load/store and Zba instructions and similar.
>
I guess, an open question would be if a modified RISC-V variant could be made more performance-competitive with BJX2 without making too much of a mess of things.
>
I could maybe do so, but probably no one would be interested.
>
>
>
Though, looking online, seems I am really the only one calling them "jumbo prefixes". Not sure if there is another more common term used for these things.
>
>----->>>
>
For the jumbo prefix:
Recognize that is a jumbo prefix;
Inform the decoder for the following instruction of this fact
(via internal flag bits);
Provide the prefix's data bits to the corresponding decoder.
>
Unlike a "real" instruction, a jumbo prefix does not need to provide
behavior of its own, merely be able to be identified as such and to
provide payload data bits.
>
>
For now, there are not any encodings larger than 96 bits.
Partly this is because 128 bit fetch would likely add more cost and
complexity than it is worth at the moment.
For your implementation, yes. For all others:: <at best> maybe.
Maybe.
>
I could maybe consider widening fetch/decode to 128-bits if there were a compelling use case.
>
>>
>>>>>>
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
(there does seem to be some interest for ELF FDPIC but limited to
32-bit
RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
from PBO (namely, using GP for a global section and then chaining the
sections for each binary).
How are you going to do dense PIC switch() {...} in RISC-V ??
Already implemented... With pseudo-instructions:
SUB Rs, $(MIN), R10
MOV $(MAX-MIN), R11
BGTU R11, R10, Lbl_Dfl
>
MOV .L0, R6 //AUIPC+ADD
SHAD R10, 2, R10 //SLLI
ADD R6, R10, R6
JMP R6 //JALR X0, X6, 0
>
.L0:
BRA Lbl_Case0 //JAL X0, Lbl_Case0
BRA Lbl_Case1
...
Compared to::
// ADD Rt,Rswitch,#-min
JTT Rt,#max
.jttable min, ... , max, default
adder:
>
The ADD is not necessary if min == 0
>
The JTT instruction compares Rt with 0 on the low side and max
on the high side. If Rt is out of bounds, default is selected.
>
The table displacements come in {B,H,W,D}, selected in the JTT
(jump through table) instruction. Rt indexes the table; its
signed value is <<2 and added to an address, which happens to be the
address of the JTT instruction + #(max+1)<<entry. {{The table is
fetched through the ICache with execute permission.}}
>
Thus, the table is PIC; and generally 1/4 the size of typical
switch tables.
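A rough model of the dispatch described above, assuming byte-sized table entries: an out-of-range index selects the default slot, and the fetched signed entry is scaled by 4 and added to a PC-relative base. The base address and entry packing here are guesses for illustration, not My 66000's exact definition.

```c
#include <stdint.h>

/* JTT-style dispatch sketch with byte entries: bounds-check the
   index (out of range -> default slot past the last case), fetch
   the signed displacement, scale by 4, add to a PC-relative base. */
uint64_t jtt_target(uint64_t pc_of_jtt, const int8_t *table,
                    int64_t rt, int64_t max)
{
    int64_t idx = (rt < 0 || rt > max) ? max + 1 : rt; /* default slot */
    return pc_of_jtt + ((int64_t)table[idx] << 2);
}
```

Since the entries are displacements rather than absolute addresses, the table is position-independent by construction, and byte entries give the ~1/4 size claim above.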
>
For Q+ indirect jump the values loaded from the table replace the low order bits of the PC instead of being a displacement. Only {W,T,O} are supported. (W=wyde,T=tetra,O=octa). Should add an option for displacements. Borrowed the memory indirect jump from the 68k.
>
Though, in BGBCC, use of branch-tables was subject to some constraints:
Between 8 and 256 entries;
More than 75% full over the covered range.
Less than 8:
It decays into an "if(val==case_tag)goto case_lbl;" chain;
Chain is terminated by a "goto lbl_default;"
More than 256 or less than 75% full:
It is binary subdivided;
Each half is then subject to the same evaluation.
So, for large or sparse "switch()" blocks, it approaches O(log2 n).
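The heuristic above can be summarized as a small decision function. This is my reading of the stated rules (8..256 entries and >75% density for a table; otherwise chain or subdivide), not BGBCC's actual code:

```c
#include <stdint.h>

/* Sketch of the switch-lowering heuristic described above: small
   case counts decay to a compare chain; dense, bounded ranges get
   a branch table; everything else is split in half and each half
   re-evaluated, giving O(log2 n) dispatch for sparse switches. */
enum lower_kind { LOWER_CHAIN, LOWER_TABLE, LOWER_SPLIT };

enum lower_kind choose_lowering(int64_t min_tag, int64_t max_tag,
                                int64_t n_cases)
{
    int64_t span = max_tag - min_tag + 1;   /* covered range */
    if (n_cases < 8)
        return LOWER_CHAIN;                  /* if/goto chain + default */
    if (span <= 256 && n_cases * 4 > span * 3)
        return LOWER_TABLE;                  /* >75% full, <=256 entries */
    return LOWER_SPLIT;                      /* binary subdivide */
}
```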
I do remember at one point (some time last decade) managing to crash Clang with code containing a large "switch()" block. I forget how big the switch was now, but IIRC it was well into the thousands of case labels (and apparently Clang had used a std::array or similar, which freaked out when its hard size limit was hit).
Then again, not necessarily like BGBCC would do well here either. As-is, it has a limit of only allowing a given local variable to be assigned fewer than 4096 times in a given function.
And, probably at least something is going to break if one throws a 10k+ case "switch()".
Though, my coding styles have mostly moved away from using things like large procedural generated switch blocks and similar (tends to result in slow build times and lots of bloat).
This sort of thing killed off my "FRVM" project, but FRVM was the direct predecessor to some of the technology used in my BJX1 and BJX2 projects (the VMs use a vaguely similar strategy albeit absent the procedural code generation parts). Parts of the FRVM backend served as the basis for much of the structure that became my current BGBCC backends.
Well, apart from the architectural wonk: I started working to eliminate RIL, but then soon found that I still needed RIL, so the middle stages of the compiler are kinda conjoined in an ugly way.
Well, partly a similar sort of mess also existing in my backend, as initially I thought I wasn't going to need an assembler.
Well, and effectively BJX2 and RISC-V now awkwardly exist within the same compiler backend (which was originally written to target SH4).
>>>>----->
Potentially it could be more compact.
Both more compact and just as fast; many times faster.
Something like this might be worth considering.
>
Would likely be a pretty useful instruction for something like a bytecode interpreter, which would be more sensitive to the performance of things like "switch()", ...
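To make the interpreter point concrete, here is a toy bytecode loop (opcodes invented for the example): every opcode executes one trip through the switch, so whatever lowering the compiler picks for that switch (dense table, JTT-style instruction, or compare chain) sits directly on the hot path.

```c
#include <stdint.h>
#include <stddef.h>

/* Minimal stack-machine interpreter: the dispatch switch runs once
   per opcode, which is why switch() lowering dominates its speed. */
enum { OP_PUSH, OP_ADD, OP_HALT };

int64_t run(const uint8_t *code)
{
    int64_t stack[64];
    size_t sp = 0;
    for (size_t pc = 0;;) {
        switch (code[pc++]) {
        case OP_PUSH: stack[sp++] = code[pc++];       break;
        case OP_ADD:  sp--; stack[sp-1] += stack[sp]; break;
        case OP_HALT: return stack[sp-1];
        }
    }
}
```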
>
>