On 2024-09-29 10:19 p.m., BGB wrote:
Interrupts also don't apply to prefixes either, if one assumes that the prefix and following instruction are always decoded at the same time (forming a 64-bit instruction), which also makes them faster.
On 9/29/2024 2:11 PM, MitchAlsup1 wrote:
One reason I prefer postfix immediates. They are much easier to work with. Interrupts do not cause issues. The instruction plus postfix can be faked to be treated as one giant instruction. The bits following the instruction are often already present on the cache line. It is just a matter of checking for a postfix when decoding the immediate constants.
On Sat, 28 Sep 2024 4:30:12 +0000, BGB wrote:
>
> On 9/27/2024 7:43 PM, MitchAlsup1 wrote:
>> On Fri, 27 Sep 2024 23:53:22 +0000, BGB wrote:
>
>
One of the reasons reservation stations came into vogue.
>
Possibly, but that is a CPU feature rather than a compiler feature...
A good compiler should be able to make use of 98% of the instruction
set.
Yes, but a reservation station is not part of the ISA proper...
>
>>>------------>>
Saw a video not too long ago where he was making code faster by undoing
a lot of loop unrolling, as the code was apparently losing more to I$
misses than it was gaining by being unrolled.
I noticed this in 1991 when we got Mc88120 simulator up and running.
GBOoO chips are <nearly> best served when there is the smallest number
of instructions.
>
Looking it up, seems the CPU in question (MIPS R4300) was:
16K L1 I$ cache;
8K L1 D$ cache;
No L2 cache (but could be supported off-die);
1-wide scalar, 32 or 64 bit
Non-pipelined FPU and multiplier;
...
>
>
Oddly, a number of these older CPUs seem to have a larger I$ than D$, whereas IME the D$ tends to have the higher miss rate (so it is easier to justify making it bigger).
>
>>------------>
>
In contrast, a jumbo prefix by itself does not make sense; its meaning
depends on the thing that is being prefixed. Also, the decoder will
decode a jumbo prefix and the suffixed instruction at the same time.
How many bits does one of these jumbo prefixes consume ?
The prefix itself is 32 bits.
In the context of XG3, it supplies 23 or 27 bits.
>
>
For RISC-V ops, they could supply 21 or 26 bits.
>
23+10 = 33 (XG3)
21+12 = 33 (RV op)
27+27+10 = 64 (XG3)
26+26+12 = 64 (RV op)
>
J27 could synthesize an immediate for non-immediate ops:
27+6 = 33 (XG3)
27+5 = 32 (RV)
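These widths compose by simple concatenation. A minimal C sketch of the 21+12 RV case, assuming the prefix payload supplies the high bits and the instruction's own 12-bit immediate the low bits (the function name and packing here are illustrative, not an actual encoding):

```c
#include <stdint.h>

/* Hypothetical sketch: widening a RISC-V I-type immediate with a
   21-bit jumbo-prefix payload.  The prefix bits form the high part,
   the instruction's 12-bit immediate the low part: 21+12 = 33 bits. */
static int64_t widen_imm(uint32_t prefix_payload21, uint32_t insn_imm12)
{
    uint64_t raw = ((uint64_t)(prefix_payload21 & 0x1FFFFF) << 12)
                 | (insn_imm12 & 0xFFF);
    /* sign-extend the 33-bit result from bit 32 */
    return (int64_t)(raw << 31) >> 31;
}
```

With all 33 bits set this yields -1; with a zero prefix it degenerates to the plain 12-bit immediate, which is one reason the prefix can stay orthogonal to the base encoding.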
>
>
For BJX2, the prefixes supply 24 bits (can be stretched to 27 bits in XG2).
24+ 9/10=33 (Base)
24+24+16=64 (Base)
27+27+10=64 (XG2)
>
>
>
But, yeah, perhaps unsurprisingly, the RISC-V people are not so optimistic about the idea of jumbo prefixes...
>
>
Also apparently it seems "yeah, here is a prefix whose primary purpose is just to make the immediate field bigger for the following instruction" is not such an obvious or intuitive idea as I had thought.
>
>
Well, and people obsessing over what happens if an interrupt somehow occurs "between" the prefix and the prefixed instruction.
>
Q+ had postfixes that could override a register spec. as well as supply additional constant bits. If an interrupt occurs between the instruction and the postfix, the postfix can be treated as a NOP at the return point.
Note: Q+ was switched to not using postfixes. Almost everything is single 64-bit instructions now. 64-bit constants can be encoded with just two instructions.
One possibility, though it wouldn't likely be so good for code-density.
For double-branch tables, it is mostly constant-time.
Which, as I have tended to implement them, is simply not possible, since everything is fetched and decoded at the same time.
How well does JTT work with large tables? What if there are several hundred table entries?
>
>
Granted, yes, it does add the drawback of needing to have tag-bits to remember the mode, and maybe the CPU hiding mode bits in the high order bits of the link register and similar is not such an elegant idea.
>
>
But, as I see it, still preferable to:
Hey, why not just define a bunch of 48-bit encodings for ALU operations with 32-bit immediate fields?...
>
>
But, like, blarg, this is what I did originally.
And, I dropped all this in favor of jumbo prefixes, because jumbo prefixes did this job better.
>
>
>
Might still experiment with an "Extended RISC-V" and see if, in fact, adding things like jumbo prefixes makes as much of a difference as I expect.
>
Well, probably along with indexed load/store and Zba instructions and similar.
>
I guess, an open question would be if a modified RISC-V variant could be made more performance-competitive with BJX2 without making too much of a mess of things.
>
I could maybe do so, but probably no one would be interested.
>
>
>
Though, looking online, seems I am really the only one calling them "jumbo prefixes". Not sure if there is another more common term used for these things.
>
>----->>>
>
For the jumbo prefix:
Recognize that it is a jumbo prefix;
Inform the decoder for the following instruction of this fact
(via internal flag bits);
Provide the prefix's data bits to the corresponding decoder.
>
Unlike a "real" instruction, a jumbo prefix does not need to provide
behavior of its own, merely be able to be identified as such and to
provide payload data bits.
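In C-ish terms, the three steps above might look something like this (the opcode mask, match value, and payload position are placeholders, not the real XG3/BJX2 encodings):

```c
#include <stdint.h>

/* Placeholder encodings, for illustration only. */
#define JUMBO_MASK  0x7Fu   /* hypothetical major-opcode field  */
#define JUMBO_MATCH 0x1Fu   /* hypothetical jumbo-prefix opcode */

typedef struct {
    int      jumbo_valid;   /* internal flag handed to the next decoder */
    uint32_t jumbo_bits;    /* payload data bits from the prefix        */
} decode_state_t;

static int is_jumbo(uint32_t word) {
    return (word & JUMBO_MASK) == JUMBO_MATCH;  /* step 1: recognize */
}

static void decode_word(decode_state_t *st, uint32_t word) {
    if (is_jumbo(word)) {
        st->jumbo_valid = 1;                    /* step 2: inform    */
        st->jumbo_bits  = word >> 7;            /* step 3: payload   */
    } else {
        /* ...normal decode; consumes st->jumbo_bits when jumbo_valid... */
        st->jumbo_valid = 0;
    }
}
```

The prefix contributes no behavior of its own here; it only sets a flag and hands bits sideways, which is the whole point.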
>
>
For now, there are not any encodings larger than 96 bits.
Partly this is because 128 bit fetch would likely add more cost and
complexity than it is worth at the moment.
For your implementation, yes. For all others:: <at best> maybe.
Maybe.
>
I could maybe consider widening fetch/decode to 128-bits if there were a compelling use case.
>
>>
>>>>>>
Likewise, no one seems to be bothering with 64-bit ELF FDPIC for RV64
(there does seem to be some interest for ELF FDPIC but limited to
32-bit
RISC-V ...). Ironically, ideas for doing FDPIC in RV aren't too far off
from PBO (namely, using GP for a global section and then chaining the
sections for each binary).
How are you going to do dense PIC switch() {...} in RISC-V ??
Already implemented... With pseudo-instructions:
SUB Rs, $(MIN), R10
MOV $(MAX-MIN), R11
BGTU R11, R10, Lbl_Dfl
>
MOV .L0, R6 //AUIPC+ADD
SHAD R10, 2, R10 //SLLI
ADD R6, R10, R6
JMP R6 //JALR X0, X6, 0
>
.L0:
BRA Lbl_Case0 //JAL X0, Lbl_Case0
BRA Lbl_Case1
...
Compared to::
// ADD Rt,Rswitch,#-min
JTT Rt,#max
.jttable min, ... , max, default
adder:
>
The ADD is not necessary if min == 0
>
The JTT instruction compares Rt with 0 on the low side and max
on the high side. If Rt is out of bounds, default is selected.
>
The table displacements come in {B,H,W,D} sizes, selected in the JTT
(jump through table) instruction. Rt indexes the table; the selected
entry's signed value is <<2 and added to an address, which happens to
be the address of the JTT instruction + #(max+1)<<entry. {{The table
is fetched through the ICache with execute permission.}}
>
Thus, the table is PIC; and generally 1/4 the size of typical
switch tables.
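As a rough C model of the bounds check and displacement lookup described above (the function and its 'anchor' parameter are illustrative stand-ins; entry size is fixed at a word here, and the default entry is taken to sit at index max+1):

```c
#include <stdint.h>

/* Rough model of JTT target selection: clamp the index against
   [0, max] (out of bounds selects the default entry at max+1),
   then add the entry's signed value <<2 to the anchor address. */
static uint64_t jtt_target(uint64_t anchor, const int32_t *table,
                           int64_t rt, int64_t max)
{
    /* compare Rt with 0 on the low side and max on the high side */
    int64_t idx = (rt < 0 || rt > max) ? max + 1 : rt;

    /* the selected entry's signed value is <<2 and added */
    return anchor + ((int64_t)table[idx] << 2);
}
```

Because the stored values are word-granular displacements rather than absolute addresses, the table stays PIC and a byte or halfword entry often suffices, which is where the roughly 4x size saving comes from.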
For the Q+ indirect jump, the values loaded from the table replace the low-order bits of the PC instead of being a displacement. Only {W,T,O} are supported (W=wyde, T=tetra, O=octa). Should add an option for displacements. Borrowed the memory-indirect jump from the 68k.
>>----->
Potentially it could be more compact.
Both more compact and just as fast; many times faster.
Something like this might be worth considering.
>
Would likely be a pretty useful instruction for something like a bytecode interpreter, which would be more sensitive to the performance of things like "switch()", ...
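For illustration, the hot path of such an interpreter is a dense switch over the opcode byte; this is exactly the dispatch pattern a JTT-style instruction could collapse into one bounds-checked table jump (toy opcodes, not any particular bytecode):

```c
#include <stdint.h>

/* Toy stack-machine interpreter: the per-opcode switch below is the
   dense-dispatch pattern under discussion. */
enum { OP_HALT, OP_PUSH, OP_ADD };

static int64_t run(const uint8_t *code)
{
    int64_t stack[64];
    int sp = 0;
    for (const uint8_t *pc = code;;) {
        switch (*pc++) {                 /* dense switch -> jump table */
        case OP_PUSH: stack[sp++] = *pc++;             break;
        case OP_ADD:  sp--; stack[sp-1] += stack[sp];  break;
        case OP_HALT: return stack[sp-1];
        default:      return -1;         /* out-of-range opcode */
        }
    }
}
```

Every dispatched opcode pays the bounds check plus the indirect jump, so folding those into one constant-time instruction shows up directly in the inner loop.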
>
>