( Was distracted with other stuff, mostly trying to find and fix bugs. )
On 3/27/2024 4:03 PM, MitchAlsup1 wrote:
BGB wrote:
On 3/26/2024 7:02 PM, MitchAlsup1 wrote:
>
Sometimes I see a::
>
CVTSD R2,#5
>
Where a 5-bit immediate (value = 5) is converted into 5.0D0 and placed in register R2 so it can be accessed as an argument in the subroutine call
to happen in a few instructions.
>
I had looked into, say:
FADD Rm, Imm5fp, Rn
Where, despite Imm5fp being severely limited, it had an OK hit rate.
Unpacking imm5fp to Binary16 being, essentially:
aee.fff -> 0.aAAee.fff0000000
Self correction: the field is aee.ff (5 bits, not 6), so the unpacking is: aee.ff -> 0.aAAee.ff00000000 (with A = !a)
realistically ±{0, 1, 2, 3, 4, 5, .., 31} only misses a few of the often
used fp constants--but does include 0, 1, 2, and 10. Also, realistically
the missing cases are the 0.5s.
This scheme had (rounded):
* 000xx: 3000, 3100, 3200, 3300 ( 0.125, 0.156, 0.188, 0.219)
* 001xx: 3400, 3500, 3600, 3700 ( 0.250, 0.313, 0.375, 0.438)
* 010xx: 3800, 3900, 3A00, 3B00 ( 0.500, 0.625, 0.750, 0.875)
* 011xx: 3C00, 3D00, 3E00, 3F00 ( 1.000, 1.250, 1.500, 1.750)
* 100xx: 4000, 4100, 4200, 4300 ( 2.000, 2.500, 3.000, 3.500)
* 101xx: 4400, 4500, 4600, 4700 ( 4.000, 5.000, 6.000, 7.000)
* 110xx: 4800, 4900, 4A00, 4B00 ( 8.000, 10.000, 12.000, 14.000)
* 111xx: 4C00, 4D00, 4E00, 4F00 (16.000, 20.000, 24.000, 28.000)
Of the options evaluated, this seemed to get the best general-case hit rate.
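As a rough illustration (a C sketch of the unpacking, not the actual decoder; bit layout inferred from the table above, with A taken as the complement of a):

    #include <stdint.h>

    // Sketch: unpack a 5-bit FP immediate "aee.ff" into a Binary16 bit
    // pattern, assuming the layout implied by the table above (sign always
    // positive, exponent = a,!a,!a,e,e; fraction = ff followed by zeros).
    uint16_t unpack_imm5fp(uint8_t imm5)
    {
        uint8_t a  = (imm5 >> 4) & 1;   // selects low/high exponent range
        uint8_t ee = (imm5 >> 2) & 3;   // two low exponent bits
        uint8_t ff = imm5 & 3;          // two high fraction bits

        uint8_t exp5 = (uint8_t)((a << 4) | ((a ? 0 : 3) << 2) | ee);
        return (uint16_t)((exp5 << 10) | (ff << 8));
    }

    // unpack_imm5fp(0x0C) == 0x3C00 (1.0), unpack_imm5fp(0x15) == 0x4500 (5.0)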
OTOH, can note that a majority of typical floating point constants can be represented exactly in Binary16 (well, excluding "0.1" or similar), so it works OK as an immediate format.
This allows a single 32-bit op to be used for constant loads (nevermind if one needs a 96 bit encoding for 0.1, or PI, or ...).
Mostly, a floating point immediate is available from a 32-bit constant
container. When accessed in a float calculation it is used as IEEE32; when accessed by a double calculation, IEEE32->IEEE64 promotion is performed in the constant delivery path. So, one can use almost any
floating point constant that is representable in float as a double
without eating cycles and while saving code footprint.
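(As an illustration, a C sketch of the check a compiler would presumably make here, i.e. whether a double constant survives the round trip through the 32-bit container exactly; names are mine, not My 66000 specifics:)

    #include <stdbool.h>

    // Sketch: does this double constant fit the IEEE32 container such that
    // the IEEE32->IEEE64 promotion reproduces it exactly?
    static bool fits_float_container(double d)
    {
        float f = (float)d;       // narrow to IEEE32
        return (double)f == d;    // promotion must give back the same value
    }

    // fits_float_container(1.5) is true; fits_float_container(0.1) is false,
    // so 0.1 would still need the larger constant encoding.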
>
Don't currently have the encoding space for this.
Could in theory pull off a truncated Binary32 as an Imm29s form, but not likely worth it. Would also require putting a converter in the ID2 stage, so not free.
In this case, the issue is more one of LUT cost to support these cases.
For FPU compare with zero, can almost leverage the integer compare ops, apart from the annoying edge cases of -0.0 and NaN leading to "not strictly equivalent" behavior (though, an ASM programmer could more easily get away with this). But, not common enough to justify adding FPU specific ops for this.
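(A quick C sketch of those edge cases, treating the float bits as an integer compare:)

    #include <stdint.h>
    #include <string.h>

    // Sketch: "strict equivalence" compare on the raw bit pattern, which is
    // what reusing the integer compare ops amounts to.
    static int bits_equal(double a, double b)
    {
        uint64_t ua, ub;
        memcpy(&ua, &a, sizeof ua);
        memcpy(&ub, &b, sizeof ub);
        return ua == ub;
    }

    // bits_equal(-0.0, 0.0) is 0, though IEEE says -0.0 == +0.0;
    // bits_equal(nan, nan)  is 1, though IEEE says NaN != NaN.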
>
Actually, the edge/noise cases are not that many gates.
a) once you are separating out NaNs, infinities are free !!
b) once you are checking denorms for zero, infinities become free !!
>
Having structured a Compare-to-zero circuit based on the fields in double:
you can compose the terms to do all signed and unsigned integers and get
a gate count; then the number of gates you add to cover all 10 cases of
floating point is a 12% gate-count increase over the simple integer version. Also
note:: this circuit is about 10% of the gate count of an integer adder.
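(For reference, the same split expressed in C: a behavioral sketch of compare-against-zero on the double's fields, not the actual circuit:)

    #include <stdint.h>
    #include <string.h>

    typedef enum { FCMP0_LT, FCMP0_EQ, FCMP0_GT, FCMP0_UNORDERED } Fcmp0;

    // Sketch: classify a double against zero purely from its fields; NaN and
    // the infinities fall out of the same exponent/fraction checks.
    static Fcmp0 fcmp_zero(double x)
    {
        uint64_t u; memcpy(&u, &x, sizeof u);
        int      sign = (int)(u >> 63);
        int      exp  = (int)((u >> 52) & 0x7FF);
        uint64_t frac = u & 0x000FFFFFFFFFFFFFull;

        if (exp == 0x7FF && frac != 0) return FCMP0_UNORDERED;  // NaN
        if (exp == 0x000 && frac == 0) return FCMP0_EQ;         // +0.0 / -0.0
        return sign ? FCMP0_LT : FCMP0_GT;  // denormals and infinities included
    }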
>
I could add them, but, is it worth it?...
Whether to add them or not is on you.
I found things like this to be more straw on the camel's back {where the camel collapses to a unified register file model.}
I already have a unified register file in my ISA (only the RISC-V mode treats it as two sub-spaces).
But, in this case:
BEQ/BNE/BGT/... almost work with floating-point values, apart from the annoying edge cases like -0 and similar.
If substituted for floating-point comparison, most things work as expected, except that Quake ends up with some minor physics glitches (*1) and similar (mostly due to the occasional appearance of -0, and it not comparing equal to +0 in this case).
*1: Most commonly in the form that sliding a bounding box along the ground will occasionally encounter seams where it gets stuck and can't slide across (so the player would need to jump over these seams).
In this case, it is more a question of encoding space than logic cost.
It is semi-common in FP terms, but likely not common enough to justify dedicated compare-and-branch ops and similar (vs the minor annoyance at the integer ops not quite working correctly due to edge cases).
My model requires about ½ the instruction count when processing FP
comparisons compared to RISC-V (big, no; around 5% in FP code and 0 elsewhere). Where it wins big is compare against a non-zero FP constant: My 66000 uses 2 instructions {FCMP, BB} whereas RISC-V uses 4 {AUIPC, LD, FCMP, BC}.
At present, options are:
FLDCH + FCMPxx (2R) + ( BT/BF | MOVT/MOVNT )
FCMPxx (2RI) + ( BT/BF | MOVT/MOVNT )
The latter currently requires the FpImm extension.
The integer ops currently have some additional denser cases:
Compare and Branch
Compare-Zero and Branch
Compare and Set (3R and small 3RI).
However, adding these cases for FP would eat encoding space, and floating-point compare is less common than integer compare.
Granted, it does look like FCMPGT is around 0.7% of the clock-cycle budget in GLQuake, so isn't too far down on the list. Would need to evaluate whether it is more commonly being used for branching or as a predicate or similar, and the relative amount of compare-with-zero vs comparing two values.
Can note that FCMPEQ is much further down the list, at 0.03%.
-----------------------
>
Seems that generally 0 still isn't quite common enough to justify having one register fewer for variables (or a designated zero register), but otherwise there is not much to justify trying to exclude the "implicit zero" ops from the ISA listing.
>
>
It is common enough, but there are lots of ways to get a zero where you want it for a return.
>
I think the main use case for a zero register is mostly that it allows using it as a special case for pseudo-ops. I guess, not quite the same if it is a normal GPR that just so happens to be 0.
Recently ended up fixing a bug where:
y=-x;
Was misbehaving with "unsigned int":
"NEG" produces a value which falls outside of UInt range;
But, "NEG; EXTU.L" is a 2-op sequence.
It had the EXT for SB/UB/SW/UW, but not for UL.
For SL, bare NEG almost works, apart from ((-1)<<31).
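(In C terms, the issue was roughly this; a sketch, not BGBCC's actual code:)

    #include <stdint.h>

    // Sketch: with 'unsigned int' held in a 64-bit register, a bare NEG of 1
    // gives 0xFFFFFFFFFFFFFFFF, which is outside the UInt range; the result
    // has to be zero-extended back to 32 bits ("NEG; EXTU.L").
    static uint64_t neg_uint_in_reg(uint64_t reg)
    {
        uint64_t t = (uint64_t)0 - reg;   // NEG
        return t & 0xFFFFFFFFu;           // EXTU.L
        // (for signed int, bare NEG is fine apart from negating (-1)<<31)
    }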
Could encode it as:
SUBU.L Zero, Rs, Rn
ADD Rn,#0,-Rs
But notice::
y = -x;
a = b + y;
can be performed as if it had been written::
y = -x;
a = b + (-x);
Which is encoded as::
ADD Ry,#0,-Rx
ADD Ra,Rb,-Rx
But, without a zero register,
#0 is not a register, but its value is 0x0000000000000000 anyway.
You missed the point entirely: if you can get easy access to #0,
then you no longer need a register to hold this simple bit pattern.
In fact, a large portion of My 66000's advantage over RISC-V comes from this
mechanism.
OK.
I did end up switching "-x" for Int and UInt over to ADDS.L and ADDU.L with the zero in a register. Noted that on-average, this is still faster than "NEG + EXTx.L" even in the absence of a zero register.
But, then ended up fighting with a bunch of compiler bugs, as some recent behavioral tweaks in the compiler had started stepping on a bunch of buggy cases:
Some instructions which were invalid as prefixes were being set as prefixes in the WEXifier;
The edge case for expanding Op32 to Op64 was mangling large immediate values (mostly related to cases dealing with encoding R32..R63 in Baseline mode);
The handling for "tiny leaf functions" was incorrectly allowing functions to be compiled as tiny-leaf functions in cases where they had a by-value structure type on the local frame (in this case, the local struct address would end up as "whatever value was already in the register", potentially stomping stuff elsewhere in memory, *1).
*1: TBD if this is related to some of the other memory-stomping/corruption bugs I had been seeing (which were sensitive to register allocator behavior). It is possible this was it, or they may yet still be other bugs that result in memory-stomping.
This sort of stuff has eaten up a lot of my time recently.
the compiler needs to special-case
The compiler needs easy access to #0 and the compiler needs to know
that #0 exists, but the compiler does not need to know if some register
contains that same bit pattern.
Conventionally, this was done with "pseudo-instructions", but for a pseudo instruction, one needs to know that zero exists in some location.
Otherwise, one can leave it up to the register allocator ("hey, give me a register that holds zero").
Though, looking at the compiler output, it still doesn't seem to realize that immediate 0 is 0 regardless of type. So, it is treating 0, 0U, 0LL, NULL, ... effectively as different values (and allocating them in different registers). May need to address this...
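(A possible fix, sketched with a hypothetical helper, keying constants on the bit pattern rather than the source type:)

    #include <stdint.h>

    // Sketch: look up an already-materialized constant by value, so that
    // 0, 0U, 0LL, NULL, ... all share one register instead of each getting
    // its own.
    typedef struct { uint64_t bits; int reg; } ConstSlot;

    static int find_const_reg(const ConstSlot *pool, int n, uint64_t bits)
    {
        for (int i = 0; i < n; i++)
            if (pool[i].bits == bits)   // key on the bits, not the C type
                return pool[i].reg;
        return -1;                      // not materialized yet
    }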
provision this (or, in theory, add a "NEGU.L" instruction, but doesn't seem common enough to justify this).
....
It is less bad than 32-bit ARM, where I only burnt 2 bits, rather than 4.
I burned 0 per instruction, but you can claim I burned 1 instruction PRED and
6.4 bits of that instruction are used to create masks that project upon up to
8 following instructions.
Yeah.
Say:
CMPxx OP1?T OP2?T OP3?F OP4?F
Is one less instruction word than, say:
CMPxx PRED OP1 OP2 OP3 OP4
And, as noted, most predication blocks tend to be short.
Though, one could argue, say (very rough stats):
Scalar: ~ 55%
WEX : ~ 17% (has gone up slightly)
Jumbo : ~ 15% (this is just the jumbo prefixes)
Pred : ~ 13% (6.5% each for ?T and ?F)
PrWEX : ~ 0.4%
Would, in terms of entropy, favor spending 0 or 1 bits per instruction on this.
But, spending 2 bits here allows the encoding space to be more orthogonal (at least in XG2). So, nearly any unconditional instruction can also be encoded in WEX'ed or predicated forms (there are a few large unconditional-only instructions, whose alternate forms are used to encode the Jumbo prefixes and PrWEX blocks).
Like, a lopsided encoding space would be less ideal even if "more optimal" in a Huffman sense...
Though, FWIW, 25% isn't *that* far off from the ~17% of ops flagged as WEX in this example; granted, one could argue that superscalar could deal with this.
Granted, potentially a clever CPU could also avoid needing to burn an extra clock-cycle on a PRED op or similar, or 2 cycles to perform an unconditional branch, ... But, a naive CPU will likely not be able to avoid these sorts of costs.
Granted, some past architectures were like, "well, an instruction would be there already..." and treated the cycle following the branch as a branch-delay slot. Had resisted that urge though, and had previously noted that if the delay slot would contain an instruction that triggers a pipeline interlock stall, this would foul up the pipeline and cause the branch-initiation process to fail (and the "fix" here would have been a lot more painful if the ISA actually had branch-delay slots, rather than just the would-be delay-slot instruction having failed to be "sufficiently ignored").
In Baseline mode, the encoding-space breakdown differs:
75%: 16-bit ops
12.5%: 32-bit ops (7x/9x), R32..R63,
Scalar ops only covering a subset of the ISA.
6.25%: Pred (Ex)
6.25%: Scalar and WEX (R0..R31 only, Fx).
Or, if no XGPR:
87.5%: 16-bit ops.
6.25%: Pred (Ex)
6.25%: Scalar and WEX (R0..R31 only, Fx).
Where the encoding of the Ex and Fx blocks is common between Baseline and XG2 (allowing for a common subset to exist).
Comparably, XG2 Mode, despite only 1/4 of the encoding space being used for Scalar ops, generally still ends up expanding some of the immediate fields and similar vs Baseline mode (and mostly avoiding the need to use Op64 encodings due to the more orthogonal encodings).
Also seems like a reasonable tradeoff, as the 2 bits effectively gain:
Per-instruction predication;
WEX / Bundle encoding;
Jumbo prefixes;
...
But, maybe otherwise could have justified slightly bigger immediate fields, dunno.
While poking at it, did go and add a check to exclude large struct-copy operations from predication, as it is slower to turn a large struct copy into NOPs than to branch over it.
>
Did end up leaving struct-copies where sz<=64 as allowed though (where a 64 byte copy at least has the merit of achieving full pipeline saturation and being roughly break-even with a branch-miss, whereas a 128 byte copy would cost roughly twice as much as a branch miss).
>
I decided to bite the bullet and have LDM, STM and MM so the compiler does
not have to do any analysis. This puts the onus on the memory unit designer
to process these at least as fast as a series of LDs and STs. Done right
this saves ~40% of the power of the caches by avoiding ~70% of tag accesses
and ~90% of TLB accesses. You access the tag only when/after crossing a line boundary, and you access the TLB only after crossing a page boundary.
OK.
In my case, it was more a case of noting that sliding over, say, 1kB worth of memory loads/stores, is slower than branching around it.
This is why My 66000 predication has use limits. Once you can get where you want faster with a branch, then a branch is what you should use.
I reasoned that my 1-wide machine would fetch 16 bytes (4 words) per
cycle and that the minimum DECODE time is 2 cycles, so predication
wins when the number of instructions <= FWidth × Dcycles = 8.
Use predication and save cycles by not disrupting the front end.
Use branching and save cycles by disrupting the front end.
In this case, where and how much to predicate is left up to the compiler frontend, and it doesn't need to know in advance exactly how many instructions will be emitted by the backend (but, does need to avoid anything which might potentially result in an implicit runtime call or similar).
In this case, the compiler decides this at the AST level, but there is an uncertainty how many CPU instructions would be covered by the AST (and the front-end hand-waves it mostly by counting AST nodes and using each AST node as a stand-in for an instruction; rejecting cases which have too many AST nodes).
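(Roughly, as a sketch with made-up structure/field names, not BGBCC's actual AST API:)

    // Sketch: walk the AST for the if-body, using the node count as a
    // stand-in for the instruction count, and bail out on any node that
    // might expand into a runtime call.
    typedef struct AstNode AstNode;
    struct AstNode {
        AstNode *kid[4];
        int      nkid;
        int      may_need_runtime_call;   // hypothetical flag
    };

    static int count_nodes_ok(const AstNode *n, int limit, int *total)
    {
        if (!n) return 1;
        if (n->may_need_runtime_call) return 0;   // e.g. big struct copy
        if (++(*total) > limit) return 0;         // too many AST nodes
        for (int i = 0; i < n->nkid; i++)
            if (!count_nodes_ok(n->kid[i], limit, total)) return 0;
        return 1;
    }

    static int can_predicate(const AstNode *body, int limit)
    {
        int total = 0;
        return count_nodes_ok(body, limit, &total);
    }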
But, had needed to fiddle with stuff a bit to fix up some edge cases that were popping up here:
A struct-copy may emit a much longer sequence than intended;
Some seemingly benign cases, like "ptr-ptr", were resulting in "evil" cases with some types (like "Foo*" where "sizeof(Foo)" is not a power of 2, resulting in an implicit integer division), so one needs to detect these cases and reject them (see the sketch after this list);
...
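(Small example of the "ptr-ptr" case, with an assumed 12-byte struct:)

    // Sketch: why "ptr - ptr" can hide a divide. With a 12-byte element,
    // the byte difference has to be divided by 12, which is not a power
    // of 2, so it is not just a shift.
    typedef struct { int a, b, c; } Foo;   // 12 bytes on typical ABIs

    static long foo_distance(Foo *p, Foo *q)
    {
        return (long)(p - q);              // byte difference / 12 under the hood
    }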
But, yeah, cases where the predication results in 100+ instructions, or tries to do something invalid (such as trying to call a runtime support function), are undesirable.
But, the logic here was a little lax in some of its validity checks.
Where operators are potentially very different depending on which types they are applied to, and in the AST we don't implicitly know the type (the compiler can try to figure it out, but it is not guaranteed).
Though, the predication handling can deal with the "What is the type of this expression? Hell if I know." cases by not predicating the blocks. FWIW, earlier on, BGBCC did not know any types at the AST level, and left type handling up to the RIL->3AC translation stage to sort this out (or, in difficult cases, the type might decay to "variant" and it could be sorted out at runtime using tagrefs; but I have since moved to full static typing, nevermind potential type system holes and similar in the frontend stages).
But, there is a performance cost, as any time one asks the type of an expression, it needs to walk this part of the AST and potentially lookup variables and similar, which isn't free.
But, some of this, while it works OK for C, would not fly so well with C++ (which would not be quite so tolerant of "Swiss cheese" typesystem semantics in the frontend stages). Did work better for some of my own language designs though (mostly built around hybrid static/dynamic systems; some things are a little easier when one can justify dealing with them by falling back to type-tagging and run-time type-checks, ...).
Something like your style of predication mechanism would likely need to be applied much later in the process (such as when applying branch fixups), rather than in the front-end and encoded via the IR stages.
Granted, with my scheme, trying to turn a branch into predication would leave a "hole" in the place where the branch op was (where a NOP or similar may need to be left), in which case, something like a "PRED" instruction wouldn't be a huge loss.
Though, I guess, logic for "hey, turn this short forward branch into predication" could be a possibility as well (and might hit some cases missed by the AST level predication logic, or which appeared spontaneously).
Would require logic to detect short forward branches and confirm that the instruction sequence following the branch does not contain any labels or similar though.
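(Something along these lines; a sketch with a hypothetical instruction record, where the real check would work on the emitted ops:)

    // Sketch: a short conditional forward branch can be turned into
    // predication only if nothing inside the skipped range is itself a
    // branch or a branch target, and the range is short enough.
    typedef struct {
        int is_cond_branch;   // conditional branch op?
        int target;           // index of its target, if a branch
        int is_label;         // is some other branch targeting this op?
    } OpRec;

    static int branch_to_pred_ok(const OpRec *ops, int i, int max_len)
    {
        if (!ops[i].is_cond_branch || ops[i].target <= i + 1) return 0;
        if (ops[i].target - (i + 1) > max_len) return 0;   // too long
        for (int j = i + 1; j < ops[i].target; j++)
            if (ops[j].is_label || ops[j].is_cond_branch) return 0;
        return 1;
    }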
>
All I had to do was to make the second predication overwrite the first
predication's mask, and the compiler did the rest.
Not so simple in my case, but the hardware is simpler, since it just cares about the state of 1 bit (which is explicitly saved/restored along with the rest of the status-register if an interrupt occurs).
Simpler than 8 flip-flops used as a shift-right register??
It would need to be captured and preserved correctly during interrupts (and remain correctly aligned with the instruction stream when returning from an interrupt, etc).
This would make it fundamentally different, say, than the shift-register type logic used to deal with a pipeline flush during a branch (where one really does not care what happens if an interrupt happens here; only that the flushed stages do not execute).
Granted, in my case, one does need logic to detect, say, if ID2 contains a predicated instruction, and EX1 contains an instruction that will update the SR.T flag.
Logic for predication handling is otherwise something like:
    casez( { opBraFlush, opUCmd[7:6], regInSr[0] } )
        4'b000z: tOpEnable = 1;  // always
        4'b001z: tOpEnable = 0;  // never (internal use by CPU)
        4'b0100: tOpEnable = 0;  // ?T and F
        4'b0101: tOpEnable = 1;  // ?T and T
        4'b0110: tOpEnable = 1;  // ?F and F
        4'b0111: tOpEnable = 0;  // ?F and T
        4'b1zzz: tOpEnable = 0;  // flushed stage
    endcase
Where, if tOpEnable is 0, the opcode (in opUCmd[5:0]) is replaced with NOP, unless it is a branch and !opBraFlush, in which case it is replaced with a "NO_BRANCH" (where, if the branch-predictor predicted a branch, it initiates a branch to the following instruction; unlike "BRA", which initiates a branch to the target if the branch-predictor did not predict the branch as taken).
Where, generally, the branch-flush mask is updated on the transition to the following cycle (and the flush flag corresponding to the IF stage may also be set by the branch-predictor if it predicts a branch having been taken).
...