On 3/1/2025 7:02 PM, MitchAlsup1 wrote:
On Sat, 1 Mar 2025 22:29:27 +0000, BGB wrote:
On 3/1/2025 5:58 AM, Anton Ertl wrote:
Lawrence D'Oliveiro <ldo@nz.invalid> writes:
------------------------------
Would likely need some new internal operators to deal with bit-array
operations and similar, with bit-ranges allowed as a pseudo-value type
(may exist in constant expressions but will not necessarily exist as an
actual value type at runtime).
Say:
val[63:32]
Has the (63:32) as a BitRange type, which then has special semantics
when used as an array index on an integer type, ...
Mc 88K and My 66000 both have bit-vector operations.
OK.
I didn't previously.
But, use-cases have started to appear.
The previous idea for bitfield extract/insert had turned into a
composite BITMOV instruction that could potentially do both operations
in a single instruction (along with moving a bitfield directly between
two registers).
Using CARRY and extract + insert, one can extract a field spanning
a doubleword and then insert it into another pair of doublewords.
1 pseudo-instruction, 2 actual instructions.
The idea here is that it would do, essentially, a combination of a shift and a
masked bit-select, say:
Low 8 bits of immediate encode a shift in the usual format:
Signed 8-bit shift amount, negative is right shift.
High bits give a pair of bit-offsets used to compose a bit-mask.
These will MUX between the shifted value and another input value.
You want the offset (a 6-bit number) and the size (another 6-bit number)
in order to identify the field in question.
It is 8 bits partly because this is what the existing shifter uses.
This can also deal with up to 128 bits (-128 .. 127).
Don't necessarily want to have different encodings for 64 and 128 bit variants.
It would represent a shift of (DestOffset-SrcOffset), where:
For insert it will be positive, for extract, negative.
Keeping this part as-is means that the operation doesn't need to fundamentally change the behavior of the SHAD unit (it will just do the shift as normal).
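As a rough sketch of the (DestOffset-SrcOffset) rule in C: the helper name here is made up for illustration (it is not from any actual toolchain), and the range check assumes the signed 8-bit shift field discussed above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: compute the signed shift for moving a bitfield
 * from bit offset src_off to bit offset dst_off.
 * Positive = left shift (insert-like), negative = right shift (extract-like).
 * Returns false if the result does not fit a signed 8-bit field. */
static bool bitmov_shift(unsigned src_off, unsigned dst_off, int8_t *shift_out)
{
    int shift = (int)dst_off - (int)src_off;
    if (shift < -128 || shift > 127)
        return false;
    *shift_out = (int8_t)shift;
    return true;
}
```

So inserting a field from bit 0 at bit 12 gives +12, and extracting a field at bit 8 down to bit 0 gives -8.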
If I used bare 6 bit fields:
I couldn't do both extract and insert using the same operation;
It couldn't directly perform 128-bit extract or insert, which needs at least 7 bits.
Granted, full 8 bit for all the fields is possibly overkill.
Though, this leaves possibly 8+7+7.
Or, 7+6+6 is limiting to 64-bits only.
But, would need to special-case the handling, as Bit(7) is effectively used as the shift direction:
00..3F: Left, 0..63 bits
40..7F: Also Left, 0..63 bits.
80..BF: Right, 64..1 bits.
C0..FF: Also Right, 64..1 bits.
Except for 128 bit:
00..7F: Left, 0..127 bits
80..FF: Right, 128..1 bits.
And, 32 bits:
00..1F: Left, 0..31 bits
20..3F: Also Left, 0..31 bits
...
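For the 64-bit case, my reading of the table above could be modeled in C roughly as follows. The function name is made up for illustration, and it does a logical (zero-filling) shift, which is an assumption; an arithmetic shift would fill with the sign bit instead.

```c
#include <stdint.h>

/* Sketch of the 64-bit shift-amount decode (hypothetical name).
 * Bit 7 of the 8-bit amount selects direction; only the low 6 bits
 * contribute to the distance, so 00..3F and 40..7F behave the same. */
static uint64_t shift64_model(uint64_t v, int8_t sh)
{
    unsigned raw = (uint8_t)sh;   /* view the amount as 00..FF */
    unsigned low = raw & 63u;     /* low 6 bits */

    if (!(raw & 0x80u))
        return v << low;          /* 00..7F: left by 0..63 */

    unsigned amt = 64u - low;     /* 80..FF: right by 64..1 */
    /* A right shift by 64 is undefined in C, so handle it explicitly. */
    return (amt == 64u) ? 0 : (v >> amt);
}
```

With this, an amount of -1 (0xFF) is a right shift by 1, and -64 (0xC0) is a right shift by 64.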
For the right-shift operators, the sign is inverted in hardware (these existed initially mostly to avoid needing to negate the input for a variable right shift).
For RISC-V mode, it still uses this behavior, but generally code doesn't notice (a more strict interpretation of the RV spec would require masking off Bit(7) for the shift amount, such that giving them negative amounts wouldn't flip the shift direction).
Though, AFAIK, my existing behavior is closer to the original PDP/VAX shift operators...
As-is, decoding rules would have:
JumboImm+3RI: Gives Imm33s with XG3, imm29s with XG1/XG2
JumboOp +3RI: Gives 4RI Imm11, or 3RI Imm17s
One consideration was to special-case SHLR.L and similar, such that JumboImm+SHLR could instead encode:
BITMOV Rs, Rp, Rn, Imm24
With SHLR.Q encoding a 128-bit BITMOVX.
But, debatable if this would be "actually a good idea".
I am still not sure whether this would make sense in hardware, but is
not entirely implausible to implement in the Verilog.
In the extract case, you have the shifter before the masker
In the insert case, you have the masker before the shifter
followed by a merge (OR). Both maskers use the size. Offset
goes only to the shifter.
I was thinking:
tmp=Rm<<Ro
mask=MASKGEN(H, L)
Rn=(Rp&(~mask))|(tmp&mask);
With a signed shift amount, this can do both insert and extract with the same logic.
Though, extract will require feeding a 0 into Rp.
MASKGEN(H, L):
H>L:
((1<<H)-1) & (~((1<<L)-1))
H<=L:
((1<<H)-1) | (~((1<<L)-1))
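To make the above concrete, here is a small C model of the shift + MASKGEN + MUX sequence for the 64-bit case. The function names are made up for illustration, H is treated as an exclusive upper bound (which is what the (1<<H)-1 form implies), and the shift helper is a simplification: it is logical and simply yields 0 for distances of 64 or more rather than modeling how the hardware wraps the amount.

```c
#include <stdint.h>

/* Bits 0..n-1 set; n >= 64 gives all ones (avoids an undefined 1<<64 in C). */
static uint64_t low_mask(unsigned n)
{
    return (n >= 64) ? ~UINT64_C(0) : ((UINT64_C(1) << n) - 1);
}

/* MASKGEN as written above: H > L selects bits L..H-1;
 * H <= L selects everything outside H..L-1 (all ones when H == L). */
static uint64_t maskgen(unsigned h, unsigned l)
{
    if (h > l)
        return low_mask(h) & ~low_mask(l);
    return low_mask(h) | ~low_mask(l);
}

/* Simplified signed logical shift: positive = left, negative = right. */
static uint64_t shift_signed(uint64_t v, int sh)
{
    if (sh >= 64 || sh <= -64)
        return 0;
    return (sh >= 0) ? (v << sh) : (v >> -sh);
}

/* Model of the proposed operation: shift Rm, then MUX with Rp under the mask. */
static uint64_t bitmov_model(uint64_t rm, uint64_t rp, int sh,
                             unsigned h, unsigned l)
{
    uint64_t tmp  = shift_signed(rm, sh);
    uint64_t mask = maskgen(h, l);
    return (rp & ~mask) | (tmp & mask);
}
```

Extract of bits 23..8 would then be bitmov_model(rm, 0, -8, 16, 0), and inserting the low 8 bits of rm at bits 19..12 of rp would be bitmov_model(rm, rp, 12, 20, 12).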
The H<=L could encode some other cases that don't directly correlate to bitfields, such as shifting most of the bits left or right but then inserting something non-moving in between (possibly from a different bitfield).
Generating H dynamically as L+W was considered, and could possibly save bits, but would increase the cost of the mask generation logic in this case.
Would likely be a 2 or 3 cycle operation, say:
EX1: Do a Shift and Mask Generation;
May reuse the normal SHAD unit for the shift;
Mask-Gen will be specialized logic;
EX2:
Do the MUX.
EX3:
Present MUX result as output (passed over from EX2).
I have done these in 1 cycle ...
This is pushing it though.
As-is, Shift is a 2 cycle operation. Mostly to keep timing from being tight.
Granted, I have noted that at present, timing is made a lot tighter (throughout most of the core), due to the SIMD unit.
If I disable the SIMD unit, I am suddenly left with around 2.5ns of slack... (vs otherwise sitting at around 0.4ns of slack).
But, then, OpenGL is slower (as, without the SIMD unit, FPU SIMD operators go from pipelined 3 cycles to stalling 10 cycles; but does save around 2 kLUT).
But, thinking about it, since the MUX is only a single level of LUTs, it could probably fit onto the end of the shift stage without too much issue.
The MaskGen operation can actually be done with lower latency than a shift: although it looks like a shift on paper, it can be implemented more cheaply.