On 10/16/2024 1:07 PM, Stephen Fuld wrote:
> On 10/16/2024 8:59 AM, Paul A. Clayton wrote:
> snip
>> I do not know of any enumeration of conditions that would be
>> commonly useful. Less than, equal to, greater than might be
>> somewhat useful for a three-way branch.
> That was the function of the arithmetic if statement in original Fortran. If it were more useful, it wouldn't have been taken out of the language long ago.
Yeah...
Ironically, one of the main arguable use-cases for old Fortran style IF statements is implementing the binary dispatch logic in a binary subdivided "switch()", but not enough to justify having a dedicated instruction for it.
Say:
MOV Imm, Rt //pivot case
BLT Rt, Rx, .lbl_lo
BGT Rt, Rx, .lbl_hi
BRA .lbl_case
But, absent having multiple labels per branch, not really a good way to save much over this...
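As a rough illustration of the binary-subdivided "switch()" idea above, here is a minimal Python sketch. The names `three_way` and `binary_dispatch` are hypothetical, and the single three-way compare merely stands in for the BLT/BGT/BRA sequence; this is not how any particular compiler emits it.

```python
def three_way(pivot, x):
    """Model of one compare giving -1, 0, or +1 (x relative to pivot)."""
    return (x > pivot) - (x < pivot)

def binary_dispatch(cases, x, default=None):
    """Dispatch x over a sorted list of (case_value, target) pairs.

    Each loop iteration corresponds to one pivot node: go to the lower
    half, the upper half, or the matching case.
    """
    lo, hi = 0, len(cases) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        pivot, target = cases[mid]
        c = three_way(pivot, x)
        if c < 0:        # x < pivot: continue in the lower half
            hi = mid - 1
        elif c > 0:      # x > pivot: continue in the upper half
            lo = mid + 1
        else:            # x == pivot: this is the case label
            return target
    return default       # no case matched
```

For N case labels this takes roughly log2(N) pivot nodes per dispatch, each of which is the four-instruction pattern shown above.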
Otherwise, had recently been still working on BGBCC+RV stuff:
Trying to get stuff working correctly in my Verilog implementation.
There are still some bugs here.
Writing a spec for a "low-cost" FPU SIMD extension:
https://pastebin.com/9UeAP9Yk
Which basically just takes the arguably cheaper route of "extend the F, D, and Zfh extensions to support basic FPU-SIMD in the existing FPRs" rather than "define a whole new complicated mess of stuff" that is the V extension.
Some details are still in-flux, and I have not yet decided whether or not to map over the FP8 converter ops and similar. Arguably FP8 and A-Law converter ops are a bit niche though.
As well as looking some at the P spec, which (ignoring the needlessly complicated parts) isn't too far from what BJX2 does SIMD wise (albeit lacks obvious direct equivalents of the RGB555 helper instructions; but possibly using SIMD to work with RGB555 pixel data is a bit niche).
It is possible that, if I add some of this, I may do it as jumbo-prefix-only ops. One is unlikely to see RGB555 or FP8 converters used in any significant density (except maybe if doing highly-unrolled NN code using FP8 or similar; but it is unclear if it would make sense to map this over to RV anyways; and existing people trying to do stuff in this area appear to be mostly focused on the V extension).
For normal graphical or audio processing, having these sorts of niche converters as 64-bit encodings would probably be fine.
As-is, it could do a 4x32 shuffle in 2 instructions, but would need either a 4-op sequence (no jumbo), or a jumbo-encoded op, to perform a 4x16 shuffle (it is either this or define a dedicated "FPSHUF.H" instruction or similar). Can probably assume that, if it matters, the implementation will also have jumbo prefixes.
May still need to decide on some other things, like whether to map over a jumbo-encoded 4xFP8 to 4xFP16 constant-load. Or, whether to come up with an encoding to load an arbitrary 64-bit value into an FPR (currently N/E in RV64 mode).
As-is:
J22+J22+LUI : LI Xn, Imm64
J22+J22+AUIPC: Unused
J22+J22+JAL : Unused, Possible "JAL Rn, Abs64"
For FPRs, it may make sense to have:
Load Binary16, expanding to Binary64 (already in Jumbo spec)
Load Imm33s into low-order bits (Jumbo spec, J12O+LUI)
Load Imm32 into high-order bits
Possible, not yet defined, already exists in BJX2 (1).
Load Imm32 as 2xFP16 expanding to 2xFP32
Possible, not yet defined, already exists in BJX2 (1).
Load Imm32 as 4xFP8 expanding to 4xFP16
Possible, not yet defined, already exists in BJX2 (1).
*1: Probably could define it as J12O+LUI, using the Wm and Wo register-extension bits to encode which type of constant to load (basically about the same as how I did it in BJX2; just it had used a J_OP and "MOV Imm16u, Rn" instruction instead, but similar basic idea here).
Probably, say:
00: Load Imm33s to low 32-bits, sign-extend as usual
01: Load Imm32 to high 32-bits (sign bit used for LSB fill, *2)
10: 2xFP16 -> 2xFP32
11: 4xFP8 -> 4xFP16
*2: Though in BJX2, this case was encoded as J_IMM+"LDIHI Imm10, Rn"
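A minimal Python sketch of what the 2-bit selector could do, under several assumptions that are not settled in the post: the immediate is treated as a 33-bit field (bit 32 = sign, bits 31..0 = payload), "LSB fill" is read as filling the low 32 bits with copies of the sign bit, and the low FP16 element lands in the low 32 bits. `load_const` and `half_bits_to_single_bits` are hypothetical names. The 4xFP8 case is left out because the FP8 format is not given here.

```python
import struct

MASK64 = (1 << 64) - 1

def half_bits_to_single_bits(h):
    """Binary16 bit pattern -> Binary32 bit pattern of the same value."""
    value = struct.unpack('<e', struct.pack('<H', h & 0xFFFF))[0]
    return struct.unpack('<I', struct.pack('<f', value))[0]

def load_const(mode, imm33):
    """Return the 64-bit register image for a given selector and immediate."""
    imm33 &= (1 << 33) - 1
    sign = (imm33 >> 32) & 1
    imm32 = imm33 & 0xFFFFFFFF
    if mode == 0b00:
        # Sign-extend the 33-bit value to 64 bits.
        return (imm33 - (1 << 33) if sign else imm33) & MASK64
    if mode == 0b01:
        # Imm32 into the high half; low half filled from the sign bit
        # (assumed reading of "sign bit used for LSB fill").
        fill = 0xFFFFFFFF if sign else 0
        return (imm32 << 32) | fill
    if mode == 0b10:
        # 2xFP16 -> 2xFP32; element order is an assumption.
        lo = half_bits_to_single_bits(imm32 & 0xFFFF)
        hi = half_bits_to_single_bits(imm32 >> 16)
        return (hi << 32) | lo
    # 0b11 (4xFP8 -> 4xFP16) depends on an FP8 layout not specified here.
    raise NotImplementedError("FP8 format not specified")
```

For example, `load_const(0b01, 0x3FF00000)` would give the Binary64 pattern for 1.0, and `load_const(0b10, 0x40003C00)` would give the pair (1.0f, 2.0f) with 1.0f in the low half.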
Could maybe be tempted to reclaim "J22+J22+AUIPC" as:
LI Fn, Imm64
Arguing that, if one needs PC-rel, +/- 4GB is sufficient; and one is far more likely to want to be able to load constants like M_PI and similar into an FPU register (in a single clock-cycle).
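To make the M_PI point concrete, here is a small Python sketch that checks whether a Binary64 constant could be loaded via the shorter "Load Binary16, expanding to Binary64" form or would need a full Imm64. The helper names `double_bits` and `fits_binary16` are hypothetical, and the round-trip test is just one plausible way to decide.

```python
import math
import struct

def double_bits(x):
    """Binary64 bit pattern of x, as an unsigned 64-bit integer."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def fits_binary16(x):
    """True if x survives a round trip through Binary16 unchanged."""
    try:
        h = struct.unpack('<e', struct.pack('<e', x))[0]
    except (OverflowError, struct.error):
        return False  # magnitude too large for Binary16
    return h == x

# Constants like 1.5 fit the Binary16 form; M_PI does not, so it would
# need the full 64-bit immediate.
print(hex(double_bits(math.pi)), fits_binary16(math.pi), fits_binary16(1.5))
```

Constants such as 0.5, 1.0, or 1.5 fit the short form, while most "real" constants like M_PI need all 64 bits, which is the argument for having an `LI Fn, Imm64` style encoding at all.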
Though, if one has this, the other constant cases (2xFP16 or 4xFP8) would be merely space-saving (mostly relevant to FP-SIMD vector literals), but may be lower priority mostly as they are infrequently used (and thus the space savings are less significant).
Relative cost-difference is small, if one assumes an implementation where the constant-load cases use the same converters as used for the normal vector conversion path, which would be (presumably) already present.
Most of this would be largely irrelevant to Doom performance, but would be relevant if I want to try to make GLQuake work at some semblance of usable in RV Mode.
Less immediate relevance to SW Quake, which uses mostly scalar FPU (and mostly naively represents vectors as in-memory pointers).
In this stuff, I have also started running into the annoyance of differences and additions/removals/changes between different versions of the BitManip spec / B extension. A few useful ops were removed in newer versions, ...
My Jumbo prefix encoding would have conflicted with an earlier version of BitManip, but does not conflict with the current form of the B extension (it exists in the shadow of previously-removed instructions).
Felt curious and looked, it looks like the person mostly responsible for the B extension has largely "gone quiet" for the past year or so (no recent social media posts, has seemingly taken down all of their past YouTube and Twitch contents; minimal activity on GitHub). Not entirely sure what is going on there.
...
Otherwise, did see a video talking some about performance of Doom and Quake and similar on older systems:
Doom apparently required something like a 486 DX2-66 to perform well.
Quake apparently required a faster Pentium system to be playable.
Apparently, likewise for Hexen, ...
Apparently Wolf3D needed a higher-end 386 to perform well.
Even if it could technically run on a 286.
...
I guess this differs from my prior understanding that Doom would have been mostly playable on a 25 MHz 386 or similar. Apparently, not really.
So, I guess I can feel not quite as bad about the lackluster framerates from Quake and Hexen on a 50MHz core. Seemingly, it is in-fact still outperforming vintage (early 90s) PCs.
Well, and Quake3 is pretty slow, but IIRC, PCs of that era were generally pushing 1GHz, so...
...