On 3/6/2025 10:09 PM, Lawrence D'Oliveiro wrote:
On Fri, 7 Mar 2025 02:27:59 -0000 (UTC), Waldek Hebisch wrote:
VAX intstructions are very complex and much of that complexity is hard
to use in compilers.
A lot of them mapped directly to common high-level operations. E.g. MOVC3/
MOVC5 for string copying, and of course POLYx for direct evaluation of
polynomial functions.
In a way, one could say that, in many ways, VAX machine language was a
higher-level language than Fortran.
In a lot of this, it is almost a mystery if one could get something similar with a simpler/cheaper ISA by having a pseudo-instruction that does, say:
Move two register inputs to fixed special registers;
Branch to a location within a table supplied in another SPR.
Say:
RTTCALL 27, R18, R20
Which does the functional equivalent of, say, in RV terms:
MV X10, X18
MV X11, X20
JALR X1, 108(RTT)
Where RTT is a dedicated "Runtime Table" register (that is hard-coded in the decoder logic).
...
In other news, got around to implementing the BITMOV logic in BGBCC and similar, and implementing support for Verilog style notation.
So, writing things like:
y[55:48]=x[19:12];
Is now technically possible in my C variant, and (if the BITMOV instruction is enabled) be encoded in a single instruction.
Where:
y[55:48]=x[19:12];
Is 1 instruction, main requirement is that the source and destination bitfield are the same width (widening will require multiple ops).
If the instruction is not enabled, the fallback path is 4 to 6 instructions (in BJX2), or around 12 to 16 instructions for RV64G. I decided to also add support for BITMOV to the RV decoder via my jumbo-extension encodings (though, with some limitations on the 128-bit case due to it needing to fit into a 21 bit immediate).
And:
j=x[19:12];
Also a single instruction, or 2 or 3 in the fallback case (encoded as a shift and mask).
It is possible I could add something to infer a BITMOV from "(x>>SHL)&MASK", though inferring bitfield insert is likely to be too difficult for now.
At present, XG3 and RV will have a slight advantage over XG1 and XG2 for this instruction in that they have access to a zero register, which as-is doing a bitfield extract via BITMOV will require having access to 0. I am half tempted to consider whether it may make sense to special-case the decoding to interpret R1 on the Ro/Rp input as meaning ZZR (well, or just live with the compiler needing to load 0 into a register to use it as a bitfield extract, albeit slightly diminishing its advantage over shift+mask in this case; though, statistically, 0 is likely to have already been loaded into a register either way).
For a simple test:
lj[ 7: 0]=li[31:24];
lj[15: 8]=li[23:16];
lj[23:16]=li[15: 8];
lj[31:24]=li[ 7: 0];
Does seem to compile down to 4 instructions.
Though, looking at the compiler code, it would be subject to the "side effects in lvalue may be applied twice" bug:
(*ct++)[19:12]=(*cs++)[15:8];
Will result in the 'ct++' part happening twice (this is a pesky bug that is prone to reappear). Though, a lazy option could be, since this is a weird extension anyways, to simply claim it as undefined if the lvalue expression has side effects.
Maybe will add, say:
(_BitInt(N)) { x, y, z };
Some uncertainties remain for how to approach adding VL support to BGBCC:
How to best approach instantiating modules;
How to best deal with "always" blocks;
High-level control flow is very different in Verilog vs C;
How to best deal with 'case' and 'casez'.
They don't exactly match up 1:1 with a C switch block.
It may make sense to special-case some "case()" blocks into lookup tables. In small cases, "casez" could be turned into a lookup table or a generic switch, but other cases might require linear pattern matching, which would be slow...
I have some ideas for special cases, not no idea for a good "generic case" strategy that would be much faster than O(n) ...
Best I can come up with is:
Walk all patterns, generate a mask of used bits;
Generate a hash table of (bits&mask) pointing to a "guess" entry;
Iterating over every possible bit-pattern.
Only hitting for the first matching entry.
If guessed match fails, it falls back to a linear search.
Landing on the first possible match and then doing a linear search for subsequent matches has usually worked in the past. Worst case is still a linear search though. Since casez patterns are formally not allowed to overlap, it should be possible to sort them into ascending order (which may reduce the probability of adverse cases) but is debatable.
Either way, it is a lot more complicated than a normal "switch()".
For module instantiation:
Probably the module itself is compiled with each "always @*" block or similar as a function;
Any "assign" statements can probably also be lumped together;
The module instance is itself turned into an object instance, with all of its registers as object fields;
The blocks are then essentially virtual methods.
The most naive way to do this:
Each module has an implicit "Run()" methods, which runs all of the always blocks, and selectively runs "posedge" or "negedge" blocks if the selected clock has changed (afterwards, the new clock is assigned to a variable holding the last-known value, so that the posedge/negedge only triggers when the value changes).
But, I am starting to have doubts as to the sanity of trying to add support for Verilog to BGBCC...