On 12/18/2024 6:08 AM, bart wrote:
On 17/12/2024 18:51, BGB wrote:
On 12/17/2024 6:04 AM, bart wrote:
C can apparently compile to WASM via Clang, so I tried this program:
void F(void) {
    int i=0;
    while (i<10000) ++i;
}
which compiled to 128 lines of WASM (technically, some form of 'WAT', as WASM is a binary format). The 60 lines corresponding to F are shown below, and below that is my own stack IL code.
I'm not even sure what format that code is in, as WAT is supposed to use S-expressions. The generated code is flat. It differs in other ways from examples of WAT.
Dunno there...
It looks like WASM has changed slightly from what I remember when I originally looked at it, so it could be "possible" if it could be made to support separate compilation and similar.
Hmm... It looks like the WASM example is already trying to follow SSA rules, then mapped to a stack IL... Not necessarily the best way to do it IMO.
I hadn't considered that SSA could be represented in stack form.
But couldn't each push be converted to an assignment to a fresh variable, and the same with pop?
As for Phi functions, the only similar thing I encounter (but could be mistaken), is when there is a choice of paths to yield a value (such as (c ? a : b) in C; my language has several such constructs).
I was mostly noting that it appeared that every operation was creating a new variable and only assigning to it once.
I didn't look too much more closely than this, only to note that it was different.
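As a rough illustration of the push-to-fresh-variable idea raised above, here is a minimal sketch in C. The tiny op set (OP_PUSH_LOCAL, OP_PUSH_CONST, OP_ADD, OP_STORE_LOCAL) and the function name are made up for the example and are not taken from WASM, RIL, or bart's IL. Only the temporaries end up single-assignment; stores back to locals are not, so this is not full SSA.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical mini stack IL, for illustration only. */
typedef enum { OP_PUSH_LOCAL, OP_PUSH_CONST, OP_ADD, OP_STORE_LOCAL } OpKind;
typedef struct { OpKind kind; int arg; } StackOp;

/* Rewrites stack ops as textual 3AC, one fresh temp per pushed value.
   Returns the number of temps created, or -1 on stack underflow,
   stack overflow, or insufficient output space. */
int stack_to_3ac(const StackOp *ops, int n, char *out, size_t cap)
{
    int stack[64];      /* virtual stack: holds temp numbers, not values */
    int sp = 0;
    int next_temp = 0;
    size_t len = 0;

    if (cap == 0) return -1;
    out[0] = '\0';
    for (int i = 0; i < n; i++) {
        char line[64];
        int a, b, t;
        switch (ops[i].kind) {
        case OP_PUSH_LOCAL:
            if (sp >= 64) return -1;
            t = next_temp++;
            snprintf(line, sizeof line, "t%d = L%d\n", t, ops[i].arg);
            stack[sp++] = t;
            break;
        case OP_PUSH_CONST:
            if (sp >= 64) return -1;
            t = next_temp++;
            snprintf(line, sizeof line, "t%d = %d\n", t, ops[i].arg);
            stack[sp++] = t;
            break;
        case OP_ADD:
            if (sp < 2) return -1;
            b = stack[--sp];    /* right operand was pushed last */
            a = stack[--sp];
            t = next_temp++;
            snprintf(line, sizeof line, "t%d = t%d + t%d\n", t, a, b);
            stack[sp++] = t;
            break;
        case OP_STORE_LOCAL:
            if (sp < 1) return -1;
            a = stack[--sp];
            snprintf(line, sizeof line, "L%d = t%d\n", ops[i].arg, a);
            break;
        default:
            return -1;
        }
        size_t add = strlen(line);
        if (len + add + 1 > cap) return -1;
        memcpy(out + len, line, add + 1);   /* copies the terminator too */
        len += add;
    }
    return next_temp;
}
```

Feeding it the four ops for "++i" (push local 0, push constant 1, add, store local 0) gives three temps, each assigned exactly once.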
With stack code, the result conveniently ends up on top of the stack whichever path is taken, which is a big advantage. Unless you then have to convert that to register code, and need to ensure the values end up in the same register when the control paths join up again.
With JVM, the rule was that all paths landing at the same label need to have the same stack depth and same types.
With .NET, the rule was that the stack was always empty, any merging would need to be done using variables.
BGBCC is sorta mixed:
In most cases, it follows the .NET rule;
A special-case exception exists mostly for implementing the ?: operation (which in turn has special stack operations to signal its use).
BEGINU // start a ?: operator
L0:
... //one case
SETU
JMP L2
L1:
... //other case
SETU
JMP L2
ENDU
L2:
This is a bit of wonk; if I were designing it now, I would likely do it the same way as .NET, and use temporary variables.
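For comparison, the temporary-variable approach (stack empty at every branch, with the merge done through a variable) looks roughly like this when written out as C with gotos. This is only a sketch of the idea; the function name and the variable tmp are invented for the example, and the labels merely mirror L0/L1/L2 in the listing above.

```c
/* Lowering of (c ? a : b) with no value live on the stack across a branch:
   both arms store into the same temporary, and the join point reads it. */
int select_with_temp(int c, int a, int b)
{
    int tmp;            /* compiler-generated temporary (hypothetical name) */

    if (c) goto L0;
    goto L1;
L0:
    tmp = a;            /* one case */
    goto L2;
L1:
    tmp = b;            /* other case */
    goto L2;
L2:
    return tmp;         /* merged value, no special stack ops needed */
}
```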
Actually, I might be tempted to use a 3AC IR as well (though, probably non-SSA). And, probably design things a bit differently.
In this case, if I did a 3AC IR, might design a textual syntax along similar lines to BASIC or FORTRAN 77 (albeit probably without the fixed-column formatting or line numbers).
Though, the nominal format for use in the compiler would remain binary.
But, yeah, in BGBCC I am also using a stack-based IL (RIL), which follows rules more in a similar category to .NET CIL (in that, stack items carry type, and the stack is generally fully emptied on branch).
In my IL, labels are identified with a LABEL opcode (with an immediate), and things like branches work by having the branch target and label having the same immediate (label ID).
So, you jump to label L123, and the label looks like:
L123:
Yeah, in textual form.
Though, the label is internally represented as, say:
LABEL 123
IIRC, usually numbering starts over from 0 for each function, though in the backend IR all labels get a unique number within a 24-bit numbering space.
The labels are then split into several categories:
Global labels, used to identify functions/variables, with an associated name;
IL labels, which were mapped over from the RIL bytecode;
Temporary labels, which exist solely in the backend;
Line numbers, not true labels, mostly exist to convey line-number info (associated with a file-name and line number);
Special/Architectural, used as placeholders for things like CPU registers (for variable load/store).
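One way such a label space could be represented is sketched below in C. The category names follow the list above, but the layout is purely an assumption for illustration: the post only says that labels get a unique number within a 24-bit space, not how (or whether) the category is stored alongside it. Here the category is simply placed in the top 8 bits of a 32-bit word.

```c
#include <stdint.h>

/* Categories from the list above; the enum names are hypothetical. */
typedef enum {
    LBL_GLOBAL,     /* functions/variables, with an associated name */
    LBL_IL,         /* mapped over from the stack bytecode */
    LBL_TEMP,       /* exist solely in the backend */
    LBL_LINE,       /* line-number info, not true labels */
    LBL_SPECIAL     /* placeholders for things like CPU registers */
} LabelCategory;

#define LABEL_ID_BITS 24
#define LABEL_ID_MASK ((1u << LABEL_ID_BITS) - 1u)

/* ASSUMPTION: category in bits 24..31, 24-bit ID in bits 0..23. */
uint32_t label_pack(LabelCategory cat, uint32_t id)
{
    return ((uint32_t)cat << LABEL_ID_BITS) | (id & LABEL_ID_MASK);
}

LabelCategory label_category(uint32_t label)
{
    return (LabelCategory)(label >> LABEL_ID_BITS);
}

uint32_t label_id(uint32_t label)
{
    return label & LABEL_ID_MASK;
}
```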
I think that is pretty standard! But it sounds like you use a very tight encoding for bytecode, while mine uses a 32-byte descriptor for each IL instruction.
(One quibble with labels is whether a label definition occupies an actual IL instruction. With my IL used as a backend for static languages, it does. And there can be clusters of labels at the same spot.
With dynamic bytecode designed for interpretation, it doesn't. It uses a different structure. This means labels don't need to be 'executed' when encountered.)
In my interpreters, it always uses a bytecode operation.
However, apart from my very early interpreters, typically the stack IL is not used directly.
So, a personal timeline was like:
2003/2004: BGBScript came into existence
First version used DOM and directly walked the DOM tree.
Used a GC, generated lots of garbage objects;
Syntax was based on JavaScript with some wonk;
Was horridly slow.
2006:
BGBScript VM (BS-VM) was rewritten to S-Expressions internally;
Dropped some of the original wonk, moving to a cleaner JS syntax;
Went to a bytecode interpreter.
2007:
BGBCC was written using the frontend from the 2003 VM as a base;
The IL design was based on 2006 BS-VM;
Replaced the original DOM with a custom stand-in;
Used parts of the 2006 VM as well.
2009:
The BS-VM was modified to turn the stack IL into 3AC and run this;
Also had a JIT and similar by this point;
Using 3AC and JIT made things significantly faster;
Also tended to leak a lot less garbage,
operating mostly at "steady state".
Syntactically, it had become more like ActionScript3 or HaXE.
2013: Created BGBScript2 (BS2)
This mostly resembled a Java/C#/AS3 hybrid;
Eliminated the GC in favor of primarily static + manual MM.
2015/2016: Created the BGBTech2 3D engine
Partly written in a mix of C and BGBScript2
Was my biggest project to use BS2
Then:
2017: Started on my BJX1 project
Revived BGBCC, used it as the compiler.
2019: Rebooted the project to BJX2.
BJX1 quickly turned into a huge mess
which was non-viable to implement in an FPGA.
The BJX2 project has continued up to now.
Some stuff following the design of the BS2 VM was back-ported onto BGBCC, but in many ways, BGBCC has a lot more cruft.
In the BS2 VM, the image format is a TLV container.
There is a string table, data area for functions/etc;
Index tables;
...
Generally, functions could be loaded and converted to 3AC on demand.
The IL in the BS2 VM was not a pure stack machine, but more like:
OP with 2 stack args, stack dest (common with BGBCC)
OP with 2 stack args, local dest (common with BGBCC)
OP with 2 local args, stack dest
OP with 2 local args, local dest (like in 3AC)
OP with local and immediate, stack dest
OP with local and immediate, local dest
OP with local and stack, stack dest
OP with local and stack, local dest
This was more complicated, but reduced the number of IL operations. Internally, it all converted to 3AC for the backend interpreter.
The incentive to do this for BGBCC was less, as folding the local-variable or constant loads into the operator is less immediately beneficial to a compiler, and it does make the bytecode loader more complicated. Folding the destination register into the bytecode ops is still relevant in many cases, as it is comparably harder to fold the destination store into the 3AC op than to fold a source load.
Generally, bytecode ops and operands were encoded with VLNs (variable length numbers).
Generally (numeric VLN):
00..7F: 0..127
80..BF XX: 128..16383
C0..DF XX XX: 16384..2M
...
These values were encoded in MSB first order, and could directly represent values up to 64 bits (in both the BS2VM and BGBCC, 128-bit values tend to be represented as pairs of 64-bit values).
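A minimal sketch of the three forms listed above, in C. It assumes the lead-byte ranges are 00..7F, 80..BF, and C0..DF (so the high bits of the first byte select the length), with the remaining bits stored MSB first. The longer forms are elided in the list, so the sketch just refuses anything above 21 bits rather than guessing at them. The function names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes v using the 1..3 byte forms only.
   Returns bytes written, or 0 if v needs a longer (elided) form
   or the buffer is too small. */
size_t vln_encode(uint32_t v, uint8_t *out, size_t cap)
{
    if (v < 0x80u) {                        /* 0xxxxxxx */
        if (cap < 1) return 0;
        out[0] = (uint8_t)v;
        return 1;
    }
    if (v < 0x4000u) {                      /* 10xxxxxx XX */
        if (cap < 2) return 0;
        out[0] = (uint8_t)(0x80u | (v >> 8));
        out[1] = (uint8_t)(v & 0xFFu);
        return 2;
    }
    if (v < 0x200000u) {                    /* 110xxxxx XX XX */
        if (cap < 3) return 0;
        out[0] = (uint8_t)(0xC0u | (v >> 16));
        out[1] = (uint8_t)((v >> 8) & 0xFFu);
        out[2] = (uint8_t)(v & 0xFFu);
        return 3;
    }
    return 0;                               /* longer forms not sketched */
}

/* Decodes one VLN from in[0..len). Returns bytes consumed, or 0 on
   truncated input or a lead byte outside the forms handled here. */
size_t vln_decode(const uint8_t *in, size_t len, uint32_t *v)
{
    if (len < 1) return 0;
    uint8_t b0 = in[0];
    if (b0 < 0x80u) {
        *v = b0;
        return 1;
    }
    if (b0 < 0xC0u) {
        if (len < 2) return 0;
        *v = ((uint32_t)(b0 & 0x3Fu) << 8) | in[1];
        return 2;
    }
    if (b0 < 0xE0u) {
        if (len < 3) return 0;
        *v = ((uint32_t)(b0 & 0x1Fu) << 16) | ((uint32_t)in[1] << 8) | in[2];
        return 3;
    }
    return 0;
}
```

With this layout, 128 encodes as 80 80 and 16384 as C0 40 00.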
For signed integer values, the sign was folded into the LSB.
Floating point values were represented as a base/exponent VLN pair.
Basically, an integer value scaled by a power-of-2 exponent.
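The signed and floating-point cases could look something like the following C sketch. The exact sign mapping is not spelled out above, so the zigzag-style fold here (0, -1, 1, -2, ... mapping to 0, 1, 2, 3, ...) is an assumption; a sign-magnitude layout with the sign bit in the LSB would fit the description equally well. The function names are hypothetical, and the float helper only shows the decode direction (integer base scaled by a power of 2).

```c
#include <stdint.h>

/* ASSUMED zigzag-style fold: the sign ends up in bit 0. */
uint64_t fold_sign(int64_t n)
{
    uint64_t u = (uint64_t)n << 1;
    return (n < 0) ? ~u : u;
}

/* Inverse of fold_sign. */
int64_t unfold_sign(uint64_t u)
{
    return (int64_t)(u >> 1) ^ -(int64_t)(u & 1u);
}

/* Rebuilds a floating-point value from a (base, exponent) pair:
   result = base * 2^exp2. Uses plain loops to avoid needing libm. */
double scaled_to_double(int64_t base, int exp2)
{
    double r = (double)base;
    for (; exp2 > 0; exp2--) r *= 2.0;
    for (; exp2 < 0; exp2++) r *= 0.5;
    return r;
}
```

Under these assumptions, 1.5 would be carried as the pair (3, -1), with each half then written as a signed VLN.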
Opcodes were different, IIRC:
00..DF: Single Byte
E0..EF: Two Byte (224..4095)
F0..F7: Three Byte
...
But, generally, only 1 and 2 byte cases were used.
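A small C sketch of decoding the one- and two-byte opcode forms follows. It assumes the two-byte form keeps the low 4 bits of the lead byte as the high bits of the opcode, MSB first like the VLNs, which matches the 224..4095 range given above but is otherwise a guess. The three-byte and longer forms are not sketched, since their layout is not given. The function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Decodes one opcode from in[0..len). Returns bytes consumed, or 0 on
   truncated input or a lead byte in the F0.. range (not handled here). */
size_t opcode_decode(const uint8_t *in, size_t len, uint32_t *op)
{
    if (len < 1) return 0;
    if (in[0] < 0xE0u) {                /* 00..DF: single byte */
        *op = in[0];
        return 1;
    }
    if (in[0] < 0xF0u) {                /* E0..EF: two bytes, 12 bits */
        if (len < 2) return 0;
        *op = ((uint32_t)(in[0] & 0x0Fu) << 8) | in[1];
        return 2;
    }
    return 0;                           /* F0 and up: layout not sketched */
}
```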
IIRC, did not define a textual notation for the BS2VM's ASM.
Local variables, labels, etc, were all identified as numeric indices.
Typically a single byte.
Like JVM, and unlike BGBCC, in the BS2VM, all the variables (including arguments) were held in an array of local variables (BGBCC has locals, arguments, and temporaries, as 3 separate spaces).
IIRC, BS2VM had still used variable type-tagging (like BGBCC and .NET), rather than the untyped variables with typed operators scheme (what JVM had used).
But typed operators make more sense if you intend to interpret the stack bytecode directly, which was generally not done in my VMs (except in very early versions). Otherwise, implicitly typed operators probably make more sense.
...