On 12/19/2024 5:27 AM, bart wrote:
On 19/12/2024 05:46, BGB wrote:
On 12/18/2024 6:35 PM, bart wrote:
On 19/12/2024 00:27, BGB wrote:
By-Value Structs smaller than 16 bytes are passed as-if they were a 64 or 128 bit integer type (as a single register or as a register pair, with a layout matching their in-memory representation).
>
...
But, yeah, at the IL level, one could potentially eliminate structs and arrays as a separate construct, and instead have bare pointers and a generic "reserve a blob of bytes in the frame and initialize this pointer to point to it" operator (with the business end of this operator happening in the function prolog).
>
The problem with this, that I mentioned elsewhere, is how well it would work with SYS V ABI, since the rules for structs are complex, and apparently recursive.
>
Having just a block of bytes might not be enough.
>
In my case, I am not bothering with the SysV style ABI's (well, along with there not being any x86 or x86-64 target...).
I'd imagine it's worse with ARM targets as there are so many more registers to try and deconstruct structs into.
Not messed much with the ARM64 ABI or similar, but I will draw the line in the sand somewhere.
Struct passing/return is enough of an edge case that one can just sort of declare it "no go" between compilers with "mostly but not strictly compatible" ABIs.
>
For my ISA, it is a custom ABI, but follows mostly similar rules to some of the other "Microsoft style" ABIs (where, I have noted that across multiple targets, MS tools have tended to use similar ABI designs).
When you do your own thing, it's easy.
In the 1980s, I didn't need to worry about call conventions used for other software, since there /was/ no other software! I had to write everything, save for the odd calls to DOS which used some form of SYSCALL.
Then, arrays and structs were actually passed and returned by value (not via hidden references), by copying the data to and from the stack.
However, I don't recall ever using the feature, as I considered it inefficient. I always used explicit references in my code.
Most of the time, one is passing/returning structures as pointers, and not by value.
By value structures are usually small.
When a structure is not small, it is both simpler to implement, and usually faster, to internally pass it by reference.
If you pass a large structure to a function by value, via an on-stack copy, and the function assigns it to another location (say, a global variable):
Pass by reference: Only a single copy operation is needed;
Pass by value on-stack: At least two copy operations are needed.
One also needs to reserve enough space in the function arguments list to hold any structures passed, which could be bad if they are potentially large.
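To make the copy counting concrete, here is a minimal C sketch. The struct, its 256-byte size, and the function names are all hypothetical, and an optimizing compiler may well elide some of these copies; the comments describe the naive on-stack case only.

```c
/* Hypothetical large structure; the 256-byte payload is an arbitrary choice. */
typedef struct { unsigned char data[256]; } BigStruct;

/* Destination standing in for "a global variable". */
static BigStruct g_saved;

/* By value, on-stack: the caller copies the struct into the argument
   area (copy 1), then the assignment copies it again (copy 2). */
void save_by_value(BigStruct s)
{
    g_saved = s;
}

/* By reference: only the assignment touches the data (copy 1). */
void save_by_reference(const BigStruct *s)
{
    g_saved = *s;
}
```

Both functions leave the same bytes in `g_saved`; the difference is only in how much data moves to get there.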
But, on my ISA, ABI is sort of like:
R4 ..R7 : Arg0 ..Arg3
R20..R23: Arg4 ..Arg7
R36..R39: Arg8 ..Arg11 (optional)
R52..R55: Arg12..Arg15 (optional)
Return Value:
R2, R3:R2 (128 bit)
R2 is also used to pass in the return value pointer.
'this':
Generally passed in either R3 or R18, depending on ABI variant.
Where, callee-save:
R8 ..R14, R24..R31,
R40..R47, R56..R63
R15=SP
Non-saved scratch:
R2 ..R7 , R16..R23,
R32..R39, R48..R55
Arguments beyond the first 8/16 register arguments are passed on stack. In this case, a spill space for the first 8/16 arguments (64 or 128 bytes) is provided on stack before the first non-register argument.
If the function accepts a fixed number of arguments and the number of argument registers is 8 or less, spill space need only be provided for the first 8 arguments (calling vararg functions will always reserve space for 16 registers in the 16-register ABI). This spill space effectively belongs to the callee rather than the caller.
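The register assignment above follows a regular pattern (each group of four arguments sits 16 registers above the previous one), so it can be sketched as a small formula. The function names here are made up for illustration, and 8 bytes per spilled register is inferred from the 64/128-byte figures.

```c
#include <assert.h>

/* Map an argument index (0..15) to the register number carrying it:
   R4..R7, R20..R23, R36..R39, R52..R55. */
int arg_register(int index)
{
    assert(index >= 0 && index < 16);
    return 4 + (index & 3) + 16 * (index >> 2);
}

/* Bytes of spill space reserved on the stack: 8 bytes per register
   argument, so 64 bytes for the 8-register ABI and 128 for the
   16-register ABI. */
int spill_bytes(int num_arg_regs)
{
    assert(num_arg_regs == 8 || num_arg_regs == 16);
    return num_arg_regs * 8;
}
```

For example, `arg_register(5)` gives 21 (R21) and `arg_register(12)` gives 52 (R52).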
Structures (by value):
1.. 8 bytes: Passed in a single register
9..16 bytes: Passed in a pair, padded to the next even pair
17+: Pass as a reference.
Things like 128-bit types are also passed/returned in register pairs.
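The size rules for by-value structures could be written as a classifier along these lines. The enum and function names are hypothetical, and reading "padded to the next even pair" as "round the argument slot up to an even number" is my interpretation rather than something stated outright.

```c
#include <stddef.h>

typedef enum {
    PASS_ONE_REG,       /* 1..8 bytes   */
    PASS_REG_PAIR,      /* 9..16 bytes  */
    PASS_BY_REFERENCE   /* 17+ bytes    */
} PassKind;

/* Decide how a by-value struct of the given size is passed. */
PassKind classify_struct(size_t size_bytes)
{
    if (size_bytes <= 8)
        return PASS_ONE_REG;
    if (size_bytes <= 16)
        return PASS_REG_PAIR;
    return PASS_BY_REFERENCE;
}

/* Assumed alignment rule: a register pair starts on an even argument
   slot, so an odd next-free slot is skipped. */
int first_slot_for_pair(int next_free_slot)
{
    return (next_free_slot + 1) & ~1;
}
```

So a 12-byte struct arriving when slot 1 is next would skip to slots 2 and 3 under this reading.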
Contrast, RV ABI:
X10..X17 are used for arguments;
No spill space is provided;
...
My variant uses similar rules to my own ABI for passing/returning structures, with:
X28, structure return pointer
X29, 'this'
Normal return values go into X10 or X11:X10.
Note that in both ABIs, passing 'this' in a register would mean that class instances and COM objects are not equivalent (COM object methods always pass 'this' as the first argument).
The 'this' register is implicitly also used by lambdas to pass in the pointer to the captured bindings area (which mostly resembles a structure containing each variable captured by the lambda).
Can note though that in this case, capturing a binding by reference means the lambda is limited to automatic lifetime (non-automatic lambdas may only capture by value). In this case, capture by value is the default.
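As a rough picture of what the captured bindings area looks like, here is a C sketch of a by-value capture. All names are invented for illustration, and plain C has to pass the environment pointer as a visible argument, whereas in the ABIs described it would travel in the 'this' register.

```c
/* Hypothetical captured-bindings area for a lambda that captures two
   ints by value: just a struct with one field per captured variable. */
typedef struct {
    int base;
    int scale;
} Captures;

/* Lambda body lowered to an ordinary function taking the environment. */
int lambda_body(const Captures *env, int x)
{
    return env->base + env->scale * x;
}

/* Creating the lambda copies the current values (capture by value), so
   the result stays usable after the creating function has returned. */
Captures make_lambda(int base, int scale)
{
    Captures env = { base, scale };
    return env;
}
```

A by-reference capture would instead store pointers to the creator's locals in the struct, which is why such a lambda cannot outlive the creating frame.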
For my compiler targeting RISC-V, it uses a variation of RV's ABI rules.
Argument passing is basically similar, but struct pass/return is different; and it passes floating-point values in GPRs (and, in my own ISA, all floating-point values use GPRs, as there are no FPU registers; though FPU registers do exist for RISC-V).
Supporting C's variadic functions, which is needed for many languages when calling C across an FFI, usually requires different rules. The Win64 ABI, for example, handles this by passing the low variadic arguments in both GPRs and FPU registers.
I simplified things by assuming only GPRs are used.
/Implementing/ variadic functions (which only occurs if implementing C) is another headache if it has to work with the ABI (which can be assumed for a non-static function).
I barely have a working solution for Win64 ABI, which needs to be done via stdarg.h, but wouldn't have a clue how to do it for SYS V.
(Even Win64 has problems, as it assumes a downward-growing stack; in my IL interpreter, the stack grows upwards!)
Most targets use a downward growing stack.
Mine is no exception here...
Not likely a huge issue as one is unlikely to use ELF and PE/COFF in the same program.
For the "OS" that runs on my CPU core, it is natively using PE/COFF, but
That's interesting: you deliberately used one of the most complex file formats around, when you could have devised your own?
For what I wanted, I would have mostly needed to recreate most of the same functionality as PE/COFF anyways.
When one considers the entire loading process (including DLLs/SOs), then PE/COFF loading is actually simpler than ELF loading (ELF subjects the loader to needing to deal with symbol and relocation tables), similar to PIE loading.
Things like the MZ stub are optional in my case, and mostly ignored if present (in my LZ compressed PE variants, the MZ stub is omitted entirely).
I had at one point considered doing a custom format resembling LZ compressed MachO, but ended up not bothering, as it wouldn't have really saved anything over LZ compressed PE/COFF.
Some "unneeded cruft" like the Resource Section was discarded, mostly replaced by an embedded WAD2 image. The header was modified some to allow for backwards compatibility with the Windows format (mostly creating a dummy header in the original format that points to the WAD2 directory).
Idea is that icons, bitmaps, and other things, would mostly be held in WAD lumps. Though, resources which may be accessed via symbols in the EXE/DLL need to be stored uncompressed (where "__rsrc_lumpname" may be used to access the contents of resource-section lumps as an extern symbol).
Say, for example:
extern byte __rsrc_mybitmap[]; //resolves to a DIB/BMP or similar
For now, resource formats:
Images:
BMP (various settings)
4, 8, and 16 bpp typical
Supports a non-standard 16-bpp alpha-blended mode (*1).
Supports non-standard 16 color and 256 color modes with a transparent color.
Supports CRAM BMP as well (2 bpp)
QOI (assumes RGBA32, nominally lossless)
QOI is a semi-simplistic non-entropy-coded format.
Can give PNG-like compression in some cases.
Reasonably fast/cheap to decode.
LCIF, custom lossy format, color-cell compression.
OK Q/bpp but mostly only on the low-end.
Resembles a QOI+CRAM hybrid.
UPIC, lossy or lossless, JPEG-like (*2)
*1:
0rrrrrgggggbbbbb Normal/Opaque
1rrrraggggabbbba With 3 bit alpha (4b/ch RGB).
For 16 and 256 color, a variant is supported with a transparent color.
Generally the high intensity magenta is reused as the transparent color. This is encoded in the color palette (if all colors apart from one have the alpha bits set to FF, and one color has 00, then that color is assumed to be a transparent color).
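The two 16-bpp layouts in (*1) could be unpacked roughly as follows. The bit positions come straight from the layout strings; the order in which the three scattered alpha bits are combined, the bit-replication used to widen channels, and the 0xAARRGGBB output packing are assumptions, and the function name is hypothetical.

```c
#include <stdint.h>

/* Expand one 16-bpp pixel to 32-bit 0xAARRGGBB. */
uint32_t px16_to_rgba32(uint16_t px)
{
    uint32_t r, g, b, a;

    if (!(px & 0x8000)) {
        /* 0rrrrrgggggbbbbb: 5 bits per channel, opaque */
        r = (px >> 10) & 31;
        g = (px >> 5) & 31;
        b = px & 31;
        r = (r << 3) | (r >> 2);
        g = (g << 3) | (g >> 2);
        b = (b << 3) | (b >> 2);
        a = 255;
    } else {
        /* 1rrrraggggabbbba: 4 bits per channel, one alpha bit after each */
        r = (px >> 11) & 15;
        g = (px >> 6) & 15;
        b = (px >> 1) & 15;
        /* Assumed order: alpha bit after R is the most significant. */
        a = (((px >> 10) & 1u) << 2) | (((px >> 5) & 1u) << 1) | (px & 1u);
        r = (r << 4) | r;
        g = (g << 4) | g;
        b = (b << 4) | b;
        a = (a << 5) | (a << 2) | (a >> 1);
    }
    return (a << 24) | (r << 16) | (g << 8) | b;
}
```

With this mapping, 0x7FFF is opaque white and 0x8000 is fully transparent black.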
CRAM 2 bpp: Uses a limited form of the 8-bit CRAM format:
16 bits, 4x4 pixels, 1 bit per pixel
2x 8 bits: Color Endpoints
The rest of the format is unsupported, so a decoder can simply assume a fixed 32 bits per 4x4 pixel cell.
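A decoder for one such fixed-size cell might look like the sketch below. The byte order of the mask, which bit maps to which pixel, and which endpoint a set bit selects are all assumptions here, not taken from the format description; the function name is hypothetical.

```c
#include <stdint.h>

/* Decode one 32-bit cell (16-bit pixel mask + two 8-bit palette
   indices) into 16 palette indices for a 4x4 block. */
void cram_decode_cell(const uint8_t cell[4], uint8_t out[16])
{
    /* Assumed little-endian mask in the first two bytes. */
    uint16_t mask = (uint16_t)(cell[0] | (cell[1] << 8));
    uint8_t color_a = cell[2];
    uint8_t color_b = cell[3];
    int i;

    /* Assumed: bit i belongs to pixel i; a set bit picks color_b. */
    for (i = 0; i < 16; i++)
        out[i] = ((mask >> i) & 1) ? color_b : color_a;
}
```

Since every cell is exactly 4 bytes, cell N of an image starts at byte offset 4*N, with no need to parse the stream to find it.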
*2: The UPIC format is structurally similar to JPEG, but:
Uses TLV packaging (vs FF-escape tagging);
Uses Rice coding (vs Huffman)
Uses Z3.V5 VLC, vs Z4.V4
Uses Block-Haar and RCT
Vs DCT and YCbCr.
Supports an alpha channel.
Y 1 (*2A)
YA 1:1 (*2A)
YUV 4:2:0
YUV 4:4:4 (*2A)
YUVA 4:2:0:4
YUVA 4:4:4:4 (*2A)
*2A: May be used in the lossless modes, depending on image.
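For reference, one well-known reversible color transform is the RCT from JPEG 2000, sketched below. Whether UPIC uses this exact variant is not stated above, so treat it as an illustration of the idea only; the function names are hypothetical, and the code assumes `>>` on negative ints is an arithmetic shift (true on typical compilers).

```c
/* Forward RCT (JPEG 2000 style): integer luma plus two differences. */
void rct_forward(int r, int g, int b, int *y, int *u, int *v)
{
    *y = (r + 2 * g + b) >> 2;
    *u = b - g;
    *v = r - g;
}

/* Exact inverse: recover G from luma and the two differences first. */
void rct_inverse(int y, int u, int v, int *r, int *g, int *b)
{
    *g = y - ((u + v) >> 2);
    *r = v + *g;
    *b = u + *g;
}
```

The point versus YCbCr is that no rounding error is introduced, so the transform can be used in the lossless modes.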
VLC coding resembles Deflate's match distance encoding, with sign-folded values. Runs of zero coefficients have a shorter limit, but similar. Like with JPEG, a 0x00 symbol encodes an early EOB.
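Sign folding maps signed coefficients onto non-negative integers so that small magnitudes get small codes. A minimal sketch of the usual mapping (0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ...) follows; whether UPIC puts negatives on the odd or the even codes is an assumption, and the function names are hypothetical.

```c
#include <stdint.h>

/* Fold: non-negative v -> 2*v, negative v -> 2*(-v) - 1. */
uint32_t sign_fold(int32_t v)
{
    if (v < 0)
        return ((uint32_t)(-(v + 1)) << 1) | 1u;
    return (uint32_t)v << 1;
}

/* Unfold: the low bit says whether the original value was negative. */
int32_t sign_unfold(uint32_t u)
{
    if (u & 1u)
        return -(int32_t)(u >> 1) - 1;
    return (int32_t)(u >> 1);
}
```

The folded value is then what gets fed to the prefix-plus-extra-bits coder.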
In tests, on my main PC:
Vs JPEG: It is a little faster
Q/bpp is similar, better/worse depends on image.
Slightly worse on photos, but "similar".
Generally somewhat better on artificial images.
Vs PNG:
Faster to decode (with less memory overhead);
Better compression on many images (particularly photo-like).
Note that UPIC was designed to not require any large intermediate buffers, so will decode directly to an RGB555 or RGBA32 output buffer (decoding happens in terms of individual 16x16 pixel macroblocks).
It was designed to be moderately fast and to try to minimize memory overhead for decoding (vs either PNG or JPEG, which need a more significant chunk of working memory to decode).
Block-Haar is a Haar transform made to fit the same 8x8 pixel blocks as DCT, where Haar maps (A,B)->(C,D):
C=(A+B)/2 (*: X/2 here being defined as (X>>1))
D=A-B
But, can be reversed exactly, IIRC:
B=C-(D/2)
A=B+D
By doing multiple stages of Haar transform, one can build an 8-pixel version, and then use horizontal and vertical transforms for an 8x8 block. It is computationally fairly cheap, and lossless.
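The pair transform and its exact inverse, plus one way of stacking three stages into an 8-point transform, can be sketched as below. The output ordering of the 8-point version (average first, then coarse-to-fine differences) and the function names are my own choices, not necessarily what UPIC does, and the code assumes `>>` on negative ints is an arithmetic shift.

```c
/* Forward pair step: c is the floor average, d the difference. */
void haar_fwd(int a, int b, int *c, int *d)
{
    *c = (a + b) >> 1;
    *d = a - b;
}

/* Exact inverse of haar_fwd. */
void haar_inv(int c, int d, int *a, int *b)
{
    *b = c - (d >> 1);
    *a = *b + d;
}

/* 8-point forward: three stages; out[0] = overall average,
   out[1] = coarsest difference, out[2..3], then out[4..7] = finest. */
void haar8_fwd(const int in[8], int out[8])
{
    int tmp[8], avg[4], i, n;
    for (i = 0; i < 8; i++) tmp[i] = in[i];
    for (n = 8; n > 1; n >>= 1) {
        int half = n >> 1;
        for (i = 0; i < half; i++)
            haar_fwd(tmp[2 * i], tmp[2 * i + 1], &avg[i], &out[half + i]);
        for (i = 0; i < half; i++) tmp[i] = avg[i];
    }
    out[0] = tmp[0];
}

/* 8-point inverse: undo the stages from coarsest to finest. */
void haar8_inv(const int in[8], int out[8])
{
    int tmp[8], wide[8], i, n;
    tmp[0] = in[0];
    for (n = 2; n <= 8; n <<= 1) {
        int half = n >> 1;
        for (i = 0; i < half; i++)
            haar_inv(tmp[i], in[half + i], &wide[2 * i], &wide[2 * i + 1]);
        for (i = 0; i < n; i++) tmp[i] = wide[i];
    }
    for (i = 0; i < 8; i++) out[i] = tmp[i];
}
```

The inverse works because (A+B)>>1 equals B + (D>>1) when D = A-B, so subtracting D>>1 from C recovers B exactly, including for odd and negative differences.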
The Walsh-Hadamard transform can give similar properties, but generally involves a few extra steps that make it more computationally expensive.
It is possible to use a lifting transform to make a Reversible DCT, but it is slow...
BGBCC accepts JPEG and PNG for input and can convert them to BMP/QOI/UPIC as needed.
For audio storage, generally using the RIFF WAV format. For bulk audio, both A-Law and IMA ADPCM work OK. Granted, IMA ADPCM is not space efficient for stereo, but mostly OK for mono (most common use-case for sound effects).
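For the A-Law case, decoding a sample is cheap. The sketch below is the standard G.711 A-law expansion to 16-bit linear PCM; that the audio data here follows plain G.711 A-law is an assumption, and the function name is hypothetical.

```c
#include <stdint.h>

/* Expand one G.711 A-law byte to a 16-bit linear PCM sample. */
int16_t alaw_to_linear(uint8_t code)
{
    int mant, seg, t;

    code ^= 0x55;                /* undo the even-bit inversion */
    mant = code & 0x0F;          /* 4-bit mantissa */
    seg  = (code >> 4) & 0x07;   /* 3-bit segment (exponent) */

    t = mant << 4;
    if (seg == 0)
        t += 8;
    else
        t = (t + 0x108) << (seg - 1);

    /* In A-law, a set sign bit means a positive sample. */
    return (int16_t)((code & 0x80) ? t : -t);
}
```

Each stored byte becomes one 16-bit sample, so this gives a fixed 2:1 size reduction over 16-bit PCM with no state carried between samples.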
I did exactly that at a period when my generated DLLs were buggy for some reason (it turned out to be two reasons). I created a simple dynamic library format of my own. Then I found the same format worked also for executables.
But I needed a loader program to run them, as Windows obviously didn't understand the format. Such a program can be written in 800 lines of C, and can dynamically load libraries in both my format, and proper DLLs (not the buggy ones I generated!).
A hello-world program is under 300 bytes compared with 2 or 2.5KB of EXE. And the format is portable to Linux, so no need to generate ELF (but I haven't tried). Plus the format might be transparent to AV software (haven't tried that either).
OK.
By design, my PEL format (PE+LZ) isn't going to get under 2K (1K for headers, 1K for LZ'ed sections).
But, usually this is not a problem.