On 1/23/2025 6:51 PM, bart wrote:
On 23/01/2025 20:58, BGB wrote:
On 1/23/2025 4:54 AM, bart wrote:
On 23/01/2025 01:05, James Kuyper wrote:
On 2025-01-22, bart <bc@freeuk.com> wrote:
Gcc 14.1 gives me an error compiling this code:
>
struct vector;
struct scenet;
>
struct vector {
double x;
double y;
double z;
};
>
struct scenet {
struct vector center;
double radius;
struct scenet (*child)[];
};
>
6.7.6.2p2: "The element type shall not be an incomplete or function type."
>
I have many draft versions of the C standard. n2912.pdf, dated
2022-06-08, says in 6.7.2.1.p3 about struct types that "... the type is
incomplete (footnote 144) until immediately after the closing brace of the list
defining the content, and complete thereafter."
>
Therefore, struct scenet is not a complete type until the closing brace
of its declaration.
>
Wouldn't this also be the case here:
>
struct scenet *child;
};
>
The struct is incomplete, but it still knows how to do pointer arithmetic with that member. The calculation is not that different from the array version (actually, the code from my compiler is identical).
>
>
Difference is, in this case, "sizeof(struct scenet)" is not relevant to "sizeof(struct scenet *)".
No, both of these need to know the size of the struct when accessing the i'th element:
....
struct scenet *childp;
struct scenet (*childa)[];
};
The only thing you can't do with x->childa is perform pointer arithmetic on the whole pointer-to-array, since the array size is zero. But doing (x->childa)[i] should be fine.
As is clear since other compilers (excluding those that slavishly copy gcc's behaviour) have no problem with it.
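For reference, a minimal sketch of the two access forms being compared. It keeps the plain element pointer inside the struct (which GCC accepts) and only uses the pointer-to-array type once the struct is complete. The typedef and function names are made up for illustration; in C the element access through a pointer-to-array is spelled (*p)[i].

```c
#include <assert.h>
#include <stddef.h>

struct vector { double x, y, z; };

/* A pointer to the (still incomplete) struct is fine as a member. */
struct scenet {
    struct vector center;
    double radius;
    struct scenet *child;      /* first element of a child array, or NULL */
};

/* After the closing brace the struct is complete, so a pointer to an
   array of unknown bound of it is an ordinary object pointer type. */
typedef struct scenet (*scenet_array_ptr)[];

/* Radius of the i'th child, reached through the element pointer. */
double child_radius_via_element_ptr(const struct scenet *node, size_t i)
{
    assert(node->child != NULL);
    return node->child[i].radius;
}

/* Same access through a pointer-to-array: dereference first, then index. */
double child_radius_via_array_ptr(scenet_array_ptr kids, size_t i)
{
    assert(kids != NULL);
    return (*kids)[i].radius;
}
```

Both functions scale the index by sizeof(struct scenet); given "struct scenet kids[2]", passing &kids to the second one needs no cast, since an array of 2 and an array of unknown bound are compatible types.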
I think it is more a case of formal definitions here...
Formally, with the parentheses and array, the size of the struct is considered relevant (even if not strictly so), but is also unknown at that point.
>
>
This seems like obscure edge case territory.
It's a 'pointer to array'; it might be uncommon in C (because of its fugly syntax), but it is hardly obscure!
In my own use, excluding function pointers, I almost never have a need to use parentheses with declarations.
>
Alas, if I could have my way, I might define a simplified subset which drops some of these sorts of edge cases (the form with parentheses would simply become disallowed), but, likely, this wouldn't amount to much.
T(*)[] is a perfectly valid type; there is no reason to exclude it from struct members.
It is unambiguous in my original language, and can also be in C.
I have a slight difference of opinion in that, if I were designing C, it would not be allowed.
The merit of C is, in a way, that it has almost just what is needed, little more, and little less.
Unlike, say, C++, which went down the rabbit hole of ever-increasing complexity. Nonetheless, it has still gained some complexities beyond the bare minimum, and still has some weak points. Such as lacking a standardized form of vector/SIMD extensions, or any way to have customizable types (though, the latter point risks getting dangerously close to C++ territory, so dunno).
Say, for a language that is:
Mostly backwards compatible with existing C code;
Allows for smaller and simpler compilers;
Uses some C# like rules to eliminate the need for checking for typedefs to parse stuff.
>
Though, one can't go entirely over to C# like behavior if one still wants to support traditional separate compilation (so one would still have a need for things like function prototypes, header files, and a traditional preprocessor).
>
But, then one would basically just end up with C but with people being confused about why things like "unsigned x;" no longer work (making it kinda moot).
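To illustrate the typedef issue mentioned above, here is a small sketch (function names are made up) where the same token sequence "T * p" is a declaration in one scope and a multiplication in another, so a C parser has to consult the symbol table to decide which.

```c
#include <assert.h>

typedef int T;

/* Here T names a type, so "T * p;" declares p as a pointer to int. */
int star_as_declaration(void)
{
    int x = 5;
    T * p;
    p = &x;
    assert(p != 0);
    return *p;
}

/* Here an inner-scope variable named T hides the typedef, so the same
   tokens "T * p" form a multiplication. */
int star_as_expression(void)
{
    int T = 6, p = 7;
    return T * p;
}
```

A C#-like rule would settle the parse from the syntax alone, which is also why forms such as "unsigned x;" become awkward under it.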
>
>
And, most people continue to swear by GCC and Clang, unconcerned with their multi MLOC codebases, and the overly long time it takes to recompile the compiler from source...
Yeah. I can choose to run my compiler from source each time it is invoked; you barely notice the difference! (It adds 70-80ms.)
This cuts no ice here however.
Partial reason BGBCC still exists:
GCC and Clang are monstrosities (huge and slow to compile);
LCC offered very little over what I had already at the time;
TinyC didn't look like a particularly attractive starting point either.
However, as it is (having expanded significantly over the past some-odd years), it can still be recompiled from source in a few seconds...
Whereas, rebuilding GCC is a good part of an hour, and LLVM+Clang somehow manages to have build times measured in multiple hours (and, the build times for Clang seem to get slower faster than computers are getting faster).
Granted, one can speed it up some by temporarily disabling one's antivirus software, but that one needs to start caring about things like disabling AV software for faster build times, in the first place, is still a problem...
Granted, my existing compiler is a bit bigger; sadly, its code footprint is more on par with Quake3, and its memory footprint generally a bit steep (well, if one wants to run it on an FPGA board with 128MB of total RAM; ideally one wants to keep the memory footprint needed to compile a moderate size program in under around 50MB or so; which is an epic fail for my compiler as it is...).
>
And, as-is, compiling stuff takes a painfully long time on a 50MHz CPU (even a moderately small program might take several minutes or more).
You can't cross-compile on a PC?
That is what I normally do, but it would be "nice" to have the option to compile stuff natively from within the FPGA soft-processor or emulator.
But, to make this more practical would need a faster and lighter weight compiler than what I have already.
Seemingly big issues:
Parsing a whole translation unit into an AST eats a lot of RAM;
Decoding stuff into the internal 3AC IR, for a whole program at a time, also eats a lot of RAM.
I had tried to look into designing a compiler with the preprocessor and parser overlaid via a linked-list "line buffer" where, the preprocessor would preprocess lines, put them in a linked list, and the parser would consume them (freeing up each line once all tokens were consumed), and then trying to drive the middle part of the compilation process one top-level declaration at a time.
This turned into more of a mess than I would have hoped.
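A rough sketch of what such a line buffer could look like, assuming a simple singly linked FIFO; the struct and function names are hypothetical and this is not the code from the abandoned attempt. The preprocessor pushes finished lines, the parser pops them and frees each one when its tokens are consumed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One preprocessed line waiting to be tokenized (hypothetical layout). */
struct pp_line {
    struct pp_line *next;
    char text[];            /* flexible array member holding the line */
};

struct line_queue {
    struct pp_line *head, *tail;
    size_t live_bytes;      /* text bytes currently buffered */
};

/* Producer side: the preprocessor appends a finished line.
   Returns 0 on success, -1 if out of memory. */
int lq_push(struct line_queue *q, const char *text)
{
    size_t n;
    struct pp_line *ln;
    assert(q != NULL && text != NULL);
    n = strlen(text) + 1;
    ln = malloc(sizeof *ln + n);
    if (!ln)
        return -1;
    memcpy(ln->text, text, n);
    ln->next = NULL;
    if (q->tail)
        q->tail->next = ln;
    else
        q->head = ln;
    q->tail = ln;
    q->live_bytes += n;
    return 0;
}

/* Consumer side: the parser takes the oldest line, or NULL if empty.
   The caller free()s the line once all of its tokens are consumed. */
struct pp_line *lq_pop(struct line_queue *q)
{
    struct pp_line *ln = q->head;
    if (!ln)
        return NULL;
    q->head = ln->next;
    if (!q->head)
        q->tail = NULL;
    q->live_bytes -= strlen(ln->text) + 1;
    ln->next = NULL;
    return ln;
}
```

The point of the design is that live_bytes stays bounded by how far the preprocessor runs ahead of the parser, rather than by the size of the whole preprocessed translation unit.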
My existing compiler runs the preprocessor first, and generates a text buffer containing the entire preprocessed output, but this can sometimes reach sizes in MB territory (mostly with all of the stuff pulled in from headers, which will often dwarf the actual code in each translation unit).
Then, the parser is left churning through large numbers of things like structs, typedefs, and function prototypes, before getting to the actual code. Parsing all these into an AST eats time and memory.
While the AST is arguably very bulky, one can at least entirely discard it after each translation unit (this is one use case for a zone allocator; where one can allocate AST related memory in an AST zone and free all of it after each translation unit). The steep up-front cost of the preprocessor output can also be reduced slightly by "chunking" the buffering, say, into multiples of 32kB or similar (as opposed to trying to "malloc()" the whole 1MB or so in a single large buffer).
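As a minimal sketch of the zone idea, assuming a bump allocator that grows in 32kB chunks and releases everything at once at the end of a translation unit (names and details are illustrative, not BGBCC's actual allocator):

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define ZONE_CHUNK_SIZE ((size_t)32 * 1024)      /* payload bytes per chunk */
#define ZONE_ALIGN      (_Alignof(max_align_t))
/* Header size rounded up so the payload stays maximally aligned. */
#define ZONE_HDR \
    ((sizeof(struct zone_chunk) + ZONE_ALIGN - 1) / ZONE_ALIGN * ZONE_ALIGN)

struct zone_chunk {
    struct zone_chunk *next;
    size_t used, cap;           /* payload bytes used / available */
};

struct zone {
    struct zone_chunk *chunks;  /* most recent chunk first */
    size_t chunk_count;
};

/* Bump-allocates size bytes from the zone; returns NULL if out of memory. */
void *zone_alloc(struct zone *z, size_t size)
{
    struct zone_chunk *c = z->chunks;
    size_t need;
    void *p;
    assert(size <= (size_t)-1 - ZONE_HDR - ZONE_ALIGN);
    need = (size + ZONE_ALIGN - 1) / ZONE_ALIGN * ZONE_ALIGN;
    if (!c || c->cap - c->used < need) {
        /* Start a new chunk; oversized requests get a chunk of their own. */
        size_t cap = need > ZONE_CHUNK_SIZE ? need : ZONE_CHUNK_SIZE;
        c = malloc(ZONE_HDR + cap);
        if (!c)
            return NULL;
        c->next = z->chunks;
        c->used = 0;
        c->cap = cap;
        z->chunks = c;
        z->chunk_count++;
    }
    p = (unsigned char *)c + ZONE_HDR + c->used;
    c->used += need;
    return p;
}

/* Frees every chunk, e.g. once a translation unit's AST is no longer needed. */
void zone_free_all(struct zone *z)
{
    struct zone_chunk *c = z->chunks;
    while (c) {
        struct zone_chunk *next = c->next;
        free(c);
        c = next;
    }
    z->chunks = NULL;
    z->chunk_count = 0;
}
```

Individual AST nodes are never freed on their own; the cost of discarding the whole AST is one free() per chunk.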
Ideally, one then wants to leave the IL in a form where the compiler doesn't need to load everything into 3AC form all at once, but my existing IL design left little choice here. It was designed in a purely linear structure with symbols managed by a sort of sliding array with an LZ compression scheme, which means effectively the bytecode needs to be decoded linearly and all at once.
Too many things that eat RAM.
Better would have been a structure where only a high-level view of the metadata needs to be decoded up-front (and then possibly in a way that allows a cache-like approach), and which similarly allows for decoding the Stack-IL into 3AC incrementally (say, when we are actually compiling the function in question).
But, it is also a question of how to pull things off in a memory-compact way without re-introducing a lot of the limitations that existed in 1980s era compilers (say, for example, the compiler having no way to know whether or not a given function is reachable within the call graph).
Say, if you decode the entire program into 3AC form all at once, it is possible to do things like walk the entire program as a graph and trace out what functions are reachable (and determine things like local vs external visibility, etc). This sort of a thing would be much less viable if one could only look at a single function at a time.
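The whole-program walk described here amounts to a graph traversal over the call graph. A minimal sketch, assuming each function already has a list of callee indices (the record layout and names are hypothetical; a real compiler would derive the callee list from the call ops in the 3AC):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-function record. */
struct func_info {
    const size_t *callees;   /* indices into the function table */
    size_t callee_count;
    bool reachable;
};

/* Marks every function reachable from funcs[root] and returns how many
   were newly marked. "stack" is caller-provided scratch space with room
   for func_count entries; each function is pushed at most once. */
size_t mark_reachable(struct func_info *funcs, size_t func_count,
                      size_t root, size_t *stack)
{
    size_t top = 0, marked = 0;
    assert(root < func_count);
    if (funcs[root].reachable)
        return 0;
    funcs[root].reachable = true;
    stack[top++] = root;
    while (top > 0) {
        const struct func_info *f = &funcs[stack[--top]];
        marked++;
        for (size_t i = 0; i < f->callee_count; i++) {
            size_t callee = f->callees[i];
            assert(callee < func_count);
            if (!funcs[callee].reachable) {
                funcs[callee].reachable = true;
                stack[top++] = callee;
            }
        }
    }
    return marked;
}
```

Anything left unmarked after walking from the entry point (and any other exported roots) can be dropped; doing this needs at least the call edges for every function to be available at once, even if the full 3AC is not.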
But, then, if one needs to burn, say, 64 bytes per 3AC operation (and one may have on average several hundred 3AC ops per function, and several thousand functions in a program), RAM cost adds up quickly.
Where, in BGBCC, generally each function would have a dense array of 3AC operator structs, and another array of "traces" which give the starting and ending index of each basic block, and some flags and similar.
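To put rough numbers on this, here is a sketch of the kind of layout being described. The field names and the exact 64-byte layout are made up for illustration and are not BGBCC's actual structs; the 3000-function and 300-op figures below are likewise just assumed stand-ins for "several thousand" and "several hundred".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-byte three-address-code operation. */
struct tac_op {
    uint16_t opcode;
    uint16_t flags;
    uint32_t type;
    uint64_t dst, src_a, src_b;   /* register/temporary/constant handles */
    uint64_t imm;
    uint8_t  pad[24];             /* room for whatever else an op carries */
};
static_assert(sizeof(struct tac_op) == 64, "sketch assumes 64-byte ops");

/* One basic block ("trace"): index range into the op array plus flags. */
struct tac_trace {
    uint32_t first_op;
    uint32_t last_op;
    uint32_t flags;
};

struct tac_func {
    struct tac_op *ops;
    size_t op_count;
    struct tac_trace *traces;
    size_t trace_count;
};

/* Bytes held by one function's op and trace arrays. */
size_t tac_func_bytes(const struct tac_func *f)
{
    return f->op_count * sizeof(struct tac_op)
         + f->trace_count * sizeof(struct tac_trace);
}

/* Rough whole-program estimate for the op arrays alone. */
uint64_t tac_program_op_bytes(uint64_t funcs, uint64_t ops_per_func)
{
    return funcs * ops_per_func * sizeof(struct tac_op);
}
```

With 3000 functions at 300 ops each, tac_program_op_bytes(3000, 300) comes to 57,600,000 bytes, which is already past a 50MB budget before counting the AST, string tables, or anything else.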
Things like 3AC nodes and string tables eating up lots of RAM.
But, the partial result of all of this is a compiler that has an impractical memory footprint for an FPGA based soft processor (and is also impractically slow).
Then again, my compiler is pretty slow even on my main PC. The amount of time it takes being similar to that taken by GCC; which is kinda dead slow if compared with MSVC. Seemingly, MSVC is somehow a very fast compiler, with Clang sort of in-between (slower than MSVC, but still faster than GCC).
Though, for actual compiled program performance, GCC tends to do pretty well, and MSVC often worse. But, for some things, the reverse is true (where the MSVC output is a lot faster than the GCC output).
...
But, as for ISA support on my processor (and supported by BGBCC), there are currently several options:
BJX2 Baseline
Original form of my custom ISA;
Primarily, it is a 32-register design, with 16/32/64/96 bit ops;
XG2:
Newer variant of my ISA;
Drops 16-bit ops, moves over to 6-bit register fields;
Natively uses 64 GPRs;
Has 32/64/96 bit encodings.
RISC-V (RV64G)
Uses 5 bit register fields, with 32 GPRs;
And, another 32 FPU registers.
The CPU supports the 16-bit "C" extension, but BGBCC does not.
With my design, the "C" ops come with a performance penalty.
I have a jumbo-prefix extension that adds 64 and 96 bit encodings.
Largely to improve performance.
It works in essentially the same way as in my own ISA,
and does similar things.
Among a few other custom extensions.
XG3:
Bit-repacked and modified version of my ISA;
Can be "crazy glued" onto RV64G to make a sort of hybrid ISA.
It implicitly "re-merges" the X and F registers,
which were split in RV64G.
But, more just that it goes back to what XG2 did...
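For the RV64G side, the 5-bit register fields can be seen directly in the standard R-type instruction layout. A small sketch (helper names are made up; the bit positions are those of the base RISC-V 32-bit encoding, and nothing here describes the XG2/XG3 encodings):

```c
#include <assert.h>
#include <stdint.h>

struct rv_rtype {
    unsigned opcode, rd, funct3, rs1, rs2, funct7;
};

/* Packs an R-type instruction; a 5-bit field can only name x0..x31. */
uint32_t rv_encode_rtype(unsigned opcode, unsigned rd, unsigned funct3,
                         unsigned rs1, unsigned rs2, unsigned funct7)
{
    assert(rd < 32 && rs1 < 32 && rs2 < 32);
    return (uint32_t)(funct7 & 0x7fu) << 25
         | (uint32_t)rs2 << 20
         | (uint32_t)rs1 << 15
         | (uint32_t)(funct3 & 0x7u) << 12
         | (uint32_t)rd << 7
         | (uint32_t)(opcode & 0x7fu);
}

/* Splits a 32-bit R-type instruction back into its fields. */
struct rv_rtype rv_decode_rtype(uint32_t insn)
{
    struct rv_rtype d;
    d.opcode = insn & 0x7f;
    d.rd     = (insn >> 7) & 0x1f;
    d.funct3 = (insn >> 12) & 0x7;
    d.rs1    = (insn >> 15) & 0x1f;
    d.rs2    = (insn >> 20) & 0x1f;
    d.funct7 = (insn >> 25) & 0x7f;
    return d;
}

/* In base RISC-V, low bits other than 11 select the 16-bit "C" space. */
int rv_in_compressed_space(uint32_t insn)
{
    return (insn & 0x3u) != 0x3u;
}
```

For example, "add x3, x1, x2" encodes as 0x002081B3. Going to 64 GPRs needs one more bit in each of the three register fields, which is the sort of thing a repacked encoding has to find room for.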
Currently, performance:
Plain RV64G is slower than both XG2 and XG3,
including when compiled with "gcc -O3"
Though, GCC is faster than BGBCC when targeting bare RV64G.
BGBCC targeting plain RV64G: Kinda sucks...
If I trick out the ISA, BGBCC is faster than GCC targeting RV64G.
Dunno what would happen if GCC could use my ISA extensions...
XG2 currently holds the speed prize...
XG3 isn't quite as fast as XG2 at present, but has promise.
In theory, XG2 and XG3 should be basically equivalent, as they are (more or less) the same ISA just with the bits shuffled around (mostly this was to allow XG3 to coexist in the same opcode space as RV64G, replacing the "C" extension's encoding space). In the process, I did slightly improve the "aesthetics" of the encoding scheme.
There are some minor differences between them, mostly related to how BGBCC is using the ISA, and the ABI (with XG3, it is using the RISC-V ABI rules).
One thing that does minorly hurt BGBCC here is that it primarily uses callee-save registers for local variables, where:
My native ABI is balanced in favor of slightly more callee save registers than scratch registers;
The RISC-V ABI has more scratch registers than callee save registers;
So, when using the RISC-V ABI, there is more register pressure (and more register spills).
Ironically, at the same time, the RISC-V ABI has fewer function argument registers vs XG2 (8 vs 16), and lacks argument spill space, which in turn contributes towards making things "slightly less efficient".
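For the RISC-V side of that comparison, the split can be tallied from the standard integer calling convention. A small sketch (the enum and function names are made up; ra is technically caller-saved but is lumped in with the special-purpose registers here, since it is not usable as a general temporary):

```c
#include <assert.h>

enum rv_reg_class {
    RV_SPECIAL,   /* zero, ra, sp, gp, tp */
    RV_ARG,       /* a0-a7: argument/return registers, caller-saved */
    RV_TEMP,      /* t0-t6: temporaries, caller-saved */
    RV_SAVED      /* s0-s11: callee-saved */
};

/* Classifies integer register x0..x31 per the standard RISC-V ABI. */
enum rv_reg_class rv_int_reg_class(unsigned x)
{
    assert(x < 32);
    if (x >= 10 && x <= 17)
        return RV_ARG;                       /* x10-x17 */
    if ((x >= 5 && x <= 7) || x >= 28)
        return RV_TEMP;                      /* x5-x7, x28-x31 */
    if (x == 8 || x == 9 || (x >= 18 && x <= 27))
        return RV_SAVED;                     /* x8, x9, x18-x27 */
    return RV_SPECIAL;                       /* x0-x4 */
}

/* Counts how many of the 32 integer registers fall in a class. */
unsigned rv_count_class(enum rv_reg_class cls)
{
    unsigned n = 0;
    for (unsigned x = 0; x < 32; x++)
        if (rv_int_reg_class(x) == cls)
            n++;
    return n;
}
```

That gives 8 argument plus 7 temporary registers (15 scratch) against 12 callee-saved ones, so a compiler that keeps locals mainly in callee-saved registers has 12 to work with in the integer file.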
Can't really "fix" the ABI for XG3 without causing binary compatibility issues with calls to/from RV64G code, which would defeat the whole point of why XG3 exists.
Having more scratch registers (and fewer callee save) is better for leaf functions, but implicitly assumes that one is spending more of their time in leaf functions (and comparably hurts performance if the program is dominated more by going up and down the call stack in non-leaf functions).
Though, arguably, one could make more use of scratch-registers within non-leaf functions if one could be strategic about where the function calls occur and when/where register spills are needed (but, my compiler is not that clever; and mostly treats the scratch-registers as off-limits for local variables within non-leaf functions).
Then again, debatably, it could all be for nothing:
My fastest case is only around 40% faster than the "gcc -O3" output (for programs like Doom and similar);
And, maybe, 40% isn't really enough to be worth the issues of a non-standard ISA variant.
But, granted, it is closer to around 500% for OpenGL (trying to build my OpenGL implementation with RV64 and GCC performs horribly). But, I kinda needed to SIMD the crap out of this (plain RV64G lacks any form of SIMD support).
...