On 7/4/2024 8:05 PM, Lawrence D'Oliveiro wrote:
> It’s called “Rust”.
If anything, I suspect it may make sense to go in a different direction:
Not to a bigger language, but to a more narrowly defined language.
Basically, to try to distill what C does well, keeping its core essence intact.
Goal would be to make it easier to get more consistent behavior across implementations, and also to make it simpler to implement (vs an actual C compiler); with a sub-goal to allow for implementing a compiler within a small memory footprint (as would be possible for K&R or C89).
Say for example:
  Integer type sizes are defined;
  Nominally, integers are:
    Two's complement;
    Little endian;
    Wrap on overflow.
  Dropped features:
    VLAs
    Multidimensional arrays (*1)
    Bitfields
    ...
  Simplified declaration syntax (*2):
    {Modifier|Attribute}* TypeName Declarator
*1: While these are not exactly rare, and can be useful, it is debatable whether they add enough to really justify their complexity and relative semantic fragility. If using pointers, one almost invariably needs to fall back to doing "arr[y*N+x]" or similar anyways, so it is arguable that it could make sense to drop them and have people always do their multidimensional indexing manually.
Note that multidimensional indexing via multiple levels of pointer indirection would not be affected by this.
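As a rough illustration of the manual indexing described in *1, here is a minimal sketch in present-day C. The names grid_index, grid_fill, GRID_W and GRID_H are made up for the example and are not part of any proposal:

```c
#include <assert.h>
#include <stddef.h>

enum { GRID_W = 4, GRID_H = 3 };   /* hypothetical dimensions */

/* Row-major index into a flat array: column x, row y, row width w. */
size_t grid_index(size_t x, size_t y, size_t w)
{
    assert(x < w);                 /* catch a column that runs past the row */
    return y * w + x;
}

/* Fill a flat GRID_W*GRID_H array so that each cell holds 10*y + x. */
void grid_fill(int *arr)
{
    size_t x, y;
    for (y = 0; y < GRID_H; y++)
        for (x = 0; x < GRID_W; x++)
            arr[grid_index(x, y, GRID_W)] = (int)(10 * y + x);
}
```

With a one-dimensional "int arr[GRID_W*GRID_H]", the element at row 1, column 2 is then just arr[1*GRID_W+2], which is the same arithmetic the compiler would otherwise generate for a two-dimensional array.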
*2: This can be used both to make parsing easier and to make parsing faster, as it can eliminate the need to look up symbols to see if they were a previously defined typedef or similar (which in C ends up being needed for pretty much every non-keyword identifier encountered here; ideally you don't want to need to do it at all in the parsing stage).
Say, integer types:
  sbyte, byte/ubyte: 8 bits
  short, ushort: 16 bits
  int, uint: 32 bits
  long, ulong: 64 bits
  intNN/uintNN: Explicit sized types, may map to the above.
Unclear:
  char: 8-bit, unsigned
    Could go either way, signed is more traditional,
    but unsigned makes more logical sense here.
  wchar: 16-bit, unsigned
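For comparison, the fixed sizes and the "wrap on overflow" and "little endian" rules can be approximated in current C roughly as below. This is only a sketch: the c_ prefixed typedef names and the helpers wrap_add32 and store_le32 are hypothetical, chosen to avoid clashing with existing keywords and system typedefs.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical aliases for the proposed type names (not standard C). */
typedef int8_t  c_sbyte;  typedef uint8_t  c_ubyte;
typedef int16_t c_short;  typedef uint16_t c_ushort;
typedef int32_t c_int;    typedef uint32_t c_uint;
typedef int64_t c_long;   typedef uint64_t c_ulong;

/* Sanity check that the aliases have the sizes the list above asks for. */
void types_self_check(void)
{
    assert(CHAR_BIT == 8);
    assert(sizeof(c_short) == 2 && sizeof(c_int) == 4 && sizeof(c_long) == 8);
}

/* Signed 32-bit add that wraps instead of invoking undefined behavior. */
c_int wrap_add32(c_int a, c_int b)
{
    /* Unsigned arithmetic is defined to wrap modulo 2^32. */
    uint32_t r = (uint32_t)a + (uint32_t)b;

    /* Map the bit pattern back to a signed value without relying on
       implementation-defined out-of-range conversions. */
    if (r <= (uint32_t)INT32_MAX)
        return (c_int)r;
    return (c_int)(r - 0x80000000u) + INT32_MIN;
}

/* Store a 32-bit value in little-endian byte order, whatever the host is. */
void store_le32(unsigned char out[4], c_uint v)
{
    out[0] = (unsigned char)(v & 0xFF);
    out[1] = (unsigned char)((v >> 8) & 0xFF);
    out[2] = (unsigned char)((v >> 16) & 0xFF);
    out[3] = (unsigned char)((v >> 24) & 0xFF);
}
```

In a language where these properties are nominal, helpers like wrap_add32 would presumably be unnecessary, since plain "a+b" would already have this behavior.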
Arrays:
Basic array types are always one dimensional;
Type[] will alias with "Type*" in most contexts.
Would likely drop C's function pointer syntax, likely in favor of, say:
typedef int fooFunc_t(); //declare a function type
fooFunc_t *fptr; //actual function pointer
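This form is already valid in standard C, so existing code could adopt it today. A small sketch, where binop_t, add_ints, mul_ints and apply are illustrative names only:

```c
#include <assert.h>
#include <stddef.h>

typedef int binop_t(int, int);    /* a function type, not a pointer type */

binop_t add_ints, mul_ints;       /* prototypes declared via the typedef */

int add_ints(int a, int b) { return a + b; }
int mul_ints(int a, int b) { return a * b; }

/* Call through an ordinary pointer to the function type. */
int apply(binop_t *op, int a, int b)
{
    assert(op != NULL);
    return op(a, b);
}
```

Here "binop_t *op" reads like any other pointer declaration, with no "(*name)(args)" declarator needed at the point of use.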
Similarly, structs may not be declared at the point of use, but only as types.
struct FooStruct {
int x, y;
}
FooStruct *fs; //pointer to FooStruct
Where declaring a struct will behave as if it had also been implicitly typedef'ed with the same name.
Struct semantics would be tweaked:
A by-value struct will behave as if it were pass-by-reference with copy-on-assignment (as is typically the case when structs are used as lvalues).
Would make some other restrictions:
Variable declarations are only allowed in the top-level block of a function, and (regardless of location) will always behave as if they were declared at the top of the function.
Initial values in a declaration may only be a constant expression or a reference to a global declaration (including for local variables).
Expressions like sizeof() and offsetof() will not (necessarily) be seen as constants (except if the value may be trivially determined). Note that these will also be only valid for type names, not for the type of an arbitrary expression.
Note that only certain expressions (such as variable assignments or function calls) will be allowed in statement context (most other expressions would not be allowed).
...
So, for example:
int Foo(int x, int y)
{
  BAD:
    int z=x/y;   //not allowed, not constant
  OK:
    int z;
    z=x/y;
    if(z>10)
    {
      BAD:
        int w;   //declaration is not allowed here
        z+=3;    //OK
        z*4;     //BAD, expression not allowed as statement
    }
}
Maybe:
Pointers may be allowed to be bounds-checked;
But, casts between pointer and integer types will be restricted.
An implementation will be allowed to disallow this.
Granted, this would disallow traditional forms of pointer tagging.
An implementation may instead provide optional intrinsics for working with pointer tagging (in place of raw casts and bit-twiddling). Though, this would mean one would either need a runtime that is aware of type-tagging, or allow for implementations which forbid pointer tagging entirely (likely requiring a fallback to other strategies, such as boxed values).
Though, in this case, requiring the runtime to be a little more clever is an easier sell than trying to deal with it in the compiler.
...
Will also add a restriction to break and continue:
They will only be valid within the body of a loop, or within an if/else block within the loop. Nearly any other construct (such as another loop or a "switch()") will entirely hide the visibility of the outer break or continue.
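As a sketch of what this restriction might mean in practice: code that currently issues a loop-level "continue" from inside a "switch()" would need to route it through flags instead. The example below is ordinary C written in that restricted style; count_until_hash and its flag variables are hypothetical, and this is only one possible way to restructure such a loop.

```c
#include <assert.h>
#include <stddef.h>

/* Counts characters before the first '#', not counting spaces. */
size_t count_until_hash(const char *s)
{
    size_t n = 0;
    int done = 0;   /* stands in for a loop-level break from inside the switch */
    int skip;       /* stands in for a loop-level continue from inside the switch */

    assert(s != NULL);
    while (*s != '\0' && !done) {
        skip = 0;
        switch (*s) {
        case '#': done = 1; break;   /* this break only leaves the switch */
        case ' ': skip = 1; break;
        default:  break;
        }
        s++;
        if (!done && !skip)
            n++;
    }
    return n;
}
```

For example, count_until_hash("ab c#de") counts 'a', 'b' and 'c' and stops at the '#'. The declarations are also kept at the top of the function with constant initializers, in line with the other restrictions above.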
...
Possible functional difference:
Will use explicit module importing rather than headers.
Modules will be parsed top-to-bottom, with the ability to see into any imported modules. Each module will only be exported once, with a logical declaration order based on a DAG walk.
Preprocessing defines/macros would not carry across module boundaries.
Modules would function in a way partway between headers and static libraries, likely being built in advance, but pulled into the compiler stage (likely with a manifest defining any types or global declarations within the module). Ideally, the goal would be to allow for implementation both with separate compilation (such as COFF or ELF objects; where likely the object code and manifest would exist separately) or with a bytecode IR (which would likely combine both into a single entity). Ideally, it should be possible to determine module dependency order without fully invoking the compiler (say, such that the logic for compiling each module, and scheduling the compilation of modules, can operate independently).
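Determining the dependency order independently of the compiler could be as simple as a depth-first walk over the import graph. A minimal sketch in C, assuming each module's imports are already known (say, from a manifest); struct module, MAX_MODS, MAX_DEPS and module_build_order are made-up names for illustration:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MODS 16
#define MAX_DEPS 8

struct module {
    const char *name;
    int ndeps;
    int deps[MAX_DEPS];    /* indices of the modules this one imports */
};

/* state: 0 = unvisited, 1 = in progress, 2 = done */
static int visit(const struct module *mods, int i,
                 int *state, int *order, int *count)
{
    int k;
    if (state[i] == 2) return 0;
    if (state[i] == 1) return -1;       /* import cycle, so not a DAG */
    state[i] = 1;
    for (k = 0; k < mods[i].ndeps; k++)
        if (visit(mods, mods[i].deps[k], state, order, count) != 0)
            return -1;
    state[i] = 2;
    order[(*count)++] = i;              /* emitted only after all its imports */
    return 0;
}

/* Writes a build order into order[0..n-1]; returns 0, or -1 on a cycle. */
int module_build_order(const struct module *mods, int n, int *order)
{
    int state[MAX_MODS] = {0};
    int count = 0, i;

    assert(n >= 0 && n <= MAX_MODS);
    for (i = 0; i < n; i++)
        if (visit(mods, i, state, order, &count) != 0)
            return -1;
    return 0;
}
```

Each module lands in the output exactly once, after everything it imports, which is the sort of ordering a build scheduler would need before handing individual modules to the compiler.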
But, admittedly, I have had good results using a stack-machine IR in my compilers for things like static libraries, so leveraging similar technology could still make sense.
Though, would want to do a few things differently from my current IR to be able to reduce memory footprint; as my current IR was designed in such a way that it is effectively necessary to load in the whole program before code-generation can be done. Ideally one would want the compiler to be able to read-in the IR as-needed and discard the parts it is already done with (but, existing efforts to redesign the IR stage here have tended to fizzle; would effectively need to redesign the compiler backend to be able to make effective use of it).
For a new compiler, could make sense to try to "do it right this time" (though, not too much of an issue if running the compiler on a modern PC, as they are "not exactly RAM constrained" in this area; so reading in and decoding the IR and symbol tables for an entire executable image at the same time, is not too much of an issue).
Though, one likely option here would be to use a similar system to "package" and "import" in Java (in strong contrast to how these keywords were used in ActionScript, which would have more in common with "namespace" and "using" in C++ and C#).
Potentially, a largish program could use something akin to a Unix-style filesystem model for managing package importing (with libraries being effectively mounted within a local VFS, potentially also allowing for "symlinks" within the package space).
Note that within the modules, it would still behave as-if it were a C style global namespace (it need not necessarily introduce any additional scoping semantics, unlike Java).
Scoping semantics could help, but would add some additional complexity to the compiler. If supported, one option would be to support these also using "namespace" and "using" keywords (likely used in a similar way to their usage in C#).
If pulled off well, such a module system could be both faster and require less memory in the compiler compared with headers (where in many programs, the amount of time and memory spent processing headers significantly exceeds that spent processing the actual code within each translation unit).
There is the typical workaround of include'ing everything into a single big translation unit (AKA: "unity builds"), but while this is fast, it can still eat a significant amount of memory in the compiler. Better might be to provide an alternative both to textual header inclusion and unity builds, while ideally allowing for fast and low-overhead compilation (reading in a binary manifest of declared types and symbols likely being a bit cheaper).
Even with unity builds, build times can still get annoying for bigger programs. And here, I am talking like 20 seconds to rebuild a 250kLOC program. Granted, even if parsing is fast, this still leaves the challenge of fast/efficient machine-code generation.
( Cough, and not like the horribly slow compile times one sees trying to rebuild LLVM... )
...
It need not be directly backwards compatible with C, but ideally should be close enough that "copy paste translation" shouldn't be too much effort (IOW: There shouldn't be any drastic changes in terms of syntax nor significant changes to commonly used features).
But, I don't know, just some idle thoughts at the moment.
Or basically, sort of like a hybridized common subset of C and a C-like subset of "unsafe" C# (IOW: no OOP nor Generics, nor a garbage collector, ...).