Re: "The provenance memory model for C", by Jens Gustedt

Subject : Re: "The provenance memory model for C", by Jens Gustedt
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.lang.c
Date : 11. Jul 2025, 20:05:38
Organization : A noiseless patient Spider
Message-ID : <104rna7$1lh8i$1@dont-email.me>
User-Agent : Mozilla Thunderbird
On 7/11/2025 3:48 AM, David Brown wrote:
On 11/07/2025 04:09, BGB wrote:
On 7/10/2025 4:34 AM, David Brown wrote:
On 10/07/2025 04:28, BGB wrote:
On 7/9/2025 4:41 AM, David Brown wrote:
On 09/07/2025 04:39, BGB wrote:
On 7/2/2025 8:10 AM, Kaz Kylheku wrote:
On 2025-07-02, Alexis <flexibeast@gmail.com> wrote:
>
...
 
>
Please don't call this "traditional behaviour" of compilers - be honest, and call it limited optimisation and dumb translation.  And don't call it "code that assumes traditional behaviour" - call it "code written by people who don't really understand the language". Code which assumes you can do "extern float x; unsigned int * p = (unsigned int *) &x;" is broken code.  It always has been, and always will be - even if it does what the programmer wanted on old or limited compilers.
>
There were compilers in the 1990's that did type-based alias analysis, and many other "modern" optimisations - I have used at least one.
>
>
Either way, MSVC mostly accepts this sorta code.
 I remember reading in an MSVC blog somewhere that they had no plans to introduce type-based alias analysis in the compiler.  The same blog article announced their advanced new optimisations that treat signed integer overflow as undefined behaviour, and explained that they'd been doing that for years in a few specific cases.  I think it is fair to assume there is a strong overlap between the programmers who think MSVC, or C and C++ in general, have two's complement wrapping of signed integers when the hardware supports it, and those who think pointer casts let you access any data.
 And despite the blog, I don't believe MSVC will be restricted that way indefinitely.  After all, they encourage the use of clang/llvm for C programming, and that does do type-based alias analysis and optimisation.
 The C world is littered with code that "used to work" or "works when optimisation is not used" because it relied on shite like this - unwarranted assumptions about limitations in compiler technology.
 
This is why "-fwrapv -fno-strict-aliasing" needs to be used so often when compiling pretty much anything non-trivial with GCC or Clang...
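Say, a minimal illustration (not from any particular codebase) of the sort of assumption "-fwrapv" exists to rescue:

  /* Under ISO C rules signed overflow is undefined, so an optimizer is
     free to fold this whole function to "return 1".  With -fwrapv,
     INT_MAX + 1 wraps to INT_MIN and the function returns 0 for
     x == INT_MAX, which is what a lot of older code expects. */
  int still_increasing(int x)
  {
      return x + 1 > x;
  }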

>
Also I think a lot of this code was originally written for compilers like Watcom C and similar.
>
>
Have noted that there are some behavioral inconsistencies, for example:
Some old code seems to assume that x<<y always shifts left, with y taken modulo the width of the type. Except, when both x and y are constant, the code seems to expect the shift as if it were calculated with a wider type, with negative shifts going in the opposite direction, ... and the result then being converted to the final type.
>
Meanwhile, IIRC, GCC and Clang raise an error if trying to do a large or negative shift. MSVC will warn if the shift is large or negative.
>
Though, in most cases, if the shift is larger than the width of the type, or negative, it is usually a programming error.
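Say, as a rough illustration (both shifts here are undefined behavior in ISO C, so actual results vary by compiler and target):

  /* Assuming 32-bit int.  On x86, a runtime shift count is masked to
     the operand width, so the first line often gives 1 << (33 & 31) == 2,
     while the constant-folded second line may instead come out as 0 (or
     as if computed in a wider type); hence the inconsistency. */
  unsigned int n  = 33;
  unsigned int r1 = 1u << n;    //runtime shift: count usually masked
  unsigned int r2 = 1u << 33;   //constant shift: folded differently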
>
>
It's okay to be conservative in a compiler (especially when high optimisation is really difficult!).  It's okay to have command-line switches or pragmas to support additional language semantics such as supporting access via any lvalue type, or giving signed integer arithmetic two's complement wrapping behaviour.  It's okay to make these the defaults.
>
But it is not okay to encourage code to make these compiler-specific assumptions without things like a pre-processor check for the specific compiler and pragmas to explicitly set the required compiler switches. It is not okay to excuse bad code as "traditional style" - that's an insult to people who have been writing good C code for decades.
>
>
A lot of the code I have seen from the 90s was written this way.
>
 Yes.  A lot of code from the 90's was written badly.  A lot of code today is written badly.  Just because a lot of code was, and still is, written that way does not stop it being bad code.
 
There is a tradeoff between "bad" code and "code that gives the best performance" (but doesn't necessarily follow some people's definitions of "good").
If the idioms for working around TBAA and similar end up costing more in terms of performance than the performance gains of TBAA, it is not a win.

>
Though, a lot of it comes from a few major sources:
   id Software;
     Can mostly be considered "standard" practice,
     along with maybe Linux kernel, ...
   Apogee Software
     Well, some of this code is kinda bad.
     This code tends to be dominated by global variables.
     Also treating array bounds as merely a suggestion.
   Raven Software
      Though, most of this was merely modified id Software code.
>
Early on, I think I also looked a fair bit at the Linux kernel, and also some of the GNU shell utilities and similar (though, the "style" was very different vs either the Linux kernel or ID code).
>
 The Linux kernel is not a C style to aspire to.  But they do at least try to make such assumptions explicit - the kernel build process makes it very clear that it requires the "-fno-strict-aliasing" flag and can only be correctly compiled by a specific range of gcc versions (and I think experimentally, icc and clang).  Low-level and systems programming is sometimes very dependent on the details of the targets, or the details of particular compilers - that's okay, as long as it is clear in the code and the build instructions.  Then the code (or part of it at least) is not written in standard C, but in gcc-specific C or some other non-standard dialect.  It is not, however, "traditional C".
 
I sort of consider it "traditional", along with things like Doom and Quake and similar.
Granted, Doom wasn't too hard to get to be 64-bit clean and to work with TBAA. For example, I have builds that work moderately correctly in WSL with Xming. There was a "gotcha" with porting the Doom code to ARM-based targets though: the ARM ABI makes plain char unsigned by default, but the Doom engine assumed 'char' was signed, so some things needed to be changed to use 'signed char' explicitly.
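Roughly, the issue was along the lines of:

  /* Plain 'char' is unsigned in the ARM ABIs, so code that stores
     negative values in it breaks there. */
  char        c  = -1;   //becomes 255 where plain char is unsigned
  signed char sc = -1;   //always -1, regardless of target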
Quake requires getting a little hacky in a few places (there are a few "problem areas" in the design of their "progs.dat" VM).
Though, not as bad as the "QVM" in "Quake 3 Arena", for which I was unsure of a good way to make it work on a 64-bit machine. Basically, it compiles C code for a 32-bit VM, which then promptly tries to share pointers directly with the rest of the Quake 3 engine.
At the time, the amount of wonk I would have needed to make QVM work on a 64-bit machine didn't seem worthwhile. Very likely it would have meant needing to add a virtual memory space and address-translation schemes, ... Though, Quake 3 allowed the fallback of going to a Quake 2-like strategy and compiling the game scripts as native DLLs / SOs.
Well, either this, or force all heap memory to be allocated in the low 4GB or similar. I don't usually like this option.

 
>
Early on, I had learned C partly by tinkering around with id's code and trying to understand what secrets it contained.
>
>
But, alas, an example from Wikipedia shows a relevant aspect of id's style:
https://en.wikipedia.org/wiki/Fast_inverse_square_root#Overview_of_the_code
>
Which is, at least to me, what I consider "traditional".
 The declaration of all the variables at the top of the function is "traditional".  The reliance on a specific format for floating point is system-dependent code (albeit one that works on a great many systems). The use of "long" for a 32-bit integer is both "traditional" /and/ system-dependent.  (Though it is possible that earlier in the code there are pre-processor checks on the size of "long".)  The use of signed integer types for bit manipulation is somewhere between "traditional" and "wrong".  The use of pointer casts instead of a type-punning union is wrong.  The lack of documentation and comments, use of an unexplained magic number, and failure to document or comment the range for which the algorithm works and its accuracy limitations are also very traditional - a programming tradition that remains strong today.
 It is worth remembering that game code (especially commercial game code) is seldom written with a view to portability, standards correctness, or future maintainability.  It is written to be as fast as possible using the compiler chosen at the time, to be built and released as a binary in the shortest possible time-to-market.
 
It is a lot of the code I have seen, or messed with porting to my ISA.
So, a lot of BGBCC's design priorities were partly motivated by making this sort of code-porting relatively low hassle.
Though, my usual porting process was, roughly:
   Port code to Win64 via MSVC (main initial hassle, 64-bit stuff);
   Port code to my ISA (BGBCC);
   Port code to modern GCC (often more of a pain).
     Though, "-fwrapv -fno-strict-aliasing" makes it easier.
     Next stage is often to make it work without these.
Though, in some amount of the code for my own ISA, the use of direct pointer casting is the main option (though, often using 'volatile').
In other cases, the use of "memcpy()" is more of an "#ifdef __GNUC__" thing (partly as both GCC and Clang use this ifdef).
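For the inverse-square-root example above, the memcpy form would look roughly like this (a sketch, assuming 32-bit float and <stdint.h>):

  #include <stdint.h>
  #include <string.h>

  float rsqrt_approx(float x)
  {
      uint32_t i;
      float y = x;
      memcpy(&i, &y, sizeof(i));          //reinterpret the bits, no aliasing issue
      i = 0x5f3759df - (i >> 1);          //the magic-number step from the article
      memcpy(&y, &i, sizeof(y));
      y = y * (1.5f - 0.5f * x * y * y);  //one Newton-Raphson refinement
      return y;
  }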

>
So:
   memcpy(&i, &f, 8);
Will still use memory ops and wreck the performance of both the i and f variables.
>
Well, there you have scope for some useful optimisations (more useful than type-based alias analysis).  memcpy does not need to use memory accesses unless real memory accesses are actually needed to give the observable effects specified in the C standards.
>
>
Possibly, but by the stage we know that it could be turned into a reg-reg move (in the final code generation), most of the damage has already been done.
>
Basically, it would likely be necessary to detect and special-case this scenario at the AST level (probably by turning it into a cast or intrinsic). But, usually one doesn't want to add too much of this sort of cruft to the AST walk.
>
 One thing to remember is that functions like "memcpy" don't have to be treated as normal functions.  You can handle it as a keyword in your compiler if that's easiest.  You can declare it as a macro in your <string.h>.  You can combine these, and have compiler-specific extensions (keywords, attributes, whatever) and have the declaration as a function with attributes.  Your key aim is to spot cases where there is a small compile-time constant on the size of the memcpy.
 
I guess it is possible.
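Something along these lines at the header level could probably work in principle (just a sketch; __bgbcc_memcpy_small is a made-up intrinsic name, and it assumes a GCC-style __builtin_constant_p is available):

  #include <stddef.h>

  void *__bgbcc_memcpy_small(void *dst, const void *src, size_t n);  //hypothetical intrinsic
  void *memcpy(void *dst, const void *src, size_t n);                //normal library function

  //Route small constant-size copies to the intrinsic, which the code
  //generator can lower to plain moves; everything else stays a call.
  //Re-evaluating n in the fast path is harmless only because that path
  //requires n to be a side-effect-free compile-time constant.
  #define memcpy(dst, src, n)                        \
      (__builtin_constant_p(n) && (n) <= 16          \
          ? __bgbcc_memcpy_small((dst), (src), (n))  \
          : (memcpy)((dst), (src), (n)))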

>
But, then, apart from code written to assume GCC or similar, most of the code doesn't use memcpy in this way.
>
So, it would mostly only bring significant advantage if pulling code in from GCC land.
 How well do you handle type-punning unions?  Do they need to be moved out to the stack, or can they be handled in registers?
 
Generally, yes.
BGBCC's handling of local structs and unions is along the lines of:
   Creates an internal hidden pointer;
   Reserve space in the frame;
   Initialize pointer to address in frame;
   Struct operations go through this pointer.
So, basically, it is sort of like:
   Foo foo;
Became:
   char t_foo[sizeof(Foo)];
   Foo *foo = ((Foo *)&t_foo);
A similar use of hidden internal pointers also applies to arrays.
Note that these pointers do not exist within structs or similar (so inline structs or arrays are stored as in typical ABIs), but rather merely exist within the local frame (which basically always handles structures and arrays via pointers).
So, from the compilers' POV, "foo.x" and "foo->x" are basically equivalent, and BGBCC doesn't actually bother to distinguish them (occasionally pops up as an issue when porting code from BGBCC to other compilers, as sometimes the wrong operator ends up being used, and the other compilers actually care about the distinction).
Union is basically a struct where all members are at offset 0.
So, type-punning via a union will store to memory and load back from memory.
However, since the union doesn't require taking the address of the variables (which adds additional penalty), it would still be preferable in terms of performance to the "memcpy()" option.
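So, roughly, the union form ends up as something like:

  //Union-based punning: a frame slot gets reserved for the union, the
  //float is stored into it, and the bits are read back as an unsigned int.
  union pun32 { float f; unsigned int u; };

  unsigned int f_to_u_union(float f)
  {
      union pun32 p;
      p.f = f;
      return p.u;
  }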
Note that for struct assignment, or returning a struct, it basically behaves as if a "memcpy()" call were used at that location.
In the RISC-V based targets, BGBCC does differ slightly from the official ABI in the handling of struct passing and return by value.
Rule is basically always:
   1-8 bytes: Register
   9-16 bytes: Register pair
   17+: Reference
For struct return, it passes the destination address in X28 (if it can't fit into a register pair, X11:X10).
The official RV ABI involves passing and returning structs via on-stack copy (inline to the argument list), and return by using the argument space for the returned struct.
But, BGBCC does this more like in the Win64 ABI.
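So, say (sizes assuming a typical 64-bit target with 8-byte 'long'):

  struct S8  { long a;       };   // 8 bytes: single register
  struct S16 { long a, b;    };   //16 bytes: register pair
  struct S24 { long a, b, c; };   //17+ bytes: passed by reference

  struct S16 make_s16(long a, long b)
  {
      struct S16 s = { a, b };
      return s;                    //returned in a register pair
  }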
Also, technically, it is using a variant of the RV64 Soft-Float ABI, as even though I am using an ISA with FPU registers, the Hard-Float ABI is worse for my uses.
It is actually more tempting to instead use F10..F17 for more arguments, say:
   X10..X17: Arguments 1-8
   F10..F17: Arguments 9-16
Vs. spilling any arguments past that onto the stack (where 16 arguments allows full register coverage of generally around 99.95% of all functions).
Where, one difference is "automatic COM-Interface thunks";
If one does the ABI like I have done, the thunks are a lot easier.
But, if one tries to do it like in the proper RV64 Hard-Float ABI, automatic COM thunks are going to be a pain (the thunks need to be generated one-off and the generator needs to actually know the argument list for each method, ...).
Also doing it my way makes the varargs mechanism easier as well, ...
Note though that within the local frame, if it is a struct or union, it is still generally always handled as a reference. But, register or register pair is still generally preferable to a memory-memory copy for small structs for performance reasons.
A gray area is SIMD vector types, but I can note that BGBCC does not treat vectors like structs. Instead, the SIMD vectors are purely value types (or, rvalue).
This is also one significant area, say, where "__m64" and "__m128" differ between BGBCC and MSVC, where in BGBCC they are value-like (rvalue), whereas in MSVC they are more struct-like (lvalue).
So, for example, in BGBCC one can do:
   __vec4f v0, v1;
   v1=v0.zyxw;
Or:
   v1=(__vec4f){v0.z,v0.y,v0.x,v0.w};
But, not:
   v1.x=v0.z;  v1.y=v0.y;
   v1.z=v0.x;  v1.w=v0.w;
...
The latter could possibly be faked, but:
   v1.x=v0.z;
Would be essentially equivalent to:
   v1=(__vec4f){v0.z,v1.y,v1.z,v1.w};
...
Some of this is partly because BGBCC essentially has a set of "core types" that divide up all other types:
   I: Int32 and smaller;
   L: Int64 and related;
   D: "double" and smaller floating-point types;
   A: Anything pointer-like (includes structs and arrays);
   X: Anything 128-bit goes here.
It was originally ILFDAX, but F got subsumed into D.
Can note that ILFDA is basically the same as the core type model used by the Java VM, and static-typed variants of the BGBScript VM (which had moved to a model of being JVM-like internally, but using a JavaScript/ActionScript style language on the surface).
But, can note that some of my early compiler and VM design stuff was influenced by the design of the JVM.
But, here, the use of a simplified set of "core types" can simplify the general design of the compiler and typesystem, since any sub-types of a given core type can mostly be handled with similar rules.
Can also note that the type-signature notation was also partly influenced by both the JVM and C++ name-mangling.
So, say:
   int          i0;  //"i"
   unsigned int i1;  //"j"
   double       f0;  //"d"
   int        *pi0;  //"Pi"
   int (*f)(long,float);  //"(lf)i"
   ...

>
unsigned int f_to_u(float f) {
     unsigned int u;
     memcpy(&u, &f, sizeof(f));
     return u;
}
>
gcc compiles that to :
>
f_to_u:
     movd eax, xmm0
     ret
>
>
Yeah, it is more clever here, granted.
>
Meanwhile:
   i=*(uint64_t *)(&f);
Will only wreck the performance of 'f'.
>
>
The best option for performance in BGBCC is one of either:
   i=__float64_getbits(f);  //compiler intrinsic
   i=(__m64)f;              //__m64 and __m128 do a raw-bits cast.
>
Though, these options don't exist in the other compilers.
>
Such compiler extensions can definitely be useful, but it's even better if a compiler can optimise standard code - that way, programmers can write code that works correctly on any compiler and is efficient on the compilers that they are most interested in.
>
>
Possibly.
>
For "semi-portable" code, usually used MSVC style, partly as by adding 'volatile' it seemingly also works in GCC. Though, often with macro wrappers.
 Code that has to be widely portable, with an aim to being efficient on many compilers and correct on all, always ends up with macro wrappers for this kind of thing, defined conditionally according to compiler detection.
 
Generally, yes.
The use of macro wrappers often ends up as a "necessary evil".

>
>
>
Implicitly, casting via __m64 or __m128 is a double-cast though. In BGBCC, these types don't natively support any operators (so, they are basically sort of like the value-equivalents of "void *").
>
>
So:
   memcpy(&i, &f, 8);      //best for GCC and Clang
   i=*(uint64_t *)(&f);   //best for MSVC, error-prone in GCC
   i=(__m64)f;             //best for BGBCC, N/A for MSVC or GCC
>
In a lot of cases, these end up with wrappers.
>
GCC:
   static inline uint64_t getU64(void *ptr)
   {
     uint64_t v;
     memcpy(&v, ptr, 8);
     return(v);
   }
MSVC or BGBCC:
   #define getU64(ptr)  (*((volatile uint64_t *)(ptr)))
>
Though, have noted that volatile usually works in GCC as well, though in GCC there is no obvious performance difference between volatile and memcpy, whereas in MSVC the use of a volatile cast is faster.
>
In gcc, a memcpy here will need to use a single memory read unless "getU64" is called with the address of a variable that is already in a register (in which case you get a single register move instruction). A volatile read will also do a single memory read - but it might hinder other optimisations by limiting the movement of code around.
>
>
Possibly.
>
When I tried benchmarking these before:
   GCC:
     Seemingly no difference between memcpy and volatile;
 As I explained, that is to be expected in cases where you can't get other optimisations that "volatile" would block.  Usually simple timing benchmarks have fewer optimisation opportunities than real code.
 
   MSVC:
     Adding or removing volatile made no real difference;
 That will, of course, depend on the benchmark.  A volatile access will not normally take more time than a non-volatile access.  But non-volatile accesses can be re-ordered, combined, or omitted in ways that volatile accesses cannot.
 
Yeah, pretty much.

     Using memcpy is slower.
 As I explained.
 
   BGBCC: Either memcpy or volatile carries an overhead.
     The use of volatile is basically a shotgun de-optimization;
     It doesn't know what to de-optimize, so it goes naive for everything.
>
 Okay.
 
Other compilers might be more clever and be like, "OK, we only need to de-optimize this particular memory reference".
BGBCC is not that clever.
Though, generally the impact of, say:
   i=*(volatile uint32_t *)ptr;
Will be smaller than:
   volatile uint32_t *vptr;
   ...
   vptr=ptr;
   ...
   i=*vptr;
Since here BGBCC will be naive about everything having to do with vptr, rather than just the memory load. And, at the lower levels, it doesn't track which specific accesses are volatile, so effectively needs to disable load/store reordering for the whole basic-block.
Though, mostly applies to my XG1 and XG2 ISAs; for RISC-V and XG3 targets, the relevant instruction shuffling logic is not currently implemented (or, technically, at present this penalty always is in effect for these targets).
The idea being to try to reorder instructions to try to fit better into an in-order CPU pipeline. Though, it seems the in-order superscalar still does semi-OK even with naive instruction ordering.
Where, XG3 is basically my ISA shoehorned into the same encoding space as RISC-V, but replacing the 16-bit "compressed" instructions. The compiler then treats it like it was compiling for RISC-V, but has the ability to do various things that don't exist in normal RISC-V (producing a hybrid instruction stream).
Despite being kind of an ugly hack, XG3 does seem promising for my uses though.

>
On MSVC, last I saw (which is a long time ago), any use of "memcpy" will be done using an external library function (in a DLL) for generic memcpy() use - clearly that will have /massive/ overhead in comparison to the single memory read needed for a volatile access.
>
>
It is slightly more clever now, but still not great.
   Will not (always) generate a library call.
   Though, in VS2008 or similar, was always still a library call.
      VS2010 and VS2013 IIRC might set up and use "REP MOVSB" instead.
>
It will do it inline, but still often:
   Spill variables;
   Load addresses;
   Load from source;
   Store to destination;
   Load value from destination.
>
What BGBCC gives here is basically similar.
>
>
>
>
Don't want to use static inline functions in BGBCC though, as it still doesn't support inline functions in the general case.
>
>
>
 
