On 5/29/2026 6:22 AM, David Brown wrote:
On 29/05/2026 12:20, BGB wrote:
On 5/29/2026 2:52 AM, Janis Papanagnou wrote:
On 2026-05-28 11:57, BGB wrote:
On 5/28/2026 2:18 AM, Janis Papanagnou wrote:
On 2026-05-28 01:49, BGB wrote:
[...]
>
But, not really an "easy" way to avoid bloat, other than to write code specifically for what cases are relevant; while also avoiding needless duplication and copy paste (where, overuse of copy/paste can also lead to bloat; along with turning the code into an ugly mess).
>
Hmm.. - as said, during the very early days there were issues; I
recall on one platform duplication of template code in more than
one source unit. And/or some environmental hacks (of the compiler)
to deposit template code for linking. In the later days I've not
seen such immature things anymore.
>
>
Possibly, a lot could depend on how one is counting things as well.
>
>
In a lot of cases when using GCC, I end up using:
-ffunction-sections -fdata-sections -Wl,-gc-sections
On many targets, "-fdata-sections" can lead to noticeably larger and slower code because it effectively eliminates section anchor optimisations. It does not negatively affect x86 AFAICS, because x86 does not use section anchors.
<https://godbolt.org/z/zeoq41Y7d>
With -fsection-anchors (enabled with optimisation on targets that support it - generally RISCy load/store architectures), program-lifetime variables are kept together in a lump (as though they were in a struct) and often addressed by a pointer to that pretend struct. Thus if a function accesses two variables "a" and "b", instead of having to load the addresses of each of "a" and "b" into separate registers, it loads an "anchor" into one register and accesses the variables with reg+offset addressing.
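A minimal sketch of the kind of code this affects, assuming two ordinary globals named "a" and "b" as in the description above. Whether the compiler actually uses one anchor register depends on the target and flags, so this is best checked by comparing the generated assembly with and without "-fdata-sections":

```c
/* Two program-lifetime variables that the same functions touch together. */
int a = 1;
int b = 2;

/* With section anchors, a load/store target can load one anchor address
   and reach both variables as anchor+offset. With -fdata-sections each
   variable gets its own section, so each address is loaded separately. */
int sum_ab(void)
{
    return a + b;
}

void bump_ab(void)
{
    a += 1;
    b += 2;
}
```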
I've seen "-fdata-sections" used regularly in embedded systems - it is almost always a bad idea.
("-ffunction-sections" is often very helpful to reduce code image size, so keep that one.)
Both seem to help on x86, x86-64, and also on RISC-V, at making GCC's output at least sorta space-comparable to my own compilers.
The merit of "-fdata-sections" is mostly that it eliminates unused global variables; whereas "-ffunction-sections" eliminates unreachable functions.
Neither is needed with my own compiler, which compiles things in a way such that it eliminates anything that is unreachable.
Both posed an issue initially when porting ROTT, because in some cases it relied on the ability to go out-of-bounds for one array to access data in another array. I ended up reworking some of these cases though to use a single larger array.
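As a rough illustration of that kind of rework (the table names and contents here are made up, not taken from ROTT): one backing array replaces two separately declared arrays, so an index that runs past the first table lands in the second by construction rather than by accident of layout.

```c
#include <stddef.h>

/* Hypothetical sizes and contents, for illustration only. */
enum { TABLE_A_LEN = 4, TABLE_B_LEN = 4 };

/* Single larger array holding both tables back to back. */
static int tables[TABLE_A_LEN + TABLE_B_LEN] = {
    0, 1, 2, 3,        /* former table A */
    10, 11, 12, 13     /* former table B */
};

static int *const table_a = &tables[0];
static int *const table_b = &tables[TABLE_A_LEN];

/* Valid for i < TABLE_A_LEN + TABLE_B_LEN, since both tables share storage. */
int lookup_from_a(size_t i)
{
    return table_a[i];
}

/* Valid for i < TABLE_B_LEN. */
int lookup_from_b(size_t i)
{
    return table_b[i];
}
```

This keeps working regardless of how the linker places or discards sections, since there is only one object involved.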
Have noted though that GCC targeting RISC-V still tends to produce fairly large binaries even with "-Os". Its code for the basic subset (RV64G) does tend to be a little faster than what BGBCC generates, but also a fair bit more bulky. Though, the final ELF file ends up bigger still, as a significant chunk of the file ends up needing to hold ELF related metadata (comparably, PE/COFF can end up much leaner here).
Though, on the other side, with modern MSVC, despite the relative leanness of the PE/COFF format, MSVC tends to produce binaries with much larger ".text" sections.
This issue was a lot less with VS2008 though, which tended to generate less-bloated binaries (with code-size more competitive with GCC).
Also in modern MSVC, there is little distinction between "/O1" and "/Os", both being more space-efficient than "/O2" (though, "/O2" is usually faster, but also more prone to misguided attempts at auto-vectorization).
>
Because otherwise it likes wasting code space by retaining unreachable functions.
>
Using "static inline" functions also carries a risk because they can end up duplicated across multiple translation units, or in multiple places within the same translation unit, so is best used sparingly.
>
Usually you would only use static inline functions for small functions in headers, where they are a better choice than function-like macros. In a C file, there is rarely much point in declaring a function "inline" - optimising compilers will inline or not as they see fit, without regard for "inline". "static" on its own is, of course, always a good idea for functions or data that is not "exported" by the current translation unit, and will often make generated code smaller.
How much or how little duplication of code there will be within one translation unit will depend on compiler settings and the rest of the code, and not on whether or not you use "inline".
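A small sketch of the distinction, with made-up helper names. The macro evaluates its argument twice; the "static inline" function is the header-friendly replacement; the plain "static" function is what one would typically write inside a C file:

```c
/* Function-like macro: the argument expression is expanded twice. */
#define SQUARE_MACRO(x) ((x) * (x))

/* Header-style helper: argument evaluated once, with a real type. */
static inline int square_fn(int x)
{
    return x * x;
}

/* Inside a .c file, plain "static" is enough; the compiler decides
   on its own whether to inline this. */
static int clamp_to_byte(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return v;
}

int scaled(int v)
{
    return clamp_to_byte(square_fn(v));
}
```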
OK.
But, yeah, small functions are usually better than macros in at least that the compiler can avoid duplicating them (or maybe merge them between translation units when it notices that the contents are identical).
>
>
As for assembler:
Main reasons not to use assembler for everything:
Needlessly verbose;
Non-portable.
>
However, often one can still end up writing C code that looks like assembler sometimes, as this is often an effective way to optimize things.
>
Say, for example:
v0=cs[0];
v2=cs[2];
v1=cs[1];
v3=cs[3];
ct[0]=v0;
ct[2]=v2;
ct[1]=v1;
ct[3]=v3;
Vs:
ct[0]=cs[0];
ct[1]=cs[1];
ct[2]=cs[2];
ct[3]=cs[3];
>
Because the extra variables can help sidestep latency from the load instructions, and staggering stores can avoid penalties of two adjacent stores to the same cache-line in some cache architectures. Where, in the latter case, the compiler may fail to as effectively avoid the load-latency or realize the need to stagger the stores for best performance, ...
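For reference, the two variants written out as complete functions (the function names and the element type are assumptions; the post does not say what "cs" and "ct" point to). Whether the first form is any faster depends on the compiler and CPU, so comparing the generated code is the only real way to tell:

```c
#include <stdint.h>

/* Loads first into separate temporaries, then staggered stores. */
void copy4_staggered(uint32_t *ct, const uint32_t *cs)
{
    uint32_t v0, v1, v2, v3;

    v0 = cs[0];
    v2 = cs[2];
    v1 = cs[1];
    v3 = cs[3];
    ct[0] = v0;
    ct[2] = v2;
    ct[1] = v1;
    ct[3] = v3;
}

/* Straightforward element-by-element copy. */
void copy4_plain(uint32_t *ct, const uint32_t *cs)
{
    ct[0] = cs[0];
    ct[1] = cs[1];
    ct[2] = cs[2];
    ct[3] = cs[3];
}
```

Note that without "restrict" (or some other proof of non-aliasing), the compiler has to assume "ct" and "cs" may overlap in the second form, which limits how far it can move the loads ahead of the stores.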
That might be the case for a very simplistic compiler. With an optimising compiler, these extra variables will quickly be eliminated. If the compiler has a good scheduling model of the device, it will do whatever instruction scheduling works best for that processor. If the model is not good enough, it will be suboptimal. I would not, however, expect any difference in the generated code for the two code snippets.
Sometimes this kind of "manual optimisation" is helpful when you have to try to get efficient results from a weak compiler, however.
Possibly, but this sort of thing can help with both BGBCC and with MSVC IME.
While BGBCC does shuffle instructions to reorder them, it may fail to do so in some cases:
If the instructions end up mapped to the same CPU register;
If its heuristics can't prove non-alias.
Though, in the simple example given, it could (probably) turn the latter into the former, but "better" to write code such that things are in closer to the optimal order by default.
Note that using different variables with overlapping scopes reduces the likelihood of the compiler assigning both to the same register, which is a much more real risk if relying on implicit temporaries (whose lifetime only exists within a single expression).
But, in my case, a lot of this comes down to trying to tweak the compilers' internal register allocation heuristics for best results (and the tight balance between how many registers to save/restore for the function, vs avoiding assigning short-lived temporaries to the same register too quickly and hindering the instruction-scheduling).
Arguably, could make sense to instead do the reordering at the 3AC level, rather than reordering at the level of ISA instructions, but this is just sorta how I ended up doing things (and one can know the effective timing latency of a CPU instruction a lot more easily than a 3AC op).
>
>
Usual strategy is to try to limit how much code is written, and also to avoid doing things in ways that result in too much code, or too much cruft.
>
Best to avoid both copy paste when reasonable, and sticking anything non-trivial in macros.
>
We avoided macros if possible.
>
>
They are de-facto for constants and similar, but for longer stuff is better avoided.
Macros are rarely the best way to define constants. They are needed if you are using the constants for pre-processor stuff like conditional compilation. But generally you get clearer code, better typing, and potentially several other benefits from using alternative choices like "enum" (even for stand-alone integer constants), "static const" variables, and in C23, "constexpr" variables. There's no doubt that a lot of code /does/ use macros for constants, but I view it as a relic of the past rather than good coding practice.
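A short sketch of those alternatives side by side (all names here are invented for the example). The macro is kept only where the preprocessor needs to see the value; C23 "constexpr" is left out so the code also builds with older compilers:

```c
#include <stddef.h>

/* Macro: still required when the value drives conditional compilation. */
#define FEATURE_LEVEL 2

/* enum: an integer constant expression, usable as an array size. */
enum { BUF_SIZE = 64 };

/* static const: typed, scoped, visible in a debugger. Before C23 this is
   not a constant expression, so it cannot size a file-scope array. */
static const double GAIN = 0.5;

static unsigned char buffer[BUF_SIZE];

double apply_gain(double x)
{
#if FEATURE_LEVEL >= 2
    return x * GAIN;
#else
    return x;
#endif
}

size_t buffer_size(void)
{
    return sizeof buffer;
}
```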
They are traditional...
Like:
static const double M_PI = 3.14159265358979;
Could also make sense, but people don't usually do this, they usually use macros...
In BGBCC, both can be handled as constants, just they end up being handled at different stages:
#define: Constant ends up inlined in the preprocessor/parse stage;
const: Constant shows up in the "reducer" (which evaluates constant expressions).
Where, as noted, BGBCC's pipeline looks kinda like:
Toplevel:
Ingest each named source file;
Then, in the C case, per translation unit:
Preprocess;
Parse;
Frontend Compile + Reduce;
This does an AST walk, but at each stage,
invokes the reducer to see if it can perform AST level rewrites;
Reducer can also implement some edge-case features.
So, is mostly necessary, vs an optional optimization thing.
Emits output as a Stack IL.
May be output to a file, or used as input to next stage.
The stack IL partly resembles a mix of JVM and .NET bytecode.
The IL ops themselves operate more like in .NET bytecode.
This serves the role of static libraries and object files.
For a static library, all the stack IL gets blobbed together.
So, every translation unit ends up effectively appended on.
Middle Stage (processes IL Blobs):
Processes Stack IL, translates to 3AC (loosely SSA form);
Builds a big table of all global declarations, etc.
Backend:
Walks call-graph to determine dependencies;
Unreachable functions/globals/etc are marked as culled.
Ranks all the functions and variables by priority;
Sorts them into roughly priority order;
Then does shuffling to try to density-optimize globals;
Swaps globals when doing so would allow more memory density.
May also apply random shuffling and clustering heuristics.
Then, compiles each function:
Figure out stack-frame layout,
how many registers to reserve,
etc.
Emit machine code for 3AC ops;
Try to shuffle instructions to improve instruction scheduling;
Or, if a variable:
Figure out whether it goes in ".data" or ".bss"
If initialized, deal with initialization stuff;
...
Or, an ASM Blob:
Assemble it.
Or, Ingests contents that go into ".rsrc" section;
May involve image and audio converters, etc;
BGBCC uses different resource sections from Windows though.
...
Output:
Gets its input as a set of sections, symbols, and relocs;
Figures out layout within the output image (eg, PE/COFF);
Figures out how much space it needs for base relocs, etc.
Builds up a table of "initial base-relocs"
Splats the sections into the image buffer;
Applies relevant relocs;
Sorts base relocs by RVA;
Generates actual ".reloc" section contents.
Fill in PE/COFF headers and similar;
If applicable, LZ4 compress the image.
I tend to store EXE's in LZ4 compressed form,
the image is decompressed during load.
This format leaves the initial PE/COFF headers uncompressed.
Need the headers to figure out where to load the image.
Else would need a temporary buffer to decompress into.
Typical loader process:
Look at headers;
Figure out where to load to, etc;
Read in (or decompress) image contents;
Apply base relocs;
Pull in any DLLs, etc;
Go.
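The "apply base relocs" step can be sketched roughly as below. This assumes the standard PE/COFF ".reloc" layout (blocks of a 32-bit page RVA and 32-bit block size, followed by 16-bit entries with the type in the top 4 bits and the page offset in the low 12) and a little-endian host; it is not taken from the actual loader, and the function name is made up:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Relocation types as defined by the PE/COFF format. */
enum { REL_ABSOLUTE = 0, REL_HIGHLOW = 3, REL_DIR64 = 10 };

/* Unaligned reads in host byte order (assumed little-endian here). */
static uint32_t rd32(const uint8_t *p) { uint32_t v; memcpy(&v, p, 4); return v; }
static uint16_t rd16(const uint8_t *p) { uint16_t v; memcpy(&v, p, 2); return v; }

/* Adds delta (actual load address minus preferred base) to every relocated
   field. Returns 0 on success, -1 on malformed or unsupported input. */
int apply_base_relocs(uint8_t *image, size_t image_size,
                      const uint8_t *reloc, size_t reloc_size,
                      int64_t delta)
{
    size_t pos = 0;

    while (pos + 8 <= reloc_size) {
        uint32_t page_rva = rd32(reloc + pos);
        uint32_t block_size = rd32(reloc + pos + 4);

        if (block_size < 8 || block_size > reloc_size - pos)
            return -1;

        for (size_t i = 8; i + 2 <= block_size; i += 2) {
            uint16_t entry = rd16(reloc + pos + i);
            unsigned type = entry >> 12;
            size_t at = (size_t)page_rva + (entry & 0x0FFFu);

            if (type == REL_ABSOLUTE) {
                continue;               /* padding entry, nothing to patch */
            } else if (type == REL_HIGHLOW) {
                uint32_t v;
                if (at > image_size || image_size - at < 4)
                    return -1;
                v = rd32(image + at) + (uint32_t)delta;
                memcpy(image + at, &v, 4);
            } else if (type == REL_DIR64) {
                uint64_t v;
                if (at > image_size || image_size - at < 8)
                    return -1;
                memcpy(&v, image + at, 8);
                v += (uint64_t)delta;
                memcpy(image + at, &v, 8);
            } else {
                return -1;              /* type not handled in this sketch */
            }
        }
        pos += block_size;
    }
    return 0;
}
```

If the image loads at its preferred base, delta is zero and the whole step can be skipped.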
The LZ4 compression is mostly because:
Loader is often IO bound;
May save memory in some cases;
LZ4 decompression is faster than more IO;
It also seems to be effective against program binaries (*).
*: I have my RP2 format, which generally does better for general purpose data compression, but slightly worse for compressing program binaries, so LZ4 has mostly won here. Also generally don't want a "stronger" compressor, like Deflate, both because an Inflater is a much bigger chunk of code, and also much slower than LZ4.
Can note that BGBCC also mostly takes over the role of the "resource compiler" as well, so can process resources. These are generally listed as a text file of entries to import, giving an internal "lumpname", external filename, and a tag to specify which file conversions to apply.
I am using a very different resource section type than Windows though, in that I just sorta replaced it with a modified version of the Quake WAD2 format (not to be confused with PAK, where PAK serves a different role). Note that the WAD2 directory in this case uses RVA's and not WAD-file offsets (so, effectively, it is integrated into the PE/COFF image, not just a WAD file that was shoved in).
Generally, one can access lumps from C land with declarations like:
extern unsigned char __rsrc_lumpname[];
Typically, formats used internally are things like BMP and WAV. Though, when using BMP, it is typically 16 color or 256 color to avoid wasting space. Sometimes monochrome or 4 color. One downside of BMP is that for a full 256-color palette it needs 1K of memory just for the palette, tempting to consider a non-standard variant that uses RGB555 for the palette (thus reducing it to 512 bytes). For small images it is often smaller to store them as 16-bit hi-color to avoid the space penalty of the color palette.
There are already non-standard BMP variants though, like BMP with LZ compression. A lot depends on what is needed for a particular use-case.
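The palette arithmetic can be sketched as follows. The function names are made up, and the size helpers ignore BMP headers and row padding, so they only show the rough trade-off between an 8-bit indexed image and 16-bit hi-color:

```c
#include <stdint.h>

/* Pack 8-bit R, G, B into RGB555 by dropping the low 3 bits of each. */
uint16_t rgb888_to_rgb555(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 10) | ((g >> 3) << 5) | (b >> 3));
}

/* Palette size: standard BMP uses 4 bytes per entry, RGB555 would use 2. */
unsigned palette_bytes(unsigned entries, unsigned bytes_per_entry)
{
    return entries * bytes_per_entry;
}

/* Approximate payload of a 256-color image: full palette plus 1 byte/pixel. */
unsigned indexed8_bytes(unsigned w, unsigned h, unsigned bytes_per_entry)
{
    return palette_bytes(256, bytes_per_entry) + w * h;
}

/* Approximate payload of a 16-bit hi-color image: 2 bytes/pixel, no palette. */
unsigned hicolor16_bytes(unsigned w, unsigned h)
{
    return w * h * 2u;
}
```

For a 16x16 image this gives 1280 bytes indexed (with a standard palette) against 512 bytes as hi-color, which is the small-image case mentioned above.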
For WAV typical formats are 2 or 4 bit ADPCM at 8/11/16 kHz.
2 bit ADPCM: 16/22/32 kbps.
4 bit ADPCM: 32/44/64 kbps.
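These figures are just bits per sample times sample rate, assuming mono audio, treating 11 kHz as a round number, and ignoring any per-block header overhead:

```c
/* Raw ADPCM bitrate in kbps (mono): bits per sample times rate in kHz. */
unsigned adpcm_kbps(unsigned bits_per_sample, unsigned sample_rate_khz)
{
    return bits_per_sample * sample_rate_khz;
}
```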
Have found encoder-side tricks to make ADPCM more compressible with LZ4.
Basically, it tries to do a reverse LZ search when encoding and encodes audio following patterns when the pattern would be a "close enough" match.
Had also experimented before with using some trickery involving FIR filters and lookup tables to improve perceptual quality of 8kHz/2b ADPCM to try to make it sound "less like total crap". But, this requires additional metadata and a more complex process to decode (and to get best results with this will result in worse audio quality if the audio is just naively decoded as 8kHz/2b ADPCM without the filters).
But, yeah, with these tricks can reduce the effective bitrate (when LZ4 compressed) down to around 8-12 kbps. Note that while entropy coding could help more, the gain is modest, and the most effective strategy (range coding) is mostly too slow to be worthwhile.
Also, of the things I have tested, ADPCM was still the front runner for "actually passable" audio quality in this domain (to me, some of the modern cellphone codecs sound like unintelligible broken garbage, and require much more complex decoders, not worth the bother).
Only thing I have found that gets much lower bitrate is, say:
One divides the audio up into chunks of 64 samples (1/125 second for 8kHz);
Pick the top 4 square-waves from between 1 and 4 kHz;
Encode the phase and intensity of each square wave.
Typically, the strategy was to break it into 4 half-octaves and pick the highest peak in each half-octave; and then totally ignore everything below 1kHz. If the frequency and amplitude are encoded in around 16 bits each, this achieves an effective bitrate of 8 kbps.
Though, another strategy was 8 quarter octaves and pick the top 4 loudest.
But, audio quality is worse than 2b ADPCM.
Can push it to 4kbps by only encoding the top 2 waveforms.
But, then speech sounds robotic and borderline unintelligible.
Note that dropping to 62.5 Hz sampling also makes speech unintelligible.
While traditionally, this used sine-waves (sinewave synthesis) I had better results with square waves (simpler/cheaper, also better results audio-wise). Computational cost for decoding is fairly modest (mostly some "for()" loops and fixed-point arithmetic).
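A rough sketch of what such a decoder loop might look like. The parameter struct is hypothetical (the post does not give the actual 16-bit packing), and the phase is kept as a 16-bit fraction of a cycle so that the square wave is just a test of the top bit:

```c
#include <stdint.h>

enum { CHUNK_SAMPLES = 64, SAMPLE_RATE = 8000, WAVES_PER_CHUNK = 4 };

/* Hypothetical unpacked parameters for one square wave in a chunk. */
typedef struct {
    uint16_t freq_hz;    /* roughly 1000..4000 */
    uint16_t phase;      /* starting phase, 65536 units = one full cycle */
    int16_t  amplitude;  /* level added or subtracted per half-cycle */
} SquareWave;

/* Decode one chunk by summing the square waves with integer math only. */
void decode_chunk(const SquareWave waves[WAVES_PER_CHUNK],
                  int16_t out[CHUNK_SAMPLES])
{
    for (int i = 0; i < CHUNK_SAMPLES; i++) {
        int32_t acc = 0;

        for (int w = 0; w < WAVES_PER_CHUNK; w++) {
            /* Phase advance per sample, in 1/65536ths of a cycle. */
            uint32_t step = ((uint32_t)waves[w].freq_hz << 16) / SAMPLE_RATE;
            uint16_t ph = (uint16_t)(waves[w].phase + step * (uint32_t)i);

            /* First half of the cycle is positive, second half negative. */
            acc += (ph & 0x8000u) ? -waves[w].amplitude : waves[w].amplitude;
        }
        if (acc > 32767)
            acc = 32767;
        if (acc < -32768)
            acc = -32768;
        out[i] = (int16_t)acc;
    }
}

/* 8000 / 64 = 125 chunks per second, times waves, times bits per wave. */
unsigned sketch_bitrate_bps(unsigned waves, unsigned bits_per_wave)
{
    return (SAMPLE_RATE / CHUNK_SAMPLES) * waves * bits_per_wave;
}
```

With 4 waves at 16 bits each this works out to 8000 bps, and with 2 waves to 4000 bps, matching the figures above.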
Though, effective bitrate may be lower, because it seems that speech encoded this way is often LZ compressible as well (and can be helped along with pattern matching tricks).
...
But, yeah, generally want images and audio to be fairly compact when shoving them inside an EXE or DLL; for more general asset data, generally better to use an external file.
I had often used a custom "WAD4" format here, which is kinda like "WAD2 but with longer names and a directory tree". It then exists as a lower cost option to the ZIP format (while semi-popular, ZIP is a high-overhead format to be used this way).
Also can use WAD4 as a sort of VFS packaging.
>
>
But, things can be considered in relative terms:
Like, C++ may carry various penalties vs C.
>
I don't find C++ carries noticeable penalties compared to C, for my embedded work. But I do disable exceptions and RTTI - exceptions may have very little run-time overhead, but the unwind tables can be significant when code size is important in small systems.
Yes, that is the main thing.
They carry zero performance penalty in practice;
But, have a non-zero penalty for image size.
Not enough to be a deal-breaker towards using them if they are used, but enough that one wants them disabled if not used...