On 6/8/2024 1:28 PM, Malcolm McLean wrote:
On 07/06/2024 01:53, Lawrence D'Oliveiro wrote:
On Thu, 6 Jun 2024 15:38:21 -0500, BGB-Alt wrote:
>
*2: Seemingly the main way I am aware of to get small binaries is to use
an older version of MSVC (such as 6.0 to 9.0), as the binary-bloat
started to get much more obvious around Visual Studio 2010, but is less
of an issue with VS2005 or VS2008.
>
Newer version of proprietary compiler generates worse code than older
version?!?
If the code is calling extern gunctions that do IO, we woul expect these to be massively more sophisticated on a modern ststem Witha little comouter, pribtf just wtites acharacter raster and utimalthe he Os picks the up and flushes it out to a pixel raster. And that' aal it's doing. Whilst on a modrern syste, stdout can do whole lot of intricate things.
That is a whole lot of typos...
But, even if it is built calling MSVCRT as a DLL (rather than static linked), modern MSVC is still the worst of the bunch in this area.
A build as RISC-V + PIE with a static-linked C library still manages to be smaller than an x64 build via MSVC with entirely dynamic-linked libraries.
And, around 72% bigger than the same program built as a dynamic-linked binary with "GCC -O3" (while also often still being around 40% slower).
In contrast, VS2008 can build programs with binary sizes closer to those of GCC.
...
I have noted that there is often a curious level of similarity between code generation in MSVC and what I have managed to pull off in BGBCC, implying that they may be "similar" in terms of optimization and register allocation. Can't say whether or not it is the same algorithms, but GCC seems to work differently.
For example:
BGBCC assigns registers either on a per-function basis, or allocates them within a basic block;
Any registers that are allocated within a basic block are spilled to the stack at the end of that basic block;
Those that are statically assigned to registers will always use the same register within the function in question.
GCC appears to locally assign registers within basic blocks and across the edges between basic blocks, without either global assignment or needing to spill them to the stack.
Seemingly in contrast to both BGBCC and (seemingly) also MSVC.
Though, that said, compiling code with my compiler and ISA still seems to be beating RISC-V in terms of performance (despite GCC's valiant efforts with "-O3").
I suspect though this is mostly due to limitations of RISC-V, and GCC can only work with what it is given:
Only has a [Reg+Disp] addressing mode;
No way to express immediate values larger than 12 bits;
...
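To illustrate the 12-bit immediate limit: the base RISC-V integer instructions carry a signed 12-bit immediate, so a larger 32-bit constant is typically built with a LUI + ADDI pair. The following is only a rough sketch of that split in C; the names rv_imm_pair, rv_split_imm32, rv_join_imm32 and rv_fits_imm12 are made up for illustration and are not taken from GCC or BGBCC.

```c
#include <stdint.h>

/* Hypothetical helper types/names, for illustration only. */
struct rv_imm_pair {
    uint32_t hi20;  /* upper 20 bits, as they would sit in LUI's immediate */
    int32_t  lo12;  /* signed 12-bit value for ADDI, in [-2048, 2047] */
};

/* Does the value fit in a single signed 12-bit immediate? */
static int rv_fits_imm12(int32_t value)
{
    return value >= -2048 && value <= 2047;
}

/* Split a 32-bit constant into LUI and ADDI parts (32-bit wraparound). */
static struct rv_imm_pair rv_split_imm32(int32_t value)
{
    struct rv_imm_pair p;
    uint32_t u = (uint32_t)value;

    /* Low 12 bits, sign-extended the way ADDI interprets them. */
    p.lo12 = (int32_t)(u & 0xFFFu);
    if (p.lo12 >= 0x800)
        p.lo12 -= 0x1000;

    /* Adding 0x800 first compensates for a negative lo12. */
    p.hi20 = ((u + 0x800u) >> 12) & 0xFFFFFu;
    return p;
}

/* Recombine the two parts, as LUI followed by ADDI would. */
static int32_t rv_join_imm32(struct rv_imm_pair p)
{
    return (int32_t)((p.hi20 << 12) + (uint32_t)p.lo12);
}
```

So a constant that fits in 12 bits costs one instruction, while anything larger costs at least two (or a memory load), which is part of what the compiler has to work around.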
And, for PIE, performing function calls and accessing global variables via the GOT doesn't exactly help matters.
Say, for example:
For a non-PIE build, BJX2-XG2 and RV64G were roughly break-even in terms of code-size, with around a 10-20% performance delta;
For a PIE build (currently needed to load RV64 binaries in my kernel), there is around a 60% code-size delta, and around a 20-35% performance delta.
Can note that I am using an ABI design that still allows for dynamic linking and (also) for NOMMU operation similar to FDPIC, but doesn't hurt performance quite as badly.
Function calls still use PC relative displacements, except for dllimport (which would involve a trampoline).
Global variables are accessed as a load/store relative to the Global Pointer. The global pointer may be saved/restored in prologs/epilogs, and there is an ABI-defined ritual to get the global pointer reloaded to the correct value for the current image (note that loading a PC-relative address into the global pointer is not valid in this ABI, as there may be multiple instances of a program running in the same address space, each with their own unique global pointer).
However, saving/restoring and reloading the global pointer once in a function, is cheaper than accessing global variables indirectly via the GOT.
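As a rough model of the difference (not the actual ABI, just plain C standing in for what the generated code does), the two access patterns look something like the following. The struct image_data and the functions read_via_got and read_via_gp are hypothetical names for illustration.

```c
#include <stddef.h>

/* Hypothetical per-instance data area. In a GP-relative scheme, each
   running instance of an image has its own copy, found via its own
   global pointer. */
struct image_data {
    int counter;
    int limit;
};

/* GOT-style access: load the variable's address from a table first,
   then load the variable itself (two dependent loads). */
static int read_via_got(void *const *got, size_t slot)
{
    const int *p = (const int *)got[slot];  /* load 1: address from GOT */
    return *p;                              /* load 2: the value */
}

/* GP-relative access: a single load at a fixed displacement from the
   base pointer, roughly [GP + disp]. */
static int read_via_gp(const struct image_data *gp)
{
    return gp->counter;
}
```

Two instances of the same image would simply be handed different gp values, which is also why the pointer can't be derived from the PC in this sort of scheme.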
...
Granted:
There is a drawback with my ABI, in that it can't implicitly share functions or variables across DLL boundaries.
Can also note that on x86-64, despite MSVC having similar behavior, and GCC using PIE on x86-64 in WSL (which works in a similar way to PIE on RV64), the programs built as PIE on x86-64 are faster than those built via MSVC.
Granted, it is possible that x86-64 can hide a lot of the penalties from PIE via modern x86-64 CPUs almost universally being OoO (only real exceptions I am aware of being the early Atom cores, and early versions of the Xeon Phi).
...