On 3/3/2025 5:23 PM, Lawrence D'Oliveiro wrote:
> On Mon, 3 Mar 2025 17:53:35 -0000 (UTC), Thomas Koenig wrote:
>> If your aim is small code size, it is better to compare output compiled
>> with -Os.
> Then it becomes an artificial benchmark, trying to minimize code size at
> the expense of real-world performance.
> Remember, VAX was built for real-world use, not for academic benchmarks.
Yeah, they are not exactly the same...
FWIW:
In my own past testing, the general leaders in terms of code density were Thumb2 and 32-bit i386.
People claim RV32GC matches Thumb2 here, but my own experience disagrees.
Though, while Thumb2 code is small, it isn't necessarily particularly fast.
From what I can gather, though, Thumb2 and RV32GC (or similar) are pretty close performance-wise.
Previously I had thought x86-64 was doing unreasonably badly on code density, but have since realized this is partly a matter of compiler and compiler options (it can actually do pretty well here).
Though, can note:
-O3 vs -Os makes relatively little difference on RISC-V;
It makes a big difference for x86-64:
There is also a pretty notable speed difference between -O3 and -Os on x86-64 ("-O3" tending to be a fair bit faster). For the size-optimized x86-64 builds, I had used "-Os -ffunction-sections -fdata-sections -Wl,-gc-sections".
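As a rough illustration (my own minimal sketch, not taken from the actual Doom build scripts): the per-function/per-data sections plus linker GC let unreferenced code get dropped at link time, which is part of where the size win comes from with those flags.

  /* sketch.c -- hypothetical file, not part of the actual Doom build. */
  #include <string.h>

  /* Never referenced; with -ffunction-sections plus -Wl,-gc-sections
     the linker can discard its section from the final image. */
  int unused_helper(int x)
  {
      return x * 3 + 1;
  }

  int copy_header(char *dst, const char *src)
  {
      /* Small fixed-size copy: -O3 will often inline and unroll this,
         while -Os is more likely to keep it compact or emit a call. */
      memcpy(dst, src, 16);
      return 0;
  }

  /* Size-optimized build (flags as above):
       gcc -Os -ffunction-sections -fdata-sections -Wl,-gc-sections ...
     Speed-optimized comparison:
       gcc -O3 ... */

The exact inlining behavior depends on the GCC version and target, so treat this only as an illustration of the knobs involved.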
For modern MSVC (mostly running VS 2022), code is bloated regardless of settings ("/Os" exists but is seemingly mostly equivalent to "/O1"). Both are a little smaller than "/O2", but the generated code is still quite large.
Seemingly, older versions (such as VS 2008) did much better here.
Though, the newer versions have other advantages:
C99 support;
Better performance for compiled code;
...
Though, I ended up going in the opposite direction (making choices that favor speed over code density) and seemingly have ended up with moderately good code density for XG2 and XG3 anyway, despite the lack of 16-bit instructions (they are primarily 32/64 rather than 16/32).
Theoretically, an "/Os" option exists for BGBCC, but it doesn't actually accomplish much with XG2 or XG3, as a lot of the settings it tweaks are mostly N/A for these.
Well, the main things it affects are mostly:
Cut-off point between inline memcpy and emitting a function call (roughly as sketched after this list);
Whether to use a single or multiple conditional branches for a loop;
...
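Something like this, very roughly (hypothetical thresholds and names, not BGBCC's actual code):

  /* Hypothetical sketch of the memcpy cut-off: below the threshold,
     the compiler emits an inline copy; above it, a call to memcpy.
     A size-optimizing mode would lower the threshold; the numbers
     here are made up. */
  #include <stddef.h>

  #define COPY_CUTOFF_SPEED 64  /* favor speed: inline larger copies */
  #define COPY_CUTOFF_SIZE  16  /* favor size: call memcpy sooner    */

  static int should_inline_copy(size_t n, int opt_for_size)
  {
      size_t cutoff = opt_for_size ? COPY_CUTOFF_SIZE : COPY_CUTOFF_SPEED;
      return n <= cutoff;
  }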
It would alter register allocation heuristics, say:
Low pressure:
Only enable R8..R14 for register allocation;
Medium pressure:
Enable R24..R31;
High pressure:
Enable R40..R47 and R56..R63;
The optimization settings would adjust the cut-off points (roughly along the lines of the sketch below).
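Roughly this shape, as I understand it (the pressure thresholds are made-up placeholders; only the register pools come from the description above):

  /* Hypothetical sketch of pressure-tiered register-pool selection.
     The thresholds are placeholders; the pools (R8..R14, R24..R31,
     R40..R47, R56..R63) are the ones listed above. */
  enum reg_pool { POOL_LOW, POOL_MED, POOL_HIGH };

  static enum reg_pool pick_pool(int pressure, int lo_cut, int med_cut)
  {
      if (pressure <= lo_cut)
          return POOL_LOW;   /* only R8..R14 enabled         */
      if (pressure <= med_cut)
          return POOL_MED;   /* additionally enable R24..R31 */
      return POOL_HIGH;      /* also R40..R47 and R56..R63   */
  }

In this framing, the optimization settings mostly just move lo_cut and med_cut.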
With XG1, register choice would affect the needed instruction lengths in many cases:
R8..R14: Accessible from 16-bit ops;
R24..R31: Accessible from 32-bit ops;
Others: May require 64-bit encodings.
This becomes N/A, as the newer ISA variants always use the same encoding length regardless of which register is used.
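For XG1 itself, the old rule amounts to roughly this (my own sketch, taking the list above at face value):

  /* Taking the XG1 list above at face value: smallest encoding form
     that can reach a given register.  Purely illustrative; the real
     choice also depends on the specific instruction. */
  static int xg1_min_encoding_bits(int reg)
  {
      if (reg >= 8 && reg <= 14)
          return 16;   /* reachable from 16-bit ops              */
      if (reg >= 24 && reg <= 31)
          return 32;   /* reachable from 32-bit ops              */
      return 64;       /* may need a 64-bit (jumbo-prefixed) form */
  }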
But, yeah, rough code-density ranking (best to worst, with Doom ".text" sizes in kB):
XG1 (263)
XG2 (296)
XG3 (324)
XG1_Fix32 (347, *1)
RV64G+Jx (359)
RV64GC (371)
RV64G (449)
*1: Note that XG1_Fix32 represents a common subset of XG1 and XG2, lacking 16-bit ops, but also always needing to use jumbo prefixes to encode R32..R63.
For reference, an x86-64 "-O3" build is 244K here, but may not be directly comparable as it uses a dynamically-linked C library. Also, x86-64 has a far more complex instruction encoding scheme (vs either BJX2 or RISC-V).
If ranking in terms of performance (best to worst, Doom framerate):
XG2 (25 fps)
XG3 (23 fps)
XG1_Fix32 (21 fps)
RV64G+Jx (20 fps)
XG1 (18 fps)
RV64G (17 fps)
RV64GC (14 fps)
(Framerate given as the value at the start of E1M1).
In many cases, performance has ended up as a higher priority than code density.