On 22/11/2024 12:33, Waldek Hebisch wrote:
> Bart <bc@freeuk.com> wrote:
>>
>> Sure. That's when you run a production build. I can even do that myself
>> on some programs (the ones where my C transpiler still works) and pass
>> it through gcc -O3. Then it might run 30% faster.
>
> On a fast machine running Dhrystone 2.2a I get:
>
>   tcc-0.9.28rc   20000000
>   gcc-12.2 -O    64184852
>   gcc-12.2 -O2   83194672
>   clang-14 -O    83194672
>   clang-14 -O2   85763288
>
> so with -O2 this is more than 4 times faster. Dhrystone correlates
> reasonably with the runtime of tight compute-intensive programs.
> Compilers started to cheat on the original Dhrystone, so there are
> bigger benchmarks like SPEC INT. But Dhrystone 2 has modifications
> to make cheating harder, so I think it is still a reasonable
> benchmark. Actually, the difference may be much bigger; for example,
> in image processing both clang and gcc can use vector instructions,
> which may give an additional speedup of order 8-16.
>
> 30% above means that you are much better than tcc or your program
> is badly behaved (I have programs that make intensive use of
> memory; there the effect of optimization would be smaller, but
> still of order 2).

The 30% applies to my typical programs, not benchmarks. Sure, gcc -O3 can do a lot of aggressive optimisations when everything is contained within one short module and most runtime is spent in clear bottlenecks.
Real apps, like say my compilers, are different: they tend to use globals more, the program flow is more diffuse, and the bottlenecks are harder to pin down.
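For contrast, the kind of tight loop where -O3 and auto-vectorisation really pay off looks something like this (a rough sketch only; scale_pixels and its parameters are made up, it's not from any of my programs):

    #include <stddef.h>

    /* Hypothetical image-processing kernel: scale 8-bit pixels by num/den.
       A branch-free loop over contiguous data is the sort of thing gcc and
       clang can auto-vectorise at -O3, processing many pixels per iteration. */
    void scale_pixels(unsigned char *dst, const unsigned char *src,
                      size_t n, int num, int den)
    {
        for (size_t i = 0; i < n; i++)
            dst[i] = (unsigned char)(src[i] * num / den);
    }

A compiler, by contrast, spends its time chasing pointers through symbol tables and ASTs, where the optimiser has much less to get hold of.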
But, OK, here's the first sizeable benchmark that I thought of (I can't find a reliable Dhrystone one; perhaps you can post a link).
It's called Deltablue.c, copied to db.c below for convenience. I've no idea what it does, but the last figure shown is the runtime, so smaller is better:
c:\cx>cc -r db
Compiling db.c to db.(run)
DeltaBlue C <S:> 1000x 0.517ms
c:\cx>tcc -run db.c
DeltaBlue C <S:> 1000x 0.546ms
c:\cx>gcc db.c && a
DeltaBlue C <S:> 1000x 0.502ms
c:\cx>gcc -O3 db.c && a
DeltaBlue C <S:> 1000x 0.314ms
So here gcc -O3 is 64% faster than my product. However, my 'cc' doesn't yet have the register allocator of the older 'mcc' compiler (which simply keeps some locals in registers). mcc gives this result:
c:\cx>mcc -o3 db && db
Compiling db.c to db.exe
DeltaBlue C <S:> 1000x 0.439ms
So gcc -O3's lead drops to 40%, and that's on a benchmark.
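(The arithmetic: 0.517/0.314 ≈ 1.65 and 0.439/0.314 ≈ 1.40, which is where the 64% and 40% figures come from.)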
Now for a more practical test. First I'll create an optimised version of my compiler by transpiling it to C:
c:\mx6>mc -opt mm -out:mmgcc
M6 Compiling mm.m---------- to mmgcc.exe
W:Invoking C compiler: gcc -m64 -O3 -ommgcc.exe mmgcc.c -s
Now I run my normal compiler, self-hosted, on a test program 'fann4.m':
c:\mx6>tm mm \mx\big\fann4 -ext
Compiling \mx\big\fann4.m to \mx\big\fann4.exe
TM: 0.99
Now the gcc-optimised version:
c:\mx6>tm mmgcc \mx\big\fann4 -ext
Compiling \mx\big\fann4.m to \mx\big\fann4.exe
TM: 0.78
So the gcc-O3 build is 27% faster. Note that fann4.m is 740Kloc, so this represents a compilation speed of just under a million lines per second.
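(That is, 740,000 lines / 0.78 seconds ≈ 950,000 lines per second, and 0.99/0.78 ≈ 1.27 gives the 27%.)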
Some other stats:
c:\mx6>dir mm.exe mmgcc.exe
22/11/2024 14:43 393,216 mm.exe
22/11/2024 14:37 651,776 mmgcc.exe
So my product has a smaller EXE too. For more typical inputs, the differences are narrower:
c:\mx6>copy mm.m bb.m
c:\mx6>tm mm bb
Compiling bb.m to bb.exe
TM: 0.09
c:\mx6>tm mmgcc bb -ext
Compiling bb.m to bb.exe
TM: 0.08
gcc -O3 is 12% faster, saving 10ms of compile time. Curious how tcc would fare? Let's try it:
c:\mx6>mc -tcc mm -out:mmtcc
M6 Compiling mm.m---------- to mmtcc.exe
W:Invoking C compiler: tcc -ommtcc.exe mmtcc.c c:\windows\system32\user32.dll -luser32 c:\windows\system32\kernel32.dll -fdollars-in-identifiers
c:\mx6>tm mmtcc bb
Compiling bb.m to bb.exe
TM: 0.11
Yeah, a tcc-compiled M compiler would take 0.03 seconds longer to build my 35Kloc compiler than a gcc-O3-compiled one; about 37% slower.
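(Again, the arithmetic: 0.09/0.08 ≈ 1.12 for the 12%, and 0.11/0.08 ≈ 1.37 for the 37%.)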
One more point: when gcc builds my compiler, it can use whole-program optimisation because the input is one source file. So that gives it an extra edge compared with compiling individual modules.
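To get the same cross-module view from a build split into separate C files, you'd need link-time optimisation, along the lines of (module names made up, just for illustration):

    gcc -O3 -flto module1.c module2.c module3.c -o mm.exe

The single-file transpiled output gets that effect without any extra flags.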