Bart <
bc@freeuk.com> wrote:
On 22/11/2024 12:33, Waldek Hebisch wrote:
Bart <bc@freeuk.com> wrote:
>
Sure. That's when you run a production build. I can even do that myself
on some programs (the ones where my C transpiler still works) and pass
it through gcc-O3. Then it might run 30% faster.
On fast machine running Dhrystone 2.2a I get:
tcc-0.9.28rc 20000000
gcc-12.2 -O 64184852
gcc-12.2 -O2 83194672
clang-14 -O 83194672
clang-14 -O2 85763288
so with 02 this is more than 4 times faster. Dhrystone correlated
resonably with runtime of tight compute-intensive programs.
Compiler started to cheat on original Dhrystone, so there are
bigger benchmarks like SPEC INT. But Dhrystone 2 has modifications
to make cheating harder, so I think it is still reasonable
benchmark. Actually, difference may be much bigger, for example
in image processing both clang and gcc can use vector intructions,
with may give additional speedup of order 8-16.
30% above means that you are much better than tcc or your program
is badly behaving (I have programs that make intensive use of
memory, here effect of optimization would be smaller, but still
of order 2).
The 30% applies to my typical programs, not benchmarks. Sure, gcc -O3
can do a lot of aggressive optimisations when everything is contained
within one short module and most runtime is spent in clear bottlenecks.
Real apps, like say my compilers, are different. They tend to use
globals more, program flow is more disseminated. The bottlenecks are
harder to pin down.
But, OK, here's the first sizeable benchmark that I thought of (I can't
find a reliable Dhrystone one; perhaps you can post a link).
First Google hit for Dhrystone 2.2a
https://homepages.cwi.nl/~steven/dry.chttps://homepages.cwi.nl/~steven/dry.c
(I used this one).
Compiled in two steps like:
gcc -c -O -o dry.o dry.c
gcc -o dry2 -DPASS2 -O dry.c dry.o
If you want something practical, I have the following C function:
#include <stdint.h>
void inner_mul(uint32_t * x, uint32_t * y, uint32_t * z,
uint32_t xdeg, uint32_t ydeg, uint32_t zdeg, uint32_t p) {
if (ydeg < xdeg) {
uint32_t * tmpp = x;
uint32_t tmp = xdeg;
x = y;
xdeg = ydeg;
y = tmpp;
ydeg = tmp;
}
if (zdeg < xdeg) {
xdeg = zdeg;
}
if (zdeg < ydeg) {
ydeg = zdeg;
}
uint64_t ss;
long i;
long j;
for(i=0; i<=xdeg; i++) {
ss = z[i];
for(j=0; j<=i; j++) {
ss += ((uint64_t)(x[i-j]))*((uint64_t)(y[j]));
}
z[i] = ss%p;
}
for(i=xdeg+1; i<=ydeg; i++) {
ss = z[i];
for(j=0; j<=xdeg; j++) {
ss += ((uint64_t)(x[j]))*((uint64_t)(y[i-j]));
}
z[i] = ss%p;
}
for(i=ydeg+1; i<=zdeg; i++) {
ss = z[i];
for(j=i-xdeg; j<=ydeg; j++) {
ss += ((uint64_t)(x[i-j]))*((uint64_t)(y[j]));
}
z[i] = ss%p;
}
}
and the following test driver:
#include <stdio.h>
#include <stdint.h>
#include <sys/time.h>
extern void inner_mul(uint32_t * x, uint32_t * y, uint32_t * z,
uint32_t xdeg, uint32_t ydeg, uint32_t zdeg, uint32_t p);
int main(void) {
uint32_t x[85], y[85], z[169];
int i;
for(i=0;i<85;i++) {
x[i] = 1;
y[i] = 1;
}
struct timeval tv1, tv2;
gettimeofday(&tv1, 0);
int j;
for(j=0; j < 100000; j++) {
for(i=0;i<169; i++) {
z[i] = 1;
}
inner_mul(x, y, z, 84, 84, 168, 1000003);
}
gettimeofday(&tv2, 0);
for(i=0;i<12; i++) {
printf(" %u,", z[i]);
}
putchar('\n');
long tt = tv2.tv_sec - tv1.tv_sec;
tt *= (1000*1000);
tt += (tv2.tv_usec - tv1.tv_usec);
printf("Time: %ld us\n", tt);
return 0;
}
At least for gcc and clang put them is separate files to avoid
simplifing the task too much ('inner_mul' is supposed to work
with variable data, here we feed it the same thing several times).
Of course, the test driver is silly, but 'inner_mul' is doing
important computation and, as long as 'inner_mul' is compiled
without knowledge of actual parameters, the test should be fair.
My results are:
clang -O3 -march=native 126112us
clang -O3 222136us
clang -O 225855us
gcc -O3 -march=native 82809us
gcc -O3 114365us
gcc -O 287786us
tcc 757347us
There is some irregularity in timings, but this shows that
factor of order 9 is possible.
Notes:
- this code is somewhat hard to vectorize, but clang
and gcc manage to do this,
- vectorized code is sensitive to alignment of the data, some
variation may be due to this,
- modern processors dynamically change clock frequency, the
times seem to be high enough to trigger switch to maximal
frequency (initally I used smaller number of iterations
but timing were less regular),
- most of code is portable, but for timing we need timer with
sufficient resolution, so I use Unix 'gettimeofday'.
-- Waldek Hebisch