Subject : Re: else ladders practice
From : antispam (at) *nospam* fricas.org (Waldek Hebisch)
Newsgroups : comp.lang.c
Date : 23 Nov 2024, 17:45:47
Organization : To protect and to server
Message-ID : <vht0rp$mmg$1@paganini.bofh.team>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13 14
User-Agent : tin/2.6.2-20221225 ("Pittyvaich") (Linux/6.1.0-9-amd64 (x86_64))
Bart <bc@freeuk.com> wrote:
On 22/11/2024 19:29, Waldek Hebisch wrote:
Bart <bc@freeuk.com> wrote:
On 22/11/2024 12:33, Waldek Hebisch wrote:
But, OK, here's the first sizeable benchmark that I thought of (I can't
find a reliable Dhrystone one; perhaps you can post a link).
First Google hit for Dhrystone 2.2a
https://homepages.cwi.nl/~steven/dry.c
(I used this one).
There was no shortage of them, there were just too many. All seemed to
need some Linux script to compile them, and all needed Linux anyway
because only that has sys/times.h.
I eventually found one for Windows, and that goes to the other extreme
and needs CL (MSVC) with these options:
cl /O2 /D "WIN32" /D "_DEBUG" /D "_CONSOLE" /D "_MBCS" /MD /W4 /Wp64 /Zi
/TP /EHsc /Fa /c dhry264.c dhry_264.c
Plus it uses various ASM routines written in MASM syntax. I was partway
through getting it to work with my compiler, when I saw your post.
Your version is much simpler to get going, but still not straightforward
because of 'gettimeofday', which is available via gcc, but is not
exported by msvcrt, which is what tcc and my product use.
I changed it to use clock().
The results then are like this (I tried two sizes of matrix element):
            uint32_t   uint64_t
gcc -O0       2165       2180   msec
gcc -O3        282        470
tcc           2572       2509
cc            2165       2243
mcc -opt       720        720
The mcc product keeps some local variables in registers, a minor
optimisation I will apply to cc in due course. It's not a priority,
since usually it makes little difference on real applications. Only on
benchmarks like this.
gcc -O3 seems to enable some SIMD instructions, but only for u32. With
u64 elements, then gcc -O3 is only about 50% faster than my compiler.
If I try -march=native, then the 282 sometimes gets down to 235, and the
470 to 420.
(When functions like this were needed in my programs during the 80s and
90s, I used inline assembly. Most code wasn't that critical.)
FYI, ATM I have a version compiling via Lisp; with bounds checking
on it takes 0.58s, with bounds checking off it takes 0.43s
on my machine. The reason to look at the C version is to do better.
Taken together, your timings and mine indicate that your 'cc' will
give me less speed than going via Lisp. 'mcc -opt' probably would
give an improvement, but not compared to 'gcc'. BTW, below are times
on a slower machine (a 5-year-old cheap laptop):
gcc -O3 -march=native    1722910us
gcc -O3                  1720884us
gcc -O                   1642328us
tcc                      7661992us
via Lisp, checking       5.29s
via Lisp, no checking    4.27s
With -O3 gcc vectorizes inner loops, but apparently on this machine
it backfires and execution time is longer than without vectorization.
In both cases 'tcc' gives slower code than going via Lisp with
array bounds checking on, so ATM using 'tcc' for this application
is rather unattractive.
I may end up using inline assembly, but this is a mess: code for
a fast machine will not run on older ones, and on some machines
non-vectorized code is faster. So I would need multiple versions
of assembler just to cover x86_64. And I have other targets.
And this is just one of the critical routines. I have probably about
10 such critical routines now and it may grow to about 50.
To get good speed I am experimenting with various variants.
So going the assembler way I could be forced to write several
thousand lines of optimized assembler (most of that to
throw out, but before writing them I would not know which
ones are the best). That would be much more work than just
passing various options to 'gcc' and 'clang' and measuring
execution time.
- most of the code is portable, but for timing we need a timer with
sufficient resolution, so I use Unix 'gettimeofday'.
Why? Just make the task take long enough.
Well, Windows 'clock' looks OK, but some old style timing routines
have really low resolution and would lead to excessive run
time (I need to run a rather large number of tests).
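A minimal sketch of the "make the task take long enough" approach, using only the standard clock(): repeat the workload until a minimum interval has passed, then report the average time per call. The names time_per_call_ms and sample_work are hypothetical and not from the benchmark in this thread. Note that clock() reports CPU time on POSIX systems but elapsed wall time in the Microsoft C runtime, so figures are only comparable on the same platform.

```c
#include <time.h>

typedef void (*work_fn)(void);

/* Hypothetical stand-in workload; replace with the routine under test. */
static volatile unsigned long sink;

static void sample_work(void)
{
    unsigned long s = 0;
    for (unsigned long i = 0; i < 100000UL; i++)
        s += i * i;          /* unsigned arithmetic wraps, which is fine here */
    sink = s;                /* volatile store so the result is not discarded */
}

/* Call `work` repeatedly until clock() shows at least `min_ms`
   milliseconds elapsed, then return the average milliseconds per call.
   The number of calls made is stored in *calls_out if it is non-NULL. */
double time_per_call_ms(work_fn work, double min_ms, long *calls_out)
{
    long calls = 0;
    double elapsed_ms;
    clock_t start = clock();

    do {
        work();
        calls++;
        elapsed_ms = 1000.0 * (double)(clock() - start) / CLOCKS_PER_SEC;
    } while (elapsed_ms < min_ms);

    if (calls_out)
        *calls_out = calls;
    return elapsed_ms / (double)calls;
}
```

A call such as time_per_call_ms(sample_work, 500.0, &n) trades run time for resolution: the coarser the timer, the larger min_ms has to be, which is the cost mentioned above when many tests must be run.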
BTW I also ported your program to my 'M' language. The timing however
was about the same as mcc -opt.
The source is below if interested.
AFAICS you have assign-op combinations like 'min:='. ATM I am
undecided about similar operations. I mean, in a language which,
like C, applies operators only to base types, they give some gain.
But I want operators working on a large variety of types, and then
it is not clear how to define them.
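To illustrate the base-type case in C: a rough sketch, assuming 'a min:= b' is read as 'a = min(a, b)' (the thread does not spell out the semantics). The names MIN_ASSIGN, min_assign_i64 and min_assign_f64 are hypothetical. Dispatch uses C11 _Generic, so every supported type has to be listed by hand, which is exactly what stops scaling once operators should work on a large variety of types.

```c
#include <stdint.h>

/* Assumed meaning of 'a min:= b': store b into a only if b is smaller. */
static inline void min_assign_i64(int64_t *a, int64_t b)
{
    if (b < *a)
        *a = b;
}

static inline void min_assign_f64(double *a, double b)
{
    if (b < *a)
        *a = b;
}

/* C11 type dispatch on the left operand; any type not listed here
   is rejected at compile time. */
#define MIN_ASSIGN(a, b)                 \
    _Generic((a),                        \
        int64_t: min_assign_i64,         \
        double:  min_assign_f64)(&(a), (b))
```

With int64_t x = 5, the statement MIN_ASSIGN(x, 3); leaves x equal to 3, and a following MIN_ASSIGN(x, 7); leaves it unchanged.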
-- Waldek Hebisch