On Tue, 10 Sep 2024 18:03:01 -0000 (UTC)
Brett <ggtgp@yahoo.com> wrote:
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Brett wrote:
David Brown <david.brown@hesbynett.no> wrote:
Often you get the most efficient results by writing code clearly
and simply, so that the compiler can understand it better and generate
good object code. This is particularly true if you want the same
source to be used on different targets or different variants of a
target - few people can track the instruction scheduling and
timings on multiple processors better than a good compiler. (And
the few people who /can/ do that spend their time chatting in
comp.arch instead of writing code...) When you do hand-made
micro-optimisations, these can work against the compiler and give
poorer results overall.
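(As a rough illustration of that advice, not something from David's post: a
plainly written loop like the one below gives a current gcc or clang enough
freedom to unroll and vectorize it for whichever target it is compiling for,
whereas a hand-unrolled, hand-scheduled version of the same thing pins the
scheduling to one microarchitecture.)

    #include <stddef.h>
    #include <stdint.h>

    /* Count how many bytes in buf equal c.  Written this simply, the
     * compiler is free to unroll and vectorize it per target; there is
     * no hand scheduling baked into the source. */
    size_t count_byte(const uint8_t *buf, size_t len, uint8_t c)
    {
        size_t n = 0;
        for (size_t i = 0; i < len; i++)
            n += (buf[i] == c);
        return n;
    }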
I know of no example where hand optimized code does worse on a
newer CPU. A newer CPU with bigger OoOe will effectively unroll
your code and schedule it even better.
Not true:
My favorite benchmark program for 20+ years was Word Count; I
re-optimized it for every new x86 generation, and on the Pentium
I got it to run at 1.5 clock cycles per character (40 MB/s on a 60
MHz Pentium).
When the PentiumPro came out, it hit a 10-20 cycle stall for every
pair of characters, so it was about an order of magnitude slower in
cycle count. (But only about 3X slower in wall-clock time, since it ran
at 200 instead of 60 MHz.)
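(Terje's actual code was hand-scheduled x86 assembly and isn't shown in the
thread; as a hedged sketch of the general shape, a table-driven word counter
in C looks roughly like this. The Pentium-friendly asm idiom of loading each
byte into an 8-bit register and then indexing the table with the full 32-bit
register is what turns toxic on the PPro, as discussed below.)

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical classification table: 1 for bytes that belong to a
     * word, 0 for separators; assumed to be filled in at startup. */
    static uint8_t is_word_char[256];

    /* Count words as transitions from "outside a word" to "inside one". */
    size_t word_count(const uint8_t *buf, size_t len)
    {
        size_t words = 0;
        int in_word = 0;

        for (size_t i = 0; i < len; i++) {
            int w = is_word_char[buf[i]];
            if (w && !in_word)
                words++;
            in_word = w;
        }
        return words;
    }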
But how big a slowdown did the unoptimized code get?
Are you describing a glass jaw handling unpredictable branches on a
CPU with a much longer pipeline?
No, the glass jaw of PPro described by Terje is known as a partial
register stall.
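(A minimal sketch of that pattern, assuming x86 with GCC-style inline asm;
this is not code from the thread. Writing only AL and then reading the whole
of EAX is harmless on the in-order Pentium, but on the PPro the full-register
read has to wait while the partial write is merged, a stall the optimization
literature puts at several cycles per occurrence.)

    #include <stdint.h>

    /* Hypothetical classification table, zero-filled here only so the
     * example compiles on its own. */
    static const uint8_t table[256] = {0};

    uint8_t classify(const uint8_t *p)
    {
        uint32_t idx;

        __asm__ ("movb %1, %%al\n\t"  /* partial write: low 8 bits of EAX   */
                 "movl %%eax, %0"     /* full-register read -> stalls on P6 */
                 : "=r"(idx)
                 : "m"(*p)
                 : "eax");

        /* The mask only keeps the C side well defined (the upper bits of
         * EAX were never written); the stall is between the two
         * instructions above. */
        return table[idx & 0xff];
    }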
A shorter pipeline with better worst case handling is going to do
better, even if older. Intel was going for high clock benchmark
speed, not performance.
Typically, the PPro was much faster than the Pentium clock-for-clock,
especially so when running 32-bit software.
But it had a few weak points.