David Brown <david.brown@hesbynett.no> wrote:
Often you get the most efficient results by writing code clearly and
simply so that the compiler can understand it better and generate good object
code. This is particularly true if you want the same source to be used
on different targets or different variants of a target - few people can
track the instruction scheduling and timings on multiple processors
better than a good compiler. (And the few people who /can/ do that
spend their time chatting in comp.arch instead of writing code...) When
you do hand-made micro-optimisations, these can work against the
compiler and give poorer results overall.

Brett wrote:
I know of no example where hand optimized code does worse on a newer CPU.
A newer CPU with bigger OoOe will effectively unroll your code and schedule
it even better.

Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Not true:

My favorite benchmark program for 20+ years was Word Count; I
re-optimized it for every new x86 generation, and on the Pentium I got
it to run at 1.5 clock cycles per character (40 MB/s on a 60 MHz Pentium).

When the PentiumPro came out, it did a 10-20 cycle stall for every pair
of characters, so about an order of magnitude slower in cycle count.
(But only about 3x in wall-clock time, since it ran at 200 MHz instead
of 60 MHz.)

Brett wrote:
Are you describing a glass jaw handling unpredictable branches on a CPU
with a much longer pipeline? But how big a slowdown did the unoptimized
code get?

Terje Mathisen wrote:
Partial register stalls (PRS) were the single largest glass jaw on the
PentiumPro, but they were very rare in compiled code.

The gcc-optimized unix wc was probably still slower than my
glass-jaw-hitting asm code: the issue was partial register stalls, where
I had been using the relatively tricky approach of interleaving updates
to the BL and BH halves of BX, then using BX to index into a table of
combined word and line increments.
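
Below is a minimal C sketch of the kind of table-driven counting loop
described above. It is illustrative only, not Terje's actual asm: the
previous character's class stands in for BH, the raw input byte stands
in for BL, and the combined index selects a packed (line, word)
increment so the hot loop has no data-dependent branches. The names,
class choices, and table layout are assumptions made for the example.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Character classes: separator, word character, newline. */
enum { CLS_SEP = 0, CLS_WORD = 1, CLS_NL = 2, NCLS = 3 };

static uint8_t cls[256];          /* byte value -> class                     */
static uint8_t inc[NCLS * 256];   /* (prev class, byte) -> packed increments */

/* Each table entry packs the line increment in the high nibble and the
   word increment in the low nibble, so one lookup per input byte updates
   both counters without branches. */
static void build_tables(void)
{
    for (int c = 0; c < 256; c++) {
        if (c == '\n')
            cls[c] = CLS_NL;
        else if (c == ' ' || c == '\t' || c == '\r' || c == '\v' || c == '\f')
            cls[c] = CLS_SEP;
        else
            cls[c] = CLS_WORD;
    }
    for (int prev = 0; prev < NCLS; prev++) {
        for (int c = 0; c < 256; c++) {
            uint8_t w = (uint8_t)(prev != CLS_WORD && cls[c] == CLS_WORD);
            uint8_t l = (uint8_t)(cls[c] == CLS_NL);
            inc[(prev << 8) | c] = (uint8_t)((l << 4) | w);
        }
    }
}

/* Inner loop: "prev" is the high half of the index (the BH role) and the
   raw input byte is the low half (the BL role). */
static void count(const uint8_t *buf, size_t n,
                  unsigned long *lines, unsigned long *words)
{
    unsigned prev = CLS_SEP;                   /* start outside a word */
    for (size_t i = 0; i < n; i++) {
        uint8_t packed = inc[(prev << 8) | buf[i]];
        *lines += packed >> 4;
        *words += packed & 0x0F;
        prev = cls[buf[i]];                    /* becomes next high half */
    }
}

int main(void)
{
    const char *s = "hello world\nfoo  bar baz\n";
    unsigned long lines = 0, words = 0;

    build_tables();
    count((const uint8_t *)s, strlen(s), &lines, &words);
    printf("%lu lines, %lu words, %zu chars\n", lines, words, strlen(s));
    return 0;
}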