On 11/2/2024 11:01 AM, Martin Brown wrote:
Most compilers these days are smart enough to move loop invariants outside of a loop and then dispose of the loop. You must have side effects in any code that you want to benchmark. Optimisers can be *really* smart about rearranging code for maximum performance by avoiding pipeline stalls. Only the very best humans can match them now.
You need a few more asterisks stressing "really"! What's most amusing
is how the folks who write "clever"/obscure code fragments THINKING they
are "optimizing" it just annoy the compiler. On any substantial piece
of code, "you" simply can't outperform it. Your mind gets tired. You
make mistakes. The compiler just plows ahead. EVERY TIME IT IS INVOKED!
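And Martin's point about side effects bites people constantly. A toy
(purely illustrative, mine and not anyone's real benchmark) showing the
difference between a loop the optimizer is free to discard and one it
has to keep:

/*
 * The first loop computes a value nothing ever uses, from inputs the
 * compiler knows at build time, so it can fold the invariant multiply
 * and then throw the whole loop away -- you end up timing nothing.
 * Reading a volatile source and writing a volatile sink gives the
 * second loop side effects the compiler must preserve.
 */
#include <stdio.h>

volatile unsigned src = 42;     /* value the compiler can't assume   */
volatile unsigned sink;         /* store the compiler can't discard  */

int main(void)
{
    unsigned acc = 0;

    /* Likely deleted entirely at -O2: result unused, inputs constant. */
    for (unsigned i = 0; i < 100000000u; i++)
        acc += 42u * 2654435761u;

    /* Survives: each iteration re-reads src and the total is stored.  */
    acc = 0;
    for (unsigned i = 0; i < 100000000u; i++)
        acc += src * 2654435761u;
    sink = acc;

    printf("%u\n", sink);
    return 0;
}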
Every now and then you stumble upon a construct that on certain platforms is unreasonably fast (2x or 4x). Increasingly because it has vectorised a loop on the fly when all go faster stripes are enabled.
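Indeed. A trivial element-wise loop like this one (illustrative only)
is the sort of thing that picks up a "free" 2x-4x once the go faster
stripes (-O3, -march=native, or your compiler's equivalent) let it be
vectorised behind your back:

#include <stddef.h>

/* restrict promises no aliasing between the arrays, which is what
 * frees the compiler to turn this into SIMD loads/multiplies/stores
 * on its own. */
void scale_add(float * restrict dst, const float * restrict a,
               const float * restrict b, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + k * b[i];
}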
A developer /aware of the platform on which the code will execute/ can often
design a better *algorithm* to beat the compiler's optimizations of a poorer
algorithm.
I spend a lot of time thinking about how my data is structured to exploit
features present in the hardware. E.g., a traditional mind would group
all of a task's state into a single struct. But, that will almost certainly
span a couple of cache lines.
So, when making scheduling decisions, the *processor* will be flitting
around between many cache lines -- examining *a* piece of data in each.
That means more trips to memory to fill those other cache lines, with
each line's fetch "wasted" on the single datum you actually examine.
Instead, group the parameters from MANY tasks in such a way that the
examination of the datum for the first task drags similar data into
that cache line for the *next* task's parameters; leverage the
effort already expended on THAT cache line instead of just (likely)
discarding it in favor of fetching another line.
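In code, the difference looks something like this (field names invented
for the sake of the example; assumes the common 64-byte cache line):

#include <stdint.h>

#define MAX_TASKS 64

/* The "traditional mind": everything about one task in one struct.
 * Each element easily spans a cache line or two, so scanning one
 * field across all tasks touches a different line per task. */
struct task {
    uint8_t  priority;
    uint8_t  state;
    uint32_t deadline;
    void    *stack;
    uint8_t  context[128];      /* saved registers, FPU state, etc. */
};
struct task tasks_aos[MAX_TASKS];

/* Cache-aware: the data the scheduler actually scans, packed
 * together.  64 one-byte priorities fit in a single 64-byte line,
 * so examining the first task's priority drags the rest along. */
struct scheduler_state {
    uint8_t  priority[MAX_TASKS];
    uint8_t  state[MAX_TASKS];
    uint32_t deadline[MAX_TASKS];
};
struct scheduler_state tasks_soa;

/* One cache line touched instead of (up to) one per task. */
int pick_highest_priority(void)
{
    int best = 0;
    for (int i = 1; i < MAX_TASKS; i++)
        if (tasks_soa.priority[i] > tasks_soa.priority[best])
            best = i;
    return best;
}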
So, instead of just knowing, e.g., when to use a particular type of
search or sort algorithm (based on a characterization of the data to
be searched/sorted), you think about the "hardware algorithm" that
your code invokes beneath whatever your software is doing.
Note how large caches have become on modern processors. And, the wasted
opportunities they represent for multithreaded implementations ("Gee,
all of that data in the cache that I thought I could make use of is now
largely useless as the next task isn't likely to benefit from it!")
[Another argument affecting the choice of implementation languages:
locality of reference. Stack computers, anyone??]
Precision timers and benchmarking tools are available on most platforms; no need to use a stopwatch unless you enjoy watching paint dry.
This is complicated if you have interrupts that can nickel-and-dime
your execution time.
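A sketch of the usual defense (POSIX clock_gettime here; substitute
your platform's cycle counter): time many runs and report the minimum,
so the runs that interrupts inflated don't end up in the number you
quote.

#include <stdio.h>
#include <stdint.h>
#include <time.h>

static volatile unsigned scratch;

static void code_under_test(void)   /* stand-in for the real workload */
{
    for (unsigned i = 0; i < 100000u; i++)
        scratch += i;
}

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void)
{
    uint64_t best = UINT64_MAX;

    for (int run = 0; run < 1000; run++) {
        uint64_t t0 = now_ns();
        code_under_test();
        uint64_t dt = now_ns() - t0;
        if (dt < best)              /* min sheds interrupt-inflated runs */
            best = dt;
    }

    printf("best of 1000: %llu ns\n", (unsigned long long)best);
    return 0;
}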
Benchmarking, in general, is fraught with perils for the naive.
Few truly appreciate how narrow the scope of their *select*
"micro-optimizations" really is.
In the 80's, there was an ongoing split in the Motogorilla vs. Inhell
camps over *system* designs. Folks would make arbitrary claims
(backed up with REAL data) to support why X was better than Y.
But, they rarely looked at the whole picture. And, a product *is*
"the whole picture".
"Yeah, its nice that one opcode fetch allows you to push/pull
*all* of the processor state (vs. having to fetch an instruction
to push/pop each individual item). But, your code is more than
just push/pull operations. If you are constantly going back to
memory to store the (temporary?) result of the last instruction,
then having internal state that can be used to eliminate that
memory access allows the *process* to run faster. 'Let's hold
memory DOLLARS constant and see how well things perform...'"
[Gee, this processor runs at 8MHz while this other runs at 2...
which is MORE PRODUCTIVE?]