Terje Mathisen <terje.mathisen@tmsw.no> writes:
    for k in 0..li {
        let sum = lock & keylocks[k];
        if sum == 0 {
            part1 += 1;
        }
    }
Does Rust only have this roundabout way to express this sequentially?
In Forth I would express that scalarly as
( part1 ) li 0 do
keylocks i th @ lock and 0= - loop
["-" because 0= produces all-bits-set (-1) for true]
or in C as
    for (k=0; k<li; k++)
        part1 += (lock & keylocks[k])==0;
which I find much easier to follow. I also expected 0..li to include
li (based on, I guess, the meaning of .. in Pascal and its descendants), but
the net tells me that it does not (starting with 0 was the hint that
made me check my expectations).
Telling the rust compiler to target my AVX2-capable laptop CPU (an Intel
i7)
I find it deplorable that even knowledgeable people use marketing
labels like "i7" which do not tell anything technical (and very little
non-technical) rather than specifying the full model number (e.g., Core
i7-1270P) or the design (e.g., Alder Lake). But in the present case
"AVX2-capable CPU" is enough information.
I got code that simply amazed me: The compiler unrolled the inner
loop by 32, ANDing 4 x 8 keys by 8 copies of the current lock into 4 AVX
registers (vpand), then comparing with a zeroed register (vpcmpeqd)
(generating -1/0 results) before subtracting (vpsubd) those from 4
accumulators.
If you have ever learned about vectorization, it's easy to see that
the inner loop can be vectorized. And auto-vectorization obviously
worked in this case, which is not particularly amazing to me.
But if you have learned about vectorization, you will also find that
you often see ways to vectorize code, while many programming languages
offer no way to express the vectorization directly. Instead, you
write the code as scalar code and hope that the auto-vectorizer
actually vectorizes it. If it does not, there is no indication of how
you can get the compiler to auto-vectorize.
Even for Fortran, where the array sublanguage has vector semantics
within expressions (maybe somebody can show code for the example
above), Thomas Koenig tells us that his gcc front end produces scalar
IR code from that and then relies on auto-vectorization to undo the
scalarization.
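For illustration, one way to state the vectorization explicitly in C is
the GNU C vector extension, which gcc and clang both accept (it is not
ISO C). The following is only a sketch under assumptions: the name
inner_vec, the choice of 8 lanes, and the 32-bit per-lane counters are
mine, picked to mirror the vpand/vpcmpeqd/vpsubd sequence described
above, and I make no claim about what code a given compiler generates
from it.

```c
#include <assert.h>
#include <limits.h>
#include <string.h>

#define LANES 8   /* assumption: 8 x 32-bit lanes, i.e. one AVX2 register */

/* GNU C vector extension types (gcc and clang), not ISO C. */
typedef unsigned v8u __attribute__((vector_size(LANES * sizeof(unsigned))));
typedef int      v8i __attribute__((vector_size(LANES * sizeof(int))));

/* Hypothetical explicit-vector variant of the scalar loop. */
unsigned long inner_vec(unsigned long li, unsigned lock,
                        const unsigned keylocks[], unsigned long part1)
{
    const v8u zero = {0};
    v8u lockv = {0}, acc = {0}, keys;
    unsigned long k;
    int i;

    /* Each lane counts at most li/LANES hits; it must not wrap. */
    assert(li / LANES <= UINT_MAX);

    for (i = 0; i < LANES; i++)      /* broadcast lock into every lane */
        lockv[i] = lock;

    for (k = 0; k + LANES <= li; k += LANES) {
        v8i hit;
        memcpy(&keys, &keylocks[k], sizeof keys); /* no alignment assumed */
        hit = (keys & lockv) == zero;  /* -1 in matching lanes, else 0 */
        acc -= (v8u)hit;               /* subtracting -1 adds 1 */
    }

    for (i = 0; i < LANES; i++)      /* horizontal sum of the lanes */
        part1 += acc[i];
    for (; k < li; k++)              /* scalar tail, fewer than LANES */
        part1 += (lock & keylocks[k]) == 0;
    return part1;
}
```

The subtraction is the same trick as the "-" in the Forth version: a
vector comparison produces all-bits-set for true.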
There was no attempt to check for 32-byte alignment, it all just worked. :-)
When I try this stuff with gcc and it actually succeeds at
auto-vectorization, the result tends to be very long, and it's also
the case here:
For:
unsigned long inner(unsigned long li, unsigned lock, unsigned keylocks[], unsigned long part1)
{
  unsigned long k;
  for (k=0; k<li; k++)
    part1 += (lock & keylocks[k])==0;
  return part1;
}
gcc -Wall -O3 -mavx2 -c x.c && objdump -d x.o
produces 109 lines of disassembly output (which I will spare you),
with a total length of 394 bytes. When I ask for AVX-512 with
gcc -Wall -O3 -march=x86-64-v4 -c x.c && objdump -d x.o
it's even worse: 139 lines and 538 bytes. My impression is that gcc
tries to align the main loop to 32-byte (for AVX2) or 64-byte
boundaries and generates lots of code around the main loop in order to
get there.
Which somewhat leads us back to the topic of the thread. I wonder if
the alignment really helps for this loop, if so, how much, and how
many iterations are necessary to amortize the overhead. But I am too
lazy to measure it.
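For anyone less lazy, a measurement could be set up roughly as follows.
This is a sketch, not something I have run for results: time_inner, its
parameters, and the fill pattern are hypothetical, aligned_alloc needs
a C11 library, and clock() only gives coarse CPU time, so reps has to
be large. The array starts offset elements past a 64-byte-aligned
address, so offset 0 gives an aligned array and offsets 1..15 give
misaligned ones with identical contents.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

unsigned long inner(unsigned long li, unsigned lock,
                    unsigned keylocks[], unsigned long part1)
{
    unsigned long k;
    for (k = 0; k < li; k++)
        part1 += (lock & keylocks[k]) == 0;
    return part1;
}

/* Hypothetical helper: run inner() reps times over li elements that
   start offset elements past a 64-byte boundary. Returns the CPU
   time in seconds and stores the accumulated count in *count. */
double time_inner(unsigned long li, unsigned long offset,
                  unsigned long reps, unsigned long *count)
{
    size_t bytes = (li + offset) * sizeof(unsigned);
    unsigned *base, *keys;
    unsigned long i, r, sum = 0;
    clock_t start, stop;

    /* aligned_alloc wants a size that is a multiple of the alignment */
    bytes = (bytes + 63) / 64 * 64;
    if (bytes == 0)
        bytes = 64;
    base = aligned_alloc(64, bytes);
    assert(base != NULL);
    keys = base + offset;

    for (i = 0; i < li; i++)         /* same data for every offset */
        keys[i] = (unsigned)i * 2654435761u;

    start = clock();
    for (r = 0; r < reps; r++)       /* vary lock so calls differ */
        sum = inner(li, (unsigned)r, keys, sum);
    stop = clock();

    free(base);
    *count = sum;
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Comparing time_inner(li, 0, reps, &c) with time_inner(li, 1, reps, &c)
over a range of li would show whether alignment matters and from which
trip count the surrounding code is amortized. If the compiler inlines
inner() into the timing loop, the measured code may differ from the
standalone function; putting inner() in a separate file avoids that.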
clang is somewhat better:
For the avx2 case, 70 lines and 250 bytes.
For the x86-64-v4 case, 111 lines and 435 bytes.
The versions used are gcc-12.2.0 and clang-14.0.6.
- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>