Terje Mathisen <terje.mathisen@tmsw.no> writes:
I could have written it as

  for k in 0..li {
    let sum = lock & keylocks[k];
    if sum == 0 {
      part1 += 1;
    }
  }

Does Rust only have this roundabout way to express this sequentially?
In Forth I would express that scalarly as
( part1 ) li 0 do
keylocks i th @ lock and 0= - loop
["-" because 0= produces all-bits-set (-1) for true]
or in C as
for (k=0; k<li; k++)
part1 += (lock & keylocks[k])==0;
which I find much easier to follow.  I also expected 0..li to include
li (based on, I guess, the use of .. in Pascal and its descendants), but
the net tells me that it does not (starting with 0 was the hint that
made me check my expectations).

:-)
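
For comparison with the Forth and C versions above, here is a minimal Rust sketch of two more compact ways to write the same count. The function names and the u32 element type are assumptions for illustration (the thread does not show the declarations); only lock, keylocks, li and part1 come from the quoted code.

```rust
/// Explicit loop, close to the C one-liner.
/// Assumption: keys and locks are 32-bit bit masks.
pub fn count_fits_loop(lock: u32, keylocks: &[u32], li: usize) -> u32 {
    let mut part1 = 0;
    // `0..li` is half-open: k runs over 0, 1, ..., li - 1.
    // The inclusive form would be written `0..=li`.
    for k in 0..li {
        // A bool converts to 0 or 1, like the result of `==` in C.
        part1 += u32::from((lock & keylocks[k]) == 0);
    }
    part1
}

/// The same count written with iterator adapters instead of an index.
pub fn count_fits_iter(lock: u32, keylocks: &[u32], li: usize) -> usize {
    keylocks[..li]
        .iter()
        .filter(|&&key| (lock & key) == 0)
        .count()
}
```

Both functions look only at the first li entries of keylocks, matching the half-open range in the original loop.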
Telling the rust compiler to target my AVX2-capable laptop CPU (an Intel
i7)

I find it deplorable that even knowledgeable people use marketing
labels like "i7" which do not tell anything technical (and very little
non-technical) rather than specifying the full model number (e.g., Core
i7-1270P) or the design (e.g., Alder Lake).  But in the present case
"AVX2-capable CPU" is enough information.

I got code that simply amazed me: The compiler unrolled the inner
loop by 32, ANDing 4 x 8 keys by 8 copies of the current lock into 4 AVX
registers (vpand), then comparing with a zeroed register (vpcmpeqd)
(generating -1/0 results) before subtracting (vpsubd) those from 4
accumulators.

If you have ever learned about vectorization, it's easy to see that
the inner loop can be vectorized.  And obviously auto-vectorization
has worked in this case, not particularly amazing to me.

I have some (30 years?) experience with auto-vectorization, and usually I've been (very?) disappointed. As I wrote, this was the best I have ever seen, and the resulting code actually performed extremely close to the theoretical speed of light, i.e. 3 clock cycles for each 3 AVX instructions.
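
The AND / compare / subtract pattern described above can be modelled lane by lane in plain Rust. This is only a sketch of the idea, not the code the compiler emitted: it models a single 8-lane register rather than the four registers of the 32-way unroll, and the function name, the LANES constant and the u32 element type are assumptions for illustration.

```rust
/// Number of 32-bit lanes in one 256-bit AVX2 register.
const LANES: usize = 8;

/// Scalar model of the vpand / vpcmpeqd / vpsubd sequence.
pub fn count_fits_lanes(lock: u32, keylocks: &[u32]) -> u32 {
    // One counter per lane, as a single accumulator register would hold.
    let mut acc = [0u32; LANES];
    let mut chunks = keylocks.chunks_exact(LANES);
    for chunk in &mut chunks {
        for lane in 0..LANES {
            // vpand: AND the key with the broadcast copy of the lock.
            let anded = chunk[lane] & lock;
            // vpcmpeqd against zero: all bits set (-1) on equality, else 0.
            let mask = if anded == 0 { u32::MAX } else { 0 };
            // vpsubd: subtracting -1 increments the lane's counter.
            acc[lane] = acc[lane].wrapping_sub(mask);
        }
    }
    // Horizontal sum of the lane counters after the main loop.
    let mut total: u32 = acc.iter().sum();
    // Keys that do not fill a whole register are handled one at a time.
    for &key in chunks.remainder() {
        total += u32::from((key & lock) == 0);
    }
    total
}
```

The subtraction is what makes the -1/0 compare result usable directly as a counter update, the same trick as the "-" in the Forth version.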
clang is somewhat better:
For the avx2 case, 70 lines and 250 bytes.
For the x86-64-v4 case, 111 lines and 435 bytes.

Rustc sits on top of the same LLVM infrastructure as clang; even with that 32-way unroll it was quite compact. I did not count, but your 70 lines seem to be in the ballpark.
The messages displayed come from Usenet.