Sujet : Re: Cost of handling misaligned access
De : terje.mathisen (at) *nospam* tmsw.no (Terje Mathisen)
Groupes : comp.archDate : 05. Feb 2025, 18:10:03
Autres entêtes
Organisation : A noiseless patient Spider
Message-ID : <vo061a$2fiql$1@dont-email.me>
References : 1 2 3 4 5
User-Agent : Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0 SeaMonkey/2.53.20
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
As SIMD no longer requires alignment, presumably code no longer does so.
Yes, if you use AVX/AVX2, you don't encounter this particular Intel
stupidity.
Recently, on the last day (Dec 25th) of Advent of Code, I had a problem which lent itself to using 32-bit bitmaps: The task was to check which locks were compatible with which keys, so I ended up with code like this:
let mut part1 = 0;
for l in li..keylocks.len() {
let lock = keylocks[l];
for k in 0..li {
let sum = lock & keylocks[k];
if sum == 0 {
part1 += 1;
}
}
}
Telling the rust compiler to target my AVX2-capable laptop CPU (an Intel i7), I got code that simply amazed me: The compiler unrolled the inner loop by 32, ANDing 4 x 8 keys by 8 copies of the current lock into 4 AVX registers (vpand), then comparing with a zeroed register (vpcmpeqd) (generating -1/0 results) before subtracting (vpsubd) those from 4 accumulators.
This resulted in just 12 instructions to handle 32 tests.
The final code, with zero unsafe/asm/intrinsics, took 5.8 microseconds to run all the needed parsing/setup/initialization and then test 62500 combinations, so just 93 ps per key/lock test!
There was no attempt to check for 32-byte algnment, it all just worked. :-)
The task is of course embarrassingly parallelizable, but I suspect the overhead of starting 4 or 8 threads will be higher than what I would save? I guess I'll have to test!
Terje
-- - <Terje.Mathisen at tmsw.no>"almost all programming can be viewed as an exercise in caching"