Michael S wrote:
On Wed, 5 Feb 2025 18:10:03 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
As SIMD no longer requires alignment, presumably code no longer
does so.
Yes, if you use AVX/AVX2, you don't encounter this particular
Intel stupidity.
Recently, on the last day (Dec 25th) of Advent of Code, I had a
problem which lent itself to using 32-bit bitmaps: The task was to
check which locks were compatible with which keys, so I ended up
with code like this:
let mut part1 = 0;
for l in li..keylocks.len() {        // keylocks[li..] holds the locks
    let lock = keylocks[l];
    for k in 0..li {                 // keylocks[..li] holds the keys
        let sum = lock & keylocks[k];
        if sum == 0 {                // no shared bits: this lock and key fit
            part1 += 1;
        }
    }
}
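(For context: each lock or key in that puzzle is a small grid of '#'/'.'
cells, so the whole thing fits in one u32 bitmap with one bit per filled
cell. A sketch of such packing, assuming the usual 7x5 schematics with
marker rows at top and bottom; Terje's actual representation may differ:)

// Hypothetical sketch: pack the five interior rows of a 7x5 lock/key
// schematic into a u32 bitmap, one bit per '#' cell (25 bits).
// This is an assumption about the data layout, not Terje's code.
fn pack_schematic(lines: &[&str]) -> u32 {
    let mut bits = 0u32;
    for (row, line) in lines[1..6].iter().enumerate() {
        for (col, c) in line.bytes().enumerate().take(5) {
            if c == b'#' {
                bits |= 1 << (row * 5 + col);
            }
        }
    }
    bits
}

With that representation a lock and a key are compatible exactly when
their bitmaps share no set bit, which is the sum == 0 test above.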
Telling the Rust compiler to target my AVX2-capable laptop CPU (an
Intel i7), I got code that simply amazed me: The compiler unrolled
the inner loop by 32, ANDing 4 x 8 keys by 8 copies of the current
lock into 4 AVX registers (vpand), then comparing with a zeroed
register (vpcmpeqd) (generating -1/0 results) before subtracting
(vpsubd) those from 4 accumulators.
This resulted in just 12 instructions to handle 32 tests.
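A rough AVX2-intrinsics sketch of that pattern (a reconstruction for
illustration, not Terje's code nor the exact compiler output: it does one
8-wide step per iteration where the compiler interleaves four, and it
assumes the number of keys is a multiple of 8):

use std::arch::x86_64::*;

// One lock vs. a slice of keys: AND 8 keys with 8 copies of the lock
// (vpand), compare with zero (vpcmpeqd, giving -1 per matching lane),
// and subtract from an accumulator (vpsubd) so each match adds 1.
// x86_64 only; callers must verify AVX2 support first, e.g. with
// is_x86_feature_detected!("avx2").
#[target_feature(enable = "avx2")]
unsafe fn count_compatible(lock: u32, keys: &[u32]) -> u32 {
    let lock_v = _mm256_set1_epi32(lock as i32);
    let zero = _mm256_setzero_si256();
    let mut acc = _mm256_setzero_si256();
    for chunk in keys.chunks_exact(8) {
        let k = _mm256_loadu_si256(chunk.as_ptr() as *const __m256i);
        let masked = _mm256_and_si256(lock_v, k);   // vpand
        let eq = _mm256_cmpeq_epi32(masked, zero);  // vpcmpeqd
        acc = _mm256_sub_epi32(acc, eq);            // vpsubd
    }
    // Horizontal sum of the eight per-lane counters.
    let mut lanes = [0i32; 8];
    _mm256_storeu_si256(lanes.as_mut_ptr() as *mut __m256i, acc);
    lanes.iter().map(|&x| x as u32).sum()
}

The 12-instructions-per-32-tests figure Terje quotes corresponds to four
of these vpand/vpcmpeqd/vpsubd chains feeding four accumulators per loop
iteration.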
That sounds suboptimal.
By unrolling the outer loop by 2 or 3 you can greatly reduce the number
of memory accesses per comparison. The speedup would depend on the
specific microarchitecture, but I would guess that at least a 1.2x
speedup is there. Especially so when the data is not aligned.
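In scalar form the unrolling Michael S suggests would look roughly like
this (a sketch only: variable names follow Terje's snippet, and the
leftover lock when the count is odd is not handled):

// Unroll the outer (lock) loop by 2 so that each keylocks[k] loaded from
// memory is tested against two locks, halving memory traffic per test.
let mut part1 = 0u32;
let mut l = li;
while l + 1 < keylocks.len() {
    let lock0 = keylocks[l];
    let lock1 = keylocks[l + 1];
    for k in 0..li {
        let key = keylocks[k];
        part1 += ((lock0 & key) == 0) as u32;
        part1 += ((lock1 & key) == 0) as u32;
    }
    l += 2;
}

After auto-vectorization each 8-key vector load would then feed two (or
three) vpand/vpcmpeqd/vpsubd chains instead of one, which is the
memory-access saving being described.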
Anton already replied: as he wrote, the total loop overhead is just
three instructions, all of which can (and will?) overlap with the AVX
instructions.
Due to the combined AVX use and 4x unroll, the original scalar code is
already unrolled 32x, so the loop overhead can mostly be ignored.
If the CPU has enough resources to run more than one 32-byte AVX
instruction per cycle, then the same code will allow all four copies
to run at the same time, but the timing I see on my laptop (93 ps per
test, i.e. about the 0.375 cycles/test that 12 AVX instructions per
32 tests take at one AVX op per cycle on a ~4 GHz clock) corresponds
closely to one AVX op/cycle.
Terje