Liste des Groupes | Revenir à c arch |
On Wed, 5 Feb 2025 18:10:03 +0100Anton already replied, as he wrote the total loop overhead is just three instructions, all of which can (& will?) overlap with the AVX instructions.
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Anton Ertl wrote:That sounds suboptimal.EricP <ThatWouldBeTelling@thevillage.com> writes:>
As SIMD no longer requires alignment, presumably code no longer>
does so.
Yes, if you use AVX/AVX2, you don't encounter this particular Intel
stupidity.
Recently, on the last day (Dec 25th) of Advent of Code, I had a
problem which lent itself to using 32-bit bitmaps: The task was to
check which locks were compatible with which keys, so I ended up with
code like this:
>
>
let mut part1 = 0;
for l in li..keylocks.len() {
let lock = keylocks[l];
for k in 0..li {
let sum = lock & keylocks[k];
if sum == 0 {
part1 += 1;
}
}
}
>
Telling the rust compiler to target my AVX2-capable laptop CPU (an
Intel i7), I got code that simply amazed me: The compiler unrolled
the inner loop by 32, ANDing 4 x 8 keys by 8 copies of the current
lock into 4 AVX registers (vpand), then comparing with a zeroed
register (vpcmpeqd) (generating -1/0 results) before subtracting
(vpsubd) those from 4 accumulators.
>
This resulted in just 12 instructions to handle 32 tests.
>
By unrolling outer loop by 2 or 3 you can greatly reduce the number of
memory accesses per comparison. The speed up would depend on specific
microarchiture, but I would guess that at least 1.2x speedup is here.
Especially so when data is not aligned.
Les messages affichés proviennent d'usenet.