Terje Mathisen <
terje.mathisen@tmsw.no> writes:
Anton Ertl wrote:
If you have ever learned about vectorization, it's easy to see that
the inner loop can be vectorized. And obviously auto-vectorization
has worked in this case, not particularly amazing to me.
>
I have some (30 years?) experience with auto-vectorization, usually I've
been (very?) disappointed.
I have often been disappointed, too.
As I wrote this was the best I have ever
seen, and the resulting code actually performed extremely close to
theoretical speed of light, i.e. 3 clock cycles for each 3 avx instruction.
What theory is behind that? If we take the inner loop (from my
version that uses unsigned, not unsigned long) by clang-14.0.6 -O3
-mavx2:
50: c5 f5 db 34 0a vpand (%rdx,%rcx,1),%ymm1,%ymm6
55: c5 f5 db 7c 0a 20 vpand 0x20(%rdx,%rcx,1),%ymm1,%ymm7
5b: c5 75 db 44 0a 40 vpand 0x40(%rdx,%rcx,1),%ymm1,%ymm8
61: c5 75 db 4c 0a 60 vpand 0x60(%rdx,%rcx,1),%ymm1,%ymm9
67: c5 cd 76 f2 vpcmpeqd %ymm2,%ymm6,%ymm6
6b: c5 fd fa c6 vpsubd %ymm6,%ymm0,%ymm0
6f: c5 c5 76 f2 vpcmpeqd %ymm2,%ymm7,%ymm6
73: c5 e5 fa de vpsubd %ymm6,%ymm3,%ymm3
77: c5 bd 76 f2 vpcmpeqd %ymm2,%ymm8,%ymm6
7b: c5 dd fa e6 vpsubd %ymm6,%ymm4,%ymm4
7f: c5 b5 76 f2 vpcmpeqd %ymm2,%ymm9,%ymm6
83: c5 d5 fa ee vpsubd %ymm6,%ymm5,%ymm5
87: 48 83 e9 80 sub $0xffffffffffffff80,%rcx
8b: 48 39 c8 cmp %rcx,%rax
8e: 75 c0 jne 50 <inner+0x50>
I see that clang uses 4 ymm accumulators (ymm0, ymm3, ymm4, ymm5), so
the recurrences are only 1-cycle recurrences for the 4 vsubd
instructions and the sub instruction. So, with enough resources, a
CPU core could perform 1 iteration per cycle (and with hardware
reassociation, even faster, but apart from adding constants in Alder
Lake ff., we are not there yet). But current CPUs do not have that
many resources. If we want to determine the maximum speed given
resource limits, we have to look at the concrete CPU model.
BTW, an alternative would be to do some summation already in each
iteration, but with still only a one-cycle recurrence. This could
have looked like this:
vpand (%rdx,%rcx,1),%ymm1,%ymm6
vpand 0x20(%rdx,%rcx,1),%ymm1,%ymm7
vpand 0x40(%rdx,%rcx,1),%ymm1,%ymm8
vpand 0x60(%rdx,%rcx,1),%ymm1,%ymm9
vpcmpeqd %ymm2,%ymm6,%ymm6
vpcmpeqd %ymm2,%ymm7,%ymm7
vpaddd %ymm6,%ymm7,%ymm7
vpcmpeqd %ymm2,%ymm8,%ymm8
vpcmpeqd %ymm2,%ymm9,%ymm9
vpaddd %ymm8,%ymm9,%ymm9
vpaddd %ymm7,%ymm9,%ymm9
vpsubd %ymm0,%ymm9,%ymm0
sub $0xffffffffffffff80,%rcx
cmp %rcx,%rax
jne 50 <inner+0x50>
Here the SIMD recurrence uses ymm0. This saves having to perform the
summing up of SIMD accumulators after the inner loop. gcc uses
something like that in one of the codes I looked at.
- anton
-- 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.' Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>