anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
For:
>
unsigned long inner(unsigned long li, unsigned lock, unsigned keylocks[], unsigned long part1)
{
  unsigned long k;
  for (k=0; k<li; k++)
    part1 += (lock & keylocks[k])==0;
  return part1;
}
>
gcc -Wall -O3 -mavx2 -c x.c && objdump -d x.o
>
produces 109 lines of disassembly output (which I will spare you),
with a total length of 394 bytes.
...
clang is somewhat better:
>
For the avx2 case, 70 lines and 250 bytes.
I have now taken a closer look at the generated code. Let's first
consider the clang code:
The inner loop works on 16 elements of keylocks[] per iteration, 4 per
SIMD instruction. It uses AVX-128 instructions (with 32-bit elements)
for the (lock & keylocks[k])==0 part, then converts that to 64-bit
elements and continues with AVX-256 instructions for the part1+= part;
it does that by zero-extending the elements, anding them with 1, and
adding them to the accumulators. It would have saved instructions to
sign-extend the elements and subtract them from the accumulators.
Before the inner loop there is just setup, with no attempt to align
the input. After the inner loop, the results in the 4 ymm registers
are summed up (with maximal latency), then the 4 results inside the
ymm register are summed up, and finally a scalar loop counts the
remaining items.
Now gcc:
The inner loop works on 8 elements of keylocks[] per iteration,
initially with 8 32-bit elements per SIMD instruction (AVX-256), later
with 4 64-bit elements per SIMD instruction. It ands, then
zero-extends (so it needs only one and instead of two for anding after
the extension), ands and adds; it then adds the two result ymm
registers to one ymm accumulator (in a latency-minimizing way).
Again, there is no alignment code before the inner loop, just setup.
The elements of the ymm accumulator are summed up into one result.
Then, if there are 4 elements or more remaining, it performs the setup
and SIMD code for working on 4 elements with AVX-128 code, including
after the width expansion; so it uses two AVX-128 instructions for
performing the zero-extension rather than one AVX-256 instruction.
It also sets up all the constants in the xmm registers (again, in
case control arrives from the inner loop); it finally sums up the result and adds
it to the result of the inner loop. Finally, the last 0-3 iterations
are performed in a scalar loop.
The gcc-12 code is more sophisticated in several respects than the
clang-14 code, but it also leaves several obvious improvements open: it
could use sign-extension and subtraction (clang, too); the handling of
the final 4-7 items in gcc could work with 4-element-SIMD throughout
and reuse the contents of the registers that hold constants, and the
result could be added to the SIMD accumulator, summing that up only
afterwards. Maybe the gcc maintainers have performed some of these
improvements in the meantime.
Given the description of Terje Mathisen, my guess is that part1 and li
have 32-bit types, so the C equivalent would be:
unsigned inner(unsigned li, unsigned lock, unsigned keylocks[], unsigned part1)
{
  unsigned k;
  for (k=0; k<li; k++)
    part1 += (lock & keylocks[k])==0;
  return part1;
}
With that I see shorter code from both gcc and clang, which is
unsurprising given the complications that fall away.
I looked also at the code that the compilers produce with
-march=x86-64-v4 (which includes AVX-512) for the latter function.
clang generates code that uses neither (512-bit) zmm registers nor the
k (predicate) registers. It is different from and longer than the
-mavx2 code, though.
gcc uses both zmm and k registers. The inner loop is quite small:
40: 62 f1 65 48 db 08 vpandd (%rax),%zmm3,%zmm1
46: 48 83 c0 40 add $0x40,%rax
4a: 62 f2 76 48 27 c9 vptestnmd %zmm1,%zmm1,%k1
50: 62 f1 7d c9 6f ca vmovdqa32 %zmm2,%zmm1{%k1}{z}
56: 62 f1 7d 48 fa c1 vpsubd %zmm1,%zmm0,%zmm0
5c: 48 39 c2 cmp %rax,%rdx
5f: 75 df jne 40 <inner+0x40>
It works on 16 elements per iteration. It uses vptestnmd and
vmovdqa32 to set zmm1 to -1 (zmm2 is set to -1 in the setup) or 0
(using the {z} option); it's unclear to me why it does not use
vpcmpeqd, which would have produced the same result in one
instruction. Without the intervening extension, the compiler is able
to use a subtraction of -1.
The inner loop is preceded by setup (again, no alignment). It is
followed by summing the results and adding them to a 32-bit
accumulator, then by an optional use of AVX-256 for 8 elements (again
summing the results and adding them to the accumulator), and finally
by a completely unrolled scalar loop for the last 0-7 elements.
What I would try is to use predication and AVX-512 for the last 1-15
(or maybe 1-16) elements, and only then sum up the results and add the
total to part1. I wonder why the gcc-12 developers did not make use of that
possibility (and looking at gcc-14.2 with godbolt, the code for that
apparently is still pretty similar; the clang-19 code has become
shorter, but still does not use zmm or k registers).
- anton
-- 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.' Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>