Michael S <already5chosen@yahoo.com> writes:
> In the mean time I did a few measurements on a Xeon E3 1271 v3. That
> is a rather old uArch - Haswell, the first core that supports AVX2.
> During the tests it was running at 4.0 GHz.
>
> 1. Original code (rewritten in plain C) compiled with clang -O3
>    -march=ivybridge (no AVX2)
> 2. Original code (rewritten in plain C) compiled with clang -O3
>    -march=haswell (AVX2)
> 3. Manually vectorized AVX2 code compiled with clang -O3
>    -march=skylake (AVX2)
>
> Results were as follows (usec/call):
> 1 - 5.66
> 2 - 5.56
> 3 - 2.18
In the meantime, I also wrote the original code in plain C
(keylocks1.c), then implemented your idea of unrolling the outer loop
and comparing a subarray of locks to each key (is this called strip
mining?) in plain C (hoping that auto-vectorization kicks in) as
keylocks2.c, and finally rewrote the latter version to use gcc vector
extensions (keylocks3.c).  I wrote a dummy main around that which
calls the routine 100_000 times; given that the original routine's
performance does not depend on the data, and that I used non-0 keys
(so keylocks[23].c does not skip any keys), the actual data is not
important.
You can find the source code and the binaries I measured at
<http://www.complang.tuwien.ac.at/anton/keylock/>.  The binaries were
compiled with gcc 12.2.0 and (in the clang subdirectory) clang 14.0.6;
the clang compilations sometimes used different UNROLL factors than
the gcc compilations (and I am unsure which; see below).
The original code is:
unsigned keylocks(unsigned keys[], unsigned nkeys,
                  unsigned locks[], unsigned nlocks)
{
  unsigned i, j;
  unsigned part1 = 0;
  for (i=0; i<nlocks; i++) {
    unsigned lock = locks[i];
    for (j=0; j<nkeys; j++)
      part1 += (lock & keys[j])==0;
  }
  return part1;
}
For keylocks2.c the central loops are:
  for (i=0; i<UNROLL; i++)
    part0[i]=0;
  for (i=0; i<nlocks1; i+=UNROLL) {
    for (j=0; j<nkeys1; j++) {
      unsigned key = keys1[j];
      for (k=0; k<UNROLL; k++)
        part0[k] += (locks1[i+k] & key)==0;
    }
  }
For UNROLL I tried 8, 16, and 32 for AVX2, and 16, 32, and 64 for
AVX-512; the numbers below are for the factors that produced the
lowest cycle counts on the Rocket Lake machine.
The central loops are preceded by code that arranges the data so that
this code works: the locks are copied to the longer array locks1,
whose length is a multiple of UNROLL; the entries beyond nlocks are ~0
(so they increase the count by 0).  The keys are copied to keys1 (with
0 keys removed so that the extra locks are not counted; that may also
increase efficiency if there is a key=0).  The central loops are
followed by summing up the elements of part0.
keylocks3.c, which uses the gcc vector extensions, just changes
keylocks2.c in a few places. In particular, it adds a type vu:
typedef unsigned vu __attribute__ ((vector_size (UNROLL*sizeof(unsigned))));
The central loops now look as follows:
  for (i=0; i<UNROLL; i++)
    part0[i]=0;
  for (i=0; i<nlocks1; i+=UNROLL) {
    vu lock = *(vu *)(locks1+i);
    for (j=0; j<nkeys1; j++) {
      part0 -= (lock & keys1[j])==0;
    }
  }
One interesting aspect of the gcc vector extensions is that the result
of comparing two vectors is 0 (false) or ~0 (true) (per element),
whereas for scalars the value for true is 1. Therefore the code above
updates part0 with -=, whereas in keylocks2.c += is used.
While the use of ~0 is a good choice when designing a new programming
language, I would have gone for 1 in the case of a vector extension
for C, for consistency with the scalar case; in combination with
hardware that produces ~0 (e.g., Intel SSE and AVX SIMD stuff), that
means that the compiler will introduce a negation in its intermediate
representation at some point; I expect that compilers will usually be
able to optimize this negation away, but I would not be surprised at
cases where my expectation is disappointed.
keylocks3.c compiles without warning on clang, but the result usually
segfaults (though sometimes it does not, e.g., in the timed run on
Zen4; it segfaults in other runs on Zen4).  I have not investigated
why this happens; I just did not include results from runs where it
segfaulted, and I tried additional runs for keylocks3-512 on Zen4 in
order to have one result there.
I would have liked to compare the performance of my code against your
code, but your code apparently was destroyed by arbitrary line
breaking in your news-posting software. Anyway, here are my results.
First cycles (which eliminates worries about turbo modes) and
instructions, then usec/call.
The cores are:
Haswell: Core i7-4790K (similar to Michael S.'s CPU)
Golden Cove: Core i3-1315U (same P-core as Terje Mathisen's laptop)
Gracemont: Core i3-1315U (same E-core as Terje Mathisen's laptop)
Rocket Lake: Xeon W-1370P (2021 Intel CPU)
Zen4: Ryzen 7 8700G (2024)
I also measured Tiger Lake (Core i5-1135G7, the CPU of a 2021 laptop),
but the results were very close to the Rocket Lake results, so because
of the limited table width, I do not show them.
The first three cores do not support AVX-512, the others do.
Cycles:
      Haswell    Golden Cove      Gracemont    Rocket Lake           Zen4
1_818_241_431  1_433_208_539  1_778_675_623  2_365_664_737  1_677_853_186  gcc avx2 1
1_051_191_216  1_189_869_807  1_872_856_423    981_948_517    727_418_069  gcc avx2 2 8
1_596_783_872  1_213_400_891  2_076_677_426  1_690_280_182    913_746_088  gcc avx2 3 8
2_195_438_821  1_638_006_130  2_577_451_872  2_291_743_879  1_617_970_157  clang avx2 1
2_757_454_335  2_151_198_125  2_506_427_284  3_174_899_185  1_523_870_829  clang avx2 2 8?
                                                              638_374_463  clang avx2 3 8?
                                             1_139_175_457  1_219_164_672  gcc 512 1
                                               856_818_642    900_108_135  gcc 512 2 32
                                               866_077_594  1_072_172_449  gcc 512 3 16
                                             2_479_213_408  1_479_937_930  clang 512 1
  912_273_706    936_311_567                   847_289_380    634_826_441  clang 512 2 16?
                                                              636_278_210  clang 512 3 16?
avx2 means: compiled with -mavx2; 512 means: compiled with
-march=x86-64-v4 (I usually did not measure those on machines that do
not support AVX-512, because I expected the binaries not to work; I
later measured some of clang's keylocks2-512 binaries on some of those
machines).  The number behind that is the keylocks[123].c variant, and
the number behind that (if present) is the UNROLL parameter.  I am not
sure about the UNROLL numbers used for clang, but in any case I kept
what performed best on Rocket Lake.  The number of instructions
executed (as reported on the Zen4) is:
instructions
5_779_542_242 gcc avx2 1
3_484_942_148 gcc avx2 2 8
5_885_742_164 gcc avx2 3 8
7_903_138_230 clang avx2 1
7_743_938_183 clang avx2 2 8?
3_625_338_104 clang avx2 3 8?
4_204_442_194 gcc 512 1
2_564_142_161 gcc 512 2 32
3_061_042_178 gcc 512 3 16
7_703_938_205 clang 512 1
3_402_238_102 clang 512 2 16?
3_320_455_741 clang 512 3 16?
For gcc -mavx2 on keylocks3.c on Zen4, an IPC of 6.44 is reported,
while microarchitecture descriptions report only a 6-wide renamer
<https://chipsandcheese.com/p/amds-zen-4-part-1-frontend-and-execution-engine>.
My guess is that the front end combined some instructions (maybe
compare and branch) into a macro-op, and the renamer then processed 6
macro-ops that represented more instructions.  The inner loop is
│190: vpbroadcastd (%rax),%ymm0
1.90 │ add $0x4,%rax
│ vpand %ymm2,%ymm0,%ymm0
1.09 │ vpcmpeqd %ymm3,%ymm0,%ymm0
0.41 │ vpsubd %ymm0,%ymm1,%ymm1
78.30 │ cmp %rdx,%rax
│ jne 190
and if the cmp and jne are combined into one macro-op, that would be
perfect for executing one iteration per cycle.
It's interesting that gcc's keylocks2-256 results in far fewer
instructions (and, in the end, cycles).  It unrolls the inner loop 8
times in order to process the keys in SIMD fashion, too, loading the
keys one ymm register at a time.  In order to do that it arranges the
locks in 8 different ymm registers in the outer loop, so the inner
loop performs 8 sequences similar to
vpand %ymm0,%ymm15,%ymm2
vpcmpeqd %ymm1,%ymm2,%ymm2
vpsubd %ymm2,%ymm4,%ymm4
surrounded by
300: vmovdqu (%rsi),%ymm0
add $0x20,%rsi
[8 3-instruction sequences]
cmp %rsi,%rdx
jne 300
It also uses 8 ymm accumulators, so not all of that fits into
registers, and three of the anded values are stored on the stack.  For
Zen4 this could be improved by using only 2 accumulators.  In any
case, the gcc people did something clever here; I do not understand
how they got there from the source code, nor why they did not get
there from keylocks1.c.
For clang's keylocks3-256 the inner loop and the outer loop are each
unrolled two times, resulting in an inner loop like:
190: vpbroadcastd (%r12,%rbx,4),%ymm5
vpand %ymm3,%ymm5,%ymm6
vpand %ymm4,%ymm5,%ymm5
vpcmpeqd %ymm1,%ymm5,%ymm5
vpsubd %ymm5,%ymm2,%ymm2
vpcmpeqd %ymm1,%ymm6,%ymm5
vpsubd %ymm5,%ymm0,%ymm0
vpbroadcastd 0x4(%r12,%rbx,4),%ymm5
vpand %ymm4,%ymm5,%ymm6
vpand %ymm3,%ymm5,%ymm5
vpcmpeqd %ymm1,%ymm5,%ymm5
vpsubd %ymm5,%ymm0,%ymm0
vpcmpeqd %ymm1,%ymm6,%ymm5
vpsubd %ymm5,%ymm2,%ymm2
add $0x2,%rbx
cmp %rbx,%rsi
jne 190
This results in the lowest AVX2 cycle counts, and I expect that one
can use this approach without the crash problems and without adding
too many cycles.  The clang -march=x86-64-v4 results come from similar
code (with twice as much inner-loop unrolling in the case of
keylocks3-512), but they all use only AVX2 instructions, and there
have been successful runs on a Zen2 (which does not support AVX-512).
It seems that clang either does not support AVX-512, or does not
understand -march=x86-64-v4 to allow more than AVX2.
The lowest number of executed instructions is with gcc's
keylocks2-512, where the inner loop is:
230: vpbroadcastd 0x4(%rax),%zmm4
vpbroadcastd (%rax),%zmm0
mov %edx,%r10d
add $0x8,%rax
add $0x2,%edx
vpandd %zmm4,%zmm8,%zmm5
vpandd %zmm0,%zmm8,%zmm9
vpandd %zmm4,%zmm6,%zmm4
vptestnmd %zmm5,%zmm5,%k1
vpandd %zmm0,%zmm6,%zmm0
vmovdqa32 %zmm7,%zmm5{%k1}{z}
vptestnmd %zmm9,%zmm9,%k1
vmovdqa32 %zmm3,%zmm9{%k1}{z}
vptestnmd %zmm4,%zmm4,%k1
vpsubd %zmm9,%zmm5,%zmm5
vpaddd %zmm5,%zmm2,%zmm2
vmovdqa32 %zmm7,%zmm4{%k1}{z}
vptestnmd %zmm0,%zmm0,%k1
vmovdqa32 %zmm3,%zmm0{%k1}{z}
vpsubd %zmm0,%zmm4,%zmm0
vpaddd %zmm0,%zmm1,%zmm1
cmp %r10d,%r8d
jne 230
Due to UNROLL=32, it deals with 2 zmm registers coming from the outer
loop at a time, and the inner loop is unrolled by a factor of 2, too.
It uses vptestnmd and a predicated vmovdqa32 instead of vpcmpeqd
(why?).  Anyway, the code seems to rub Zen4 the wrong way: it performs
at only 2.84 IPC, worse than the AVX2 code.  Rocket Lake performs
slightly better, but still, the clang code for keylocks2-512 runs a
bit faster without using AVX-512.
I also saw one case where the compiler botched it:
gcc -Wall -DUNROLL=16 -O3 -mavx2 -c keylocks3.c
[/tmp/keylock:155546] LC_NUMERIC=prog perf stat -e cycles -e instructions keylocks3-256
603800000
Performance counter stats for 'keylocks3-256':
17_476_700_581 cycles
39_480_242_683 instructions # 2.26 insn per cycle
3.506995312 seconds time elapsed
3.507020000 seconds user
0.000000000 seconds sys
(cycles and timings measured on the 8700G).  Here the compiler failed
to vectorize the comparisons and performed them using scalar
instructions (first extracting the data from the SIMD registers, and
finally inserting the results back into SIMD registers, with
additional overhead from spilling registers).  The result requires
about 10 times more instructions than the UNROLL=8 variant and almost
20 times more cycles.
On to timings per routine invocation:
On a 4.4GHz Haswell (whereas Michael S. measured on a 4GHz Haswell):
5.47us clang keylocks1-256 (5.66us for Michael S.'s "original code")
4.26us gcc keylocks1-256 (5.66us for Michael S.'s "original code")
2.38us gcc keylocks2-256 (2.18us for Michael S.'s manual vectorized code)
2.08us clang keylocks2-512 (2.18us for Michael S.'s manual vectorized code)
Michael S.'s "original code" performs similarly with clang to my
keylocks1.c.  clang's keylocks2-512 code is quite competitive with his
manual code.
On the Golden Cove of a Core i3-1315U (compared to the best result by
Terje Mathisen on a Core i7-1365U; the latter can run up to 5.2GHz
according to Intel, whereas the former can supposedly run up to
4.5GHz; I only ever measured at most 3.8GHz on our NUC, and this time
as well):
5.25us Terje Mathisen's Rust code compiled by clang (best on the 1365U)
4.93us clang keylocks1-256 on a 3.8GHz 1315U
4.17us gcc keylocks1-256 on a 3.8GHz 1315U
3.16us gcc keylocks2-256 on a 3.8GHz 1315U
2.38us clang keylocks2-512 on a 3.8GHz 1315U
I would have expected the clang keylocks1-256 to run slower, because
the compiler back-end is the same and the 1315U is slower. Measuring
cycles looks more relevant for this benchmark to me than measuring
time, especially on this core where AVX-512 is disabled and there is
no AVX slowdown.
- anton
-- 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.' Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>