On Sat, 08 Feb 2025 17:46:32 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> Michael S <already5chosen@yahoo.com> writes:
> > On Sat, 08 Feb 2025 08:11:04 GMT
> > anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> > That's very disappointing. Haswell has a 4-wide front end, and the
> > majority of AVX2 integer instructions are limited to a throughput of
> > two per clock. Golden Cove has a 5+ wide front end, and nearly all
> > AVX2 integer instructions have a throughput of three per clock.
> > Could it be that clang introduced some sort of latency bottleneck?
> As far as I looked into the code, I did not see such a bottleneck.
> Also, Zen4 has significantly higher IPC on this variant (5.36 IPC for
> clang keylocks2-256), and I expect that it would suffer from a general
> latency bottleneck, too. Rocket Lake is also faster on this program
> than Haswell and Golden Cove. It seems to be just that this program
> rubs Golden Cove the wrong way.

Did you look at the code in the outer loop as well?
The number of iterations in the inner loop is not huge, so excessive
folding of the accumulators in the outer loop could be a problem too.
Theoretically it shouldn't be, but somehow it could be.
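To make the concern concrete, here is a minimal scalar sketch of the
distinction I have in mind (purely illustrative; the function names and
the scalar form are made up, and it is not a claim about what clang
actually generates). The first variant keeps the four partial sums live
across the whole outer loop and folds them once at the end; the second
folds and restarts them in every outer iteration, which is overhead that
a small inner trip count (ny, playing the role of li) cannot amortize.

#include <stdint.h>

/* accumulators stay live across the whole outer loop, folded once */
int count_fold_once(const uint32_t* x, int nx,
                    const uint32_t* y, int ny)
{
  int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
  for (int i = 0; i < nx; i += 4)      /* nx assumed divisible by 4 */
    for (int j = 0; j < ny; j++) {
      s0 += (x[i+0] & y[j]) == 0;
      s1 += (x[i+1] & y[j]) == 0;
      s2 += (x[i+2] & y[j]) == 0;
      s3 += (x[i+3] & y[j]) == 0;
    }
  return s0 + s1 + s2 + s3;            /* single fold at the end */
}

/* same work, but the accumulators are folded and restarted in every
   outer iteration; with a small inner trip count the extra folding
   is never amortized */
int count_fold_each(const uint32_t* x, int nx,
                    const uint32_t* y, int ny)
{
  int total = 0;
  for (int i = 0; i < nx; i += 4) {    /* nx assumed divisible by 4 */
    int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int j = 0; j < ny; j++) {
      s0 += (x[i+0] & y[j]) == 0;
      s1 += (x[i+1] & y[j]) == 0;
      s2 += (x[i+2] & y[j]) == 0;
      s3 += (x[i+3] & y[j]) == 0;
    }
    total += s0 + s1 + s2 + s3;        /* fold on every outer iteration */
  }
  return total;
}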
And if you still didn't manage to get my source compiled, here is
another version, slightly less clever, but more importantly, formatted
with shorter lines:
#include <stdint.h>
#include <immintrin.h>
// broadcast one 32-bit word to all 8 lanes of a 256-bit vector
#define BROADCAST_u32(p) \
  _mm256_castps_si256(_mm256_broadcast_ss((const float*)(p)))
// increment each 32-bit lane of acc where (x & y) == 0
#define ADD_NZ(acc, x, y) \
  _mm256_sub_epi32(acc, _mm256_cmpeq_epi32( \
    _mm256_and_si256(x, y), _mm256_setzero_si256()))
int foo_tst(const uint32_t* keylocks, int len, int li)
{
  if (li >= len || li <= 0)
    return 0;

  const uint32_t* px = &keylocks[li];
  unsigned nx = len - li;
  __m256i res0 = _mm256_setzero_si256();
  __m256i res1 = _mm256_setzero_si256();
  __m256i res2 = _mm256_setzero_si256();
  __m256i res3 = _mm256_setzero_si256();

  int nx1 = nx & 31;
  if (nx1) {
    const uint32_t* px_last = &px[nx1];
    // process head, 8 x values per loop
    // a sliding window into this table gives a maskload mask with the
    // first 8-rem0 lanes active and the remaining lanes inactive
    static const int32_t masks[15] = {
      -1, -1, -1, -1, -1, -1, -1, -1,
       0,  0,  0,  0,  0,  0,  0,
    };
    int rem0 = (-nx) & 7;
    __m256i mask = _mm256_loadu_si256((const __m256i*)&masks[rem0]);
    __m256i x = _mm256_maskload_epi32((const int*)px, mask);
    px += 8 - rem0;
    const uint32_t* py1 = &keylocks[li & -4];
    const uint32_t* py2 = &keylocks[li];
    for (;;) {
      // one pass over keylocks[0..li) for the current x vector,
      // unrolled by 4, remainder handled one y at a time
      const uint32_t* py;
      for (py = keylocks; py != py1; py += 4) {
        res0 = ADD_NZ(res0, x, BROADCAST_u32(&py[0]));
        res1 = ADD_NZ(res1, x, BROADCAST_u32(&py[1]));
        res2 = ADD_NZ(res2, x, BROADCAST_u32(&py[2]));
        res3 = ADD_NZ(res3, x, BROADCAST_u32(&py[3]));
      }
      for (; py != py2; py += 1)
        res0 = ADD_NZ(res0, x, BROADCAST_u32(py));
      if (px == px_last)
        break;
      x = _mm256_loadu_si256((const __m256i*)px);
      px += 8;
    }
  }

  // main loop: 32 x values per iteration
  int nx2 = nx & -32;
  const uint32_t* px_last = &px[nx2];
  for (; px != px_last; px += 32) {
    __m256i x0 = _mm256_loadu_si256((const __m256i*)&px[0*8]);
    __m256i x1 = _mm256_loadu_si256((const __m256i*)&px[1*8]);
    __m256i x2 = _mm256_loadu_si256((const __m256i*)&px[2*8]);
    __m256i x3 = _mm256_loadu_si256((const __m256i*)&px[3*8]);
    for (const uint32_t* py = keylocks; py != &keylocks[li]; ++py) {
      __m256i y = BROADCAST_u32(py);
      res0 = ADD_NZ(res0, y, x0);
      res1 = ADD_NZ(res1, y, x1);
      res2 = ADD_NZ(res2, y, x2);
      res3 = ADD_NZ(res3, y, x3);
    }
  }

  // fold accumulators
  res0 = _mm256_add_epi32(res0, res2);
  res1 = _mm256_add_epi32(res1, res3);
  res0 = _mm256_add_epi32(res0, res1);
  res0 = _mm256_hadd_epi32(res0, res0);
  res0 = _mm256_hadd_epi32(res0, res0);
  int res = _mm256_extract_epi32(res0, 0)
          + _mm256_extract_epi32(res0, 4);
  // the (-nx & 7) inactive lanes of the masked head load were zero,
  // so they counted as a hit for every y; subtract that contribution
  return res - (-nx & 7) * li;
}
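
If it helps with checking the compiled code against something simpler,
here is a small scalar reference with the same semantics (count pairs
keylocks[i] & keylocks[j] == 0 with i in [li, len) and j in [0, li))
plus a throwaway test harness; the test sizes and data are made up.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int foo_tst(const uint32_t* keylocks, int len, int li);

// scalar reference: pairs (i in [li,len), j in [0,li)) whose AND is 0
int foo_ref(const uint32_t* keylocks, int len, int li)
{
  if (li >= len || li <= 0)
    return 0;
  int res = 0;
  for (int i = li; i < len; i++)
    for (int j = 0; j < li; j++)
      res += (keylocks[i] & keylocks[j]) == 0;
  return res;
}

int main(void)
{
  enum { LEN = 1000, LI = 250 };       // made-up test sizes
  static uint32_t a[LEN];
  srand(1);
  for (int i = 0; i < LEN; i++)
    a[i] = rand() & 0x7fff;            // made-up test data
  int r = foo_ref(a, LEN, LI);
  int t = foo_tst(a, LEN, LI);
  printf("ref=%d tst=%d %s\n", r, t, r == t ? "OK" : "MISMATCH");
  return r != t;
}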