Re: Cost of handling misaligned access

Liste des GroupesRevenir à c arch 
Sujet : Re: Cost of handling misaligned access
De : already5chosen (at) *nospam* yahoo.com (Michael S)
Groupes : comp.arch
Date : 09. Feb 2025, 13:17:44
Autres entêtes
Organisation : A noiseless patient Spider
Message-ID : <20250209141744.00007a4a@yahoo.com>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
User-Agent : Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
On Sat, 08 Feb 2025 17:46:32 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

Michael S <already5chosen@yahoo.com> writes:
On Sat, 08 Feb 2025 08:11:04 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
 
That's very disappointing. Haswell has 4-wide front
end and majority of AVX2 integer instruction is limited to throughput
of two per clock. Golden Cove has 5+ wide front end and nearly all
AVX2 integer instruction have throughput of three per clock.
Could it be that clang introduced some sort of latency bottleneck? 
 
As far as I looked into the code, I did not see such a bottleneck.
Also, Zen4 has significantly higher IPC on this variant (5.36 IPC for
clang keylocks2-256), and I expect that it would suffer from a general
latency bottleneck, too.  Rocket Lake is also faster on this program
than Haswell and Golden Cove.  It seems to be just that this program
rubs Golden Cove the wrong way.
 

Did you look at the code in the outer loop as well?
The number of iterations in the inner loop is not huge, so excessive
folding of accumulators in the outer loop could be a problem too.
It shouldn't, theoretically, but somehow it could.

And if you still didn't manage to get my source compiled, here is
another version, slightly less clever, but more importantly, formatted
with shorter lines:

#include <stdint.h>
#include <immintrin.h>

#define BROADCAST_u32(p) \
  _mm256_castps_si256(_mm256_broadcast_ss((const float*)(p)))

#define ADD_NZ(acc, x, y) _mm256_sub_epi32(acc, _mm256_cmpeq_epi32 \
  (_mm256_and_si256(x, y), _mm256_setzero_si256()))

int foo_tst(const uint32_t* keylocks, int len, int li)
{
  if (li >= len || li <= 0)
    return 0;
  const uint32_t* px = &keylocks[li];
  unsigned nx = len - li;
  __m256i res0 = _mm256_setzero_si256();
  __m256i res1 = _mm256_setzero_si256();
  __m256i res2 = _mm256_setzero_si256();
  __m256i res3 = _mm256_setzero_si256();

  int nx1 = nx & 31;
  if (nx1) {
    const uint32_t* px_last = &px[nx1];
    // process head, 8 x values per loop
    static const int32_t masks[15] = {
      -1, -1, -1, -1,  -1, -1, -1, -1,
       0,  0,  0,  0,   0,  0,  0,
    };
    int rem0 = (-nx) & 7;
    __m256i mask = _mm256_loadu_si256((const __m256i*)&masks[rem0]);
    __m256i x = _mm256_maskload_epi32((const int*)px, mask);
    px += 8 - rem0;
    const uint32_t* py1 = &keylocks[li & -4];
    const uint32_t* py2 = &keylocks[li];
    for (;;) {
      const uint32_t* py;
      for (py = keylocks; py != py1; py += 4) {
        res0 = ADD_NZ(res0, x, BROADCAST_u32(&py[0]));
        res1 = ADD_NZ(res1, x, BROADCAST_u32(&py[1]));
        res2 = ADD_NZ(res2, x, BROADCAST_u32(&py[2]));
        res3 = ADD_NZ(res3, x, BROADCAST_u32(&py[3]));
      }
      for (; py != py2; py += 1)
        res0 = ADD_NZ(res0, x, BROADCAST_u32(py));
      if (px == px_last)
        break;
      x = _mm256_loadu_si256((const __m256i*)px);
      px += 8;
    }
  }

  int nx2 = nx & -32;
  const uint32_t* px_last = &px[nx2];
  for (; px != px_last; px += 32) {
    __m256i x0 = _mm256_loadu_si256((const __m256i*)&px[0*8]);
    __m256i x1 = _mm256_loadu_si256((const __m256i*)&px[1*8]);
    __m256i x2 = _mm256_loadu_si256((const __m256i*)&px[2*8]);
    __m256i x3 = _mm256_loadu_si256((const __m256i*)&px[3*8]);
    for (const uint32_t* py = keylocks; py != &keylocks[li]; ++py) {
      __m256i y = BROADCAST_u32(py);
      res0 = ADD_NZ(res0, y, x0);
      res1 = ADD_NZ(res1, y, x1);
      res2 = ADD_NZ(res2, y, x2);
      res3 = ADD_NZ(res3, y, x3);
    }
  }
  // fold accumulators
  res0 = _mm256_add_epi32(res0, res2);
  res1 = _mm256_add_epi32(res1, res3);
  res0 = _mm256_add_epi32(res0, res1);
  res0 = _mm256_hadd_epi32(res0, res0);
  res0 = _mm256_hadd_epi32(res0, res0);
  int res = _mm256_extract_epi32(res0, 0)
          + _mm256_extract_epi32(res0, 4);
  return res - (-nx & 7) * li;
}


Date Sujet#  Auteur
2 Feb 25 * Re: Cost of handling misaligned access112BGB
3 Feb 25 +* Re: Cost of handling misaligned access2MitchAlsup1
3 Feb 25 i`- Re: Cost of handling misaligned access1BGB
3 Feb 25 `* Re: Cost of handling misaligned access109Anton Ertl
3 Feb 25  +* Re: Cost of handling misaligned access11BGB
3 Feb 25  i`* Re: Cost of handling misaligned access10Anton Ertl
3 Feb 25  i +- Re: Cost of handling misaligned access1BGB
3 Feb 25  i `* Re: Cost of handling misaligned access8Thomas Koenig
4 Feb 25  i  `* Re: Cost of handling misaligned access7Anton Ertl
4 Feb 25  i   +* Re: Cost of handling misaligned access5Thomas Koenig
4 Feb 25  i   i`* Re: Cost of handling misaligned access4Anton Ertl
4 Feb 25  i   i +* Re: Cost of handling misaligned access2Thomas Koenig
10 Feb 25  i   i i`- Re: Cost of handling misaligned access1Mike Stump
10 Feb 25  i   i `- Re: Cost of handling misaligned access1Mike Stump
4 Feb 25  i   `- Re: Cost of handling misaligned access1MitchAlsup1
3 Feb 25  +* Re: Cost of handling misaligned access3Thomas Koenig
3 Feb 25  i`* Re: Cost of handling misaligned access2BGB
3 Feb 25  i `- Re: Cost of handling misaligned access1MitchAlsup1
4 Feb 25  +* Re: Cost of handling misaligned access41Anton Ertl
5 Feb 25  i`* Re: Cost of handling misaligned access40Terje Mathisen
5 Feb 25  i +* Re: Cost of handling misaligned access4Anton Ertl
5 Feb 25  i i+* Re: Cost of handling misaligned access2Terje Mathisen
6 Feb 25  i ii`- Re: Cost of handling misaligned access1Anton Ertl
6 Feb 25  i i`- Re: Cost of handling misaligned access1Anton Ertl
5 Feb 25  i `* Re: Cost of handling misaligned access35Michael S
6 Feb 25  i  +* Re: Cost of handling misaligned access32Anton Ertl
6 Feb 25  i  i`* Re: Cost of handling misaligned access31Michael S
6 Feb 25  i  i +* Re: Cost of handling misaligned access2Anton Ertl
6 Feb 25  i  i i`- Re: Cost of handling misaligned access1Michael S
6 Feb 25  i  i `* Re: Cost of handling misaligned access28Terje Mathisen
6 Feb 25  i  i  `* Re: Cost of handling misaligned access27Terje Mathisen
6 Feb 25  i  i   `* Re: Cost of handling misaligned access26Michael S
6 Feb 25  i  i    `* Re: Cost of handling misaligned access25Terje Mathisen
6 Feb 25  i  i     +* Re: Cost of handling misaligned access19Michael S
7 Feb 25  i  i     i`* Re: Cost of handling misaligned access18Terje Mathisen
7 Feb 25  i  i     i `* Re: Cost of handling misaligned access17Michael S
7 Feb 25  i  i     i  `* Re: Cost of handling misaligned access16Terje Mathisen
7 Feb 25  i  i     i   `* Re: Cost of handling misaligned access15Michael S
7 Feb 25  i  i     i    +- Re: Cost of handling misaligned access1Terje Mathisen
7 Feb 25  i  i     i    +* Re: Cost of handling misaligned access3MitchAlsup1
8 Feb 25  i  i     i    i+- Re: Cost of handling misaligned access1Terje Mathisen
8 Feb 25  i  i     i    i`- Re: Cost of handling misaligned access1Michael S
8 Feb 25  i  i     i    `* Re: Cost of handling misaligned access10Anton Ertl
8 Feb 25  i  i     i     +- Re: Cost of handling misaligned access1Terje Mathisen
8 Feb 25  i  i     i     +* Re: Cost of handling misaligned access6Michael S
8 Feb 25  i  i     i     i`* Re: Cost of handling misaligned access5Anton Ertl
8 Feb 25  i  i     i     i +- Re: Cost of handling misaligned access1Michael S
9 Feb 25  i  i     i     i +* Re: Cost of handling misaligned access2Michael S
11 Feb 25  i  i     i     i i`- Re: Cost of handling misaligned access1Michael S
9 Feb 25  i  i     i     i `- Re: Cost of handling misaligned access1Michael S
9 Feb 25  i  i     i     +- Re: Cost of handling misaligned access1Michael S
10 Feb 25  i  i     i     `- Re: Cost of handling misaligned access1Michael S
7 Feb 25  i  i     `* Re: Cost of handling misaligned access5BGB
7 Feb 25  i  i      `* Re: Cost of handling misaligned access4MitchAlsup1
7 Feb 25  i  i       `* Re: Cost of handling misaligned access3BGB
8 Feb 25  i  i        `* Re: Cost of handling misaligned access2Anssi Saari
8 Feb 25  i  i         `- Re: Cost of handling misaligned access1BGB
6 Feb 25  i  `* Re: Cost of handling misaligned access2Terje Mathisen
6 Feb 25  i   `- Re: Cost of handling misaligned access1Michael S
6 Feb 25  +* Re: Cost of handling misaligned access5Waldek Hebisch
6 Feb 25  i+* Re: Cost of handling misaligned access3Anton Ertl
6 Feb 25  ii`* Re: Cost of handling misaligned access2Waldek Hebisch
6 Feb 25  ii `- Re: Cost of handling misaligned access1Anton Ertl
6 Feb 25  i`- Re: Cost of handling misaligned access1Terje Mathisen
13 Feb 25  `* Re: Cost of handling misaligned access48Marcus
13 Feb 25   +- Re: Cost of handling misaligned access1Thomas Koenig
14 Feb 25   +* Re: Cost of handling misaligned access41BGB
14 Feb 25   i`* Re: Cost of handling misaligned access40MitchAlsup1
18 Feb 25   i `* Re: Cost of handling misaligned access39BGB
18 Feb 25   i  +* Re: Cost of handling misaligned access33MitchAlsup1
18 Feb 25   i  i+- Re: Cost of handling misaligned access1BGB
18 Feb 25   i  i`* Re: Cost of handling misaligned access31Michael S
18 Feb 25   i  i +- Re: Cost of handling misaligned access1Thomas Koenig
18 Feb 25   i  i +* Re: Cost of handling misaligned access26MitchAlsup1
18 Feb 25   i  i i`* Re: Cost of handling misaligned access25Terje Mathisen
18 Feb 25   i  i i `* Re: Cost of handling misaligned access24MitchAlsup1
19 Feb 25   i  i i  `* Re: Cost of handling misaligned access23Terje Mathisen
19 Feb 25   i  i i   `* Re: Cost of handling misaligned access22MitchAlsup1
19 Feb 25   i  i i    `* Re: Cost of handling misaligned access21BGB
20 Feb 25   i  i i     +- Re: Cost of handling misaligned access1Robert Finch
20 Feb 25   i  i i     +* Re: Cost of handling misaligned access5MitchAlsup1
20 Feb 25   i  i i     i+* Re: Cost of handling misaligned access2BGB
20 Feb 25   i  i i     ii`- Re: Cost of handling misaligned access1BGB
21 Feb 25   i  i i     i`* Re: Cost of handling misaligned access2Robert Finch
21 Feb 25   i  i i     i `- Re: Cost of handling misaligned access1BGB
21 Feb 25   i  i i     `* Re: Cost of handling misaligned access14BGB
22 Feb 25   i  i i      +- Re: Cost of handling misaligned access1Robert Finch
22 Feb 25   i  i i      `* Re: Cost of handling misaligned access12Robert Finch
23 Feb 25   i  i i       +* Re: Cost of handling misaligned access10BGB
23 Feb 25   i  i i       i`* Re: Cost of handling misaligned access9Michael S
24 Feb 25   i  i i       i +- Re: Cost of handling misaligned access1BGB
24 Feb 25   i  i i       i `* Re: Cost of handling misaligned access7Michael S
24 Feb 25   i  i i       i  +* Re: Cost of handling misaligned access4Robert Finch
24 Feb 25   i  i i       i  i+- Re: Cost of handling misaligned access1BGB
24 Feb 25   i  i i       i  i`* Re: Cost of handling misaligned access2MitchAlsup1
25 Feb 25   i  i i       i  i `- Re: Cost of handling misaligned access1BGB
25 Feb 25   i  i i       i  `* Re: Cost of handling misaligned access2MitchAlsup1
25 Feb 25   i  i i       i   `- Re: Cost of handling misaligned access1BGB
23 Feb 25   i  i i       `- Re: Cost of handling misaligned access1Robert Finch
18 Feb 25   i  i `* Re: Cost of handling misaligned access3BGB
19 Feb 25   i  i  `* Re: Cost of handling misaligned access2MitchAlsup1
18 Feb 25   i  `* Re: Cost of handling misaligned access5Robert Finch
17 Feb 25   `* Re: Cost of handling misaligned access5Terje Mathisen

Haut de la page

Les messages affichés proviennent d'usenet.

NewsPortal