Newsportal USENET - Re: Cost of handling misaligned access

Thanks Michael, your code looks similar to what I wrote when I tried to use intrinsics.
I also tried to extend both the locks and keys from 250 to 256, so that there would be zero tail overhead, but that was always slightly slower.
My 4-wide is pretty much the same as your core code below.
Terje
Michael S wrote:

On Fri, 7 Feb 2025 15:23:51 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

Michael S wrote:
On Fri, 7 Feb 2025 11:06:43 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

Michael S wrote:
On Thu, 6 Feb 2025 21:36:38 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
BTW, when I timed 1000 calls to that 5-6 us program, to get
around teh 100 ns timer resolution, each iteration ran in 5.23
us.
>
That measurement could be good enough on desktop. Or not.
It certainly not good enough on laptop and even less so on server.
On laptop I wouldn't be sutisfied before I lok my program to
particualr core, then do something like 21 measurements with 100K
calls in each measurement (~10 sec total) and report median of
21.
>
Each measurement did 1000 calls, then I ran 100 such measurements.
The 5.23 us value was the lowest seen among the 100, with average a
bit more:
>
>
Slowest: 9205200 ns
Fastest: 5247500 ns
Average: 5672529 ns/iter
Part1 = 3338
>
My own (old, but somewhat kept up to date) cputype program reported
that it is a "13th Gen Intel(R) Core(TM) i7-1365U" according to
CPUID.
>
Is that sufficient to judge the performance?
>
Terje

>
Not really.
i7-1365U is a complicated beast. 2 "big" cores, 8 "medium" cores.
Frequency varies ALOT, 1.8 to 5.2 GHz on "big", 1.3 to 3.9 GHz on
"medium".
>
OK. It seems like the big cores are similar to what I've had
previously, i.e. each core supports hyperthreading, while the medium
ones don't. This results in 12 HW threads.
>
As I said above, on such CPU I wouldn't believe the numbers before
total duration of test is 10 seconds and the test run is locked to
particular core. As to 5 msec per measurement, that's enough, but
why not do longer measurements if you have to run for 10 sec
anyway?
>
The Advent of Code task required exactly 250 keys and 250 locks to be
tested, this of course fits easily in a corner of $L1 (2000 bytes).
>
The input file to be parsed was 43*500 = 21500 bytes long, so this
should also fit in $L1 when I run repeated tests.
>
Under Windows I can set thread affinity to lock a process to a given
core, but how do I know which are "Big" and "Medium"?
Trial and error?
I think, big cores/threads tend to be with lower numbers, but I am not
sure it is universal.

>
Terje
>
In the mean time.
I did few measurements on Xeon E3 1271 v3. That is rather old uArch -
Haswell, the first core that supports AVX2. During the tests it was
running at 4.0 GHz.
1. Original code (rewritten in plain C) compiled with clang -O3
-march=ivybridge (no AVX2) 2. Original code (rewritten in plain C)
compiled with clang -O3 -march=haswell (AVX2) 3. Manually vectorized
AVX2 code compiled with clang -O3 -march=skylake (AVX2)
Results were as following (usec/call)
1 - 5.66
2 - 5.56
3 - 2.18
So, my measurements, similarly to your measurements, demonstrate that
clang autovectorized code looks good, but performs not too good.
Here is my manual code. Handling of the tail is too clever. I did not
have time to simplify. Otherwise, for 250x250 it should perform about
the same as simpler code.
#include <stdint.h>
#include <immintrin.h>
int foo_tst(const uint32_t* keylocks, int len, int li)
{
   if (li >= len || li <= 0)
   return 0;
   const uint32_t* keyx = &keylocks[li];
   unsigned ni = len - li;
   __m256i res0 = _mm256_setzero_si256();
   __m256i res1 = _mm256_setzero_si256();
   __m256i res2 = _mm256_setzero_si256();
   __m256i res3 = _mm256_setzero_si256();
   const uint32_t* keyx_last = &keyx[ni & -32];
   for (; keyx != keyx_last; keyx += 32) {
   __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyx[0*8]);
   __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyx[1*8]);
   __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyx[2*8]);
   __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyx[3*8]);
   // for (int k = 0; k < li; ++k) {
   // for (int k = 0, nk = li; nk > 0; ++k, --nk) {
   for (const uint32_t* keyy = keylocks; keyy != &keylocks[li];
++keyy) { // __m256i lockk =
_mm256_castps_si256(_mm256_broadcast_ss((const float*)&keylocks[k]));
__m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const
float*)keyy)); res0 = _mm256_sub_epi32(res0,
_mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0),
_mm256_setzero_si256())); res1 = _mm256_sub_epi32(res1,
_mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1),
_mm256_setzero_si256())); res2 = _mm256_sub_epi32(res2,
_mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2),
_mm256_setzero_si256())); res3 = _mm256_sub_epi32(res3,
_mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3),
_mm256_setzero_si256())); } } int res = 0; if (ni % 32) { uint32_t
tmp[32]; const uint32_t* keyy_last = &keylocks[li & -32]; if (li % 32) {
   for (int k = 0; k < li % 32; ++k)
   tmp[k] = keyy_last[k];
   for (int k = li % 32; k < 32; ++k)
   tmp[k] = (uint32_t)-1;
   }
   const uint32_t* keyx_last = &keyx[ni % 32];
   int nz = 0;
   for (; keyx != keyx_last; keyx += 1) {
   if (*keyx) {
   __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const
float*)keyx)); for (const uint32_t* keyy = keylocks; keyy != keyy_last;
keyy += 32) { __m256i lock0 = _mm256_loadu_si256((const
__m256i*)&keyy[0*8]); __m256i lock1 = _mm256_loadu_si256((const
__m256i*)&keyy[1*8]); __m256i lock2 = _mm256_loadu_si256((const
__m256i*)&keyy[2*8]); __m256i lock3 = _mm256_loadu_si256((const
__m256i*)&keyy[3*8]); res0 = _mm256_sub_epi32(res0,
_mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0),
_mm256_setzero_si256())); res1 = _mm256_sub_epi32(res1,
_mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1),
_mm256_setzero_si256())); res2 = _mm256_sub_epi32(res2,
_mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2),
_mm256_setzero_si256())); res3 = _mm256_sub_epi32(res3,
_mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3),
_mm256_setzero_si256())); } if (li % 32) { __m256i lock0 =
_mm256_loadu_si256((const __m256i*)&tmp[0*8]); __m256i lock1 =
_mm256_loadu_si256((const __m256i*)&tmp[1*8]); __m256i lock2 =
_mm256_loadu_si256((const __m256i*)&tmp[2*8]); __m256i lock3 =
_mm256_loadu_si256((const __m256i*)&tmp[3*8]); res0 =
_mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk,
lock0), _mm256_setzero_si256())); res1 = _mm256_sub_epi32(res1,
_mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1),
_mm256_setzero_si256())); res2 = _mm256_sub_epi32(res2,
_mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2),
_mm256_setzero_si256())); res3 = _mm256_sub_epi32(res3,
_mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3),
_mm256_setzero_si256())); } } else { nz += 1; } } res = nz * li; }
   // fold accumulators
   res0 = _mm256_add_epi32(res0, res2);
   res1 = _mm256_add_epi32(res1, res3);
   res0 = _mm256_add_epi32(res0, res1);
   res0 = _mm256_hadd_epi32(res0, res0);
   res0 = _mm256_hadd_epi32(res0, res0);
res += _mm256_extract_epi32(res0, 0);
   res += _mm256_extract_epi32(res0, 4);
   return res;
}

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Date	Sujet	#	Auteur
2 Feb 25	Re: Cost of handling misaligned access	112	BGB
3 Feb 25	Re: Cost of handling misaligned access	2	MitchAlsup1
3 Feb 25	Re: Cost of handling misaligned access	1	BGB
3 Feb 25	Re: Cost of handling misaligned access	109	Anton Ertl
3 Feb 25	Re: Cost of handling misaligned access	11	BGB
3 Feb 25	Re: Cost of handling misaligned access	10	Anton Ertl
3 Feb 25	Re: Cost of handling misaligned access	1	BGB
3 Feb 25	Re: Cost of handling misaligned access	8	Thomas Koenig
4 Feb 25	Re: Cost of handling misaligned access	7	Anton Ertl
4 Feb 25	Re: Cost of handling misaligned access	5	Thomas Koenig
4 Feb 25	Re: Cost of handling misaligned access	4	Anton Ertl
4 Feb 25	Re: Cost of handling misaligned access	2	Thomas Koenig
10 Feb 25	Re: Cost of handling misaligned access	1	Mike Stump
10 Feb 25	Re: Cost of handling misaligned access	1	Mike Stump
4 Feb 25	Re: Cost of handling misaligned access	1	MitchAlsup1
3 Feb 25	Re: Cost of handling misaligned access	3	Thomas Koenig
3 Feb 25	Re: Cost of handling misaligned access	2	BGB
3 Feb 25	Re: Cost of handling misaligned access	1	MitchAlsup1
4 Feb 25	Re: Cost of handling misaligned access	41	Anton Ertl
5 Feb 25	Re: Cost of handling misaligned access	40	Terje Mathisen
5 Feb 25	Re: Cost of handling misaligned access	4	Anton Ertl
5 Feb 25	Re: Cost of handling misaligned access	2	Terje Mathisen
6 Feb 25	Re: Cost of handling misaligned access	1	Anton Ertl
6 Feb 25	Re: Cost of handling misaligned access	1	Anton Ertl
5 Feb 25	Re: Cost of handling misaligned access	35	Michael S
6 Feb 25	Re: Cost of handling misaligned access	32	Anton Ertl
6 Feb 25	Re: Cost of handling misaligned access	31	Michael S
6 Feb 25	Re: Cost of handling misaligned access	2	Anton Ertl
6 Feb 25	Re: Cost of handling misaligned access	1	Michael S
6 Feb 25	Re: Cost of handling misaligned access	28	Terje Mathisen
6 Feb 25	Re: Cost of handling misaligned access	27	Terje Mathisen
6 Feb 25	Re: Cost of handling misaligned access	26	Michael S
6 Feb 25	Re: Cost of handling misaligned access	25	Terje Mathisen
6 Feb 25	Re: Cost of handling misaligned access	19	Michael S
7 Feb 25	Re: Cost of handling misaligned access	18	Terje Mathisen
7 Feb 25	Re: Cost of handling misaligned access	17	Michael S
7 Feb 25	Re: Cost of handling misaligned access	16	Terje Mathisen
7 Feb 25	Re: Cost of handling misaligned access	15	Michael S
7 Feb 25	Re: Cost of handling misaligned access	1	Terje Mathisen
7 Feb 25	Re: Cost of handling misaligned access	3	MitchAlsup1
8 Feb 25	Re: Cost of handling misaligned access	1	Terje Mathisen
8 Feb 25	Re: Cost of handling misaligned access	1	Michael S
8 Feb 25	Re: Cost of handling misaligned access	10	Anton Ertl
8 Feb 25	Re: Cost of handling misaligned access	1	Terje Mathisen
8 Feb 25	Re: Cost of handling misaligned access	6	Michael S
8 Feb 25	Re: Cost of handling misaligned access	5	Anton Ertl
8 Feb 25	Re: Cost of handling misaligned access	1	Michael S
9 Feb 25	Re: Cost of handling misaligned access	2	Michael S
11 Feb 25	Re: Cost of handling misaligned access	1	Michael S
9 Feb 25	Re: Cost of handling misaligned access	1	Michael S
9 Feb 25	Re: Cost of handling misaligned access	1	Michael S
10 Feb 25	Re: Cost of handling misaligned access	1	Michael S
7 Feb 25	Re: Cost of handling misaligned access	5	BGB
7 Feb 25	Re: Cost of handling misaligned access	4	MitchAlsup1
7 Feb 25	Re: Cost of handling misaligned access	3	BGB
8 Feb 25	Re: Cost of handling misaligned access	2	Anssi Saari
8 Feb 25	Re: Cost of handling misaligned access	1	BGB
6 Feb 25	Re: Cost of handling misaligned access	2	Terje Mathisen
6 Feb 25	Re: Cost of handling misaligned access	1	Michael S
6 Feb 25	Re: Cost of handling misaligned access	5	Waldek Hebisch
6 Feb 25	Re: Cost of handling misaligned access	3	Anton Ertl
6 Feb 25	Re: Cost of handling misaligned access	2	Waldek Hebisch
6 Feb 25	Re: Cost of handling misaligned access	1	Anton Ertl
6 Feb 25	Re: Cost of handling misaligned access	1	Terje Mathisen
13 Feb 25	Re: Cost of handling misaligned access	48	Marcus
13 Feb 25	Re: Cost of handling misaligned access	1	Thomas Koenig
14 Feb 25	Re: Cost of handling misaligned access	41	BGB
14 Feb 25	Re: Cost of handling misaligned access	40	MitchAlsup1
18 Feb 25	Re: Cost of handling misaligned access	39	BGB
18 Feb 25	Re: Cost of handling misaligned access	33	MitchAlsup1
18 Feb 25	Re: Cost of handling misaligned access	1	BGB
18 Feb 25	Re: Cost of handling misaligned access	31	Michael S
18 Feb 25	Re: Cost of handling misaligned access	1	Thomas Koenig
18 Feb 25	Re: Cost of handling misaligned access	26	MitchAlsup1
18 Feb 25	Re: Cost of handling misaligned access	25	Terje Mathisen
18 Feb 25	Re: Cost of handling misaligned access	24	MitchAlsup1
19 Feb 25	Re: Cost of handling misaligned access	23	Terje Mathisen
19 Feb 25	Re: Cost of handling misaligned access	22	MitchAlsup1
19 Feb 25	Re: Cost of handling misaligned access	21	BGB
20 Feb 25	Re: Cost of handling misaligned access	1	Robert Finch
20 Feb 25	Re: Cost of handling misaligned access	5	MitchAlsup1
20 Feb 25	Re: Cost of handling misaligned access	2	BGB
20 Feb 25	Re: Cost of handling misaligned access	1	BGB
21 Feb 25	Re: Cost of handling misaligned access	2	Robert Finch
21 Feb 25	Re: Cost of handling misaligned access	1	BGB
21 Feb 25	Re: Cost of handling misaligned access	14	BGB
22 Feb 25	Re: Cost of handling misaligned access	1	Robert Finch
22 Feb 25	Re: Cost of handling misaligned access	12	Robert Finch
23 Feb 25	Re: Cost of handling misaligned access	10	BGB
23 Feb 25	Re: Cost of handling misaligned access	9	Michael S
24 Feb 25	Re: Cost of handling misaligned access	1	BGB
24 Feb 25	Re: Cost of handling misaligned access	7	Michael S
24 Feb 25	Re: Cost of handling misaligned access	4	Robert Finch
24 Feb 25	Re: Cost of handling misaligned access	1	BGB
24 Feb 25	Re: Cost of handling misaligned access	2	MitchAlsup1
25 Feb 25	Re: Cost of handling misaligned access	1	BGB
25 Feb 25	Re: Cost of handling misaligned access	2	MitchAlsup1
25 Feb 25	Re: Cost of handling misaligned access	1	BGB
23 Feb 25	Re: Cost of handling misaligned access	1	Robert Finch
18 Feb 25	Re: Cost of handling misaligned access	3	BGB
19 Feb 25	Re: Cost of handling misaligned access	2	MitchAlsup1
18 Feb 25	Re: Cost of handling misaligned access	5	Robert Finch
17 Feb 25	Re: Cost of handling misaligned access	5	Terje Mathisen