On Fri, 7 Feb 2025 15:23:51 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
Trial and error?
On Fri, 7 Feb 2025 11:06:43 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Thu, 6 Feb 2025 21:36:38 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
BTW, when I timed 1000 calls to that 5-6 us program, to get
around the 100 ns timer resolution, each iteration ran in 5.23
us.
That measurement could be good enough on a desktop. Or not.
It is certainly not good enough on a laptop, and even less so on a server.
On a laptop I wouldn't be satisfied before I lock my program to a
particular core, then do something like 21 measurements with 100K
calls in each measurement (~10 sec total) and report the median of
21.
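The median-of-21 scheme described above could be sketched roughly as follows. This is only an illustration, not code from the thread: the names `median_u64`, `now_ns`, `median_ns_per_call` and `MAX_RUNS` are made up here, the timer is C11 `timespec_get` (wall-clock, so an assumption that the clock does not step during the run), and pinning to a core is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

enum { MAX_RUNS = 64 };  /* hypothetical upper bound on the number of measurements */

/* qsort comparator for uint64_t values. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Median of n samples; sorts v in place. For even n the upper-middle
   element is returned. */
static uint64_t median_u64(uint64_t *v, size_t n)
{
    assert(n > 0);
    qsort(v, n, sizeof v[0], cmp_u64);
    return v[n / 2];
}

/* Current time in nanoseconds. Assumption: C11 timespec_get is available.
   It reads the wall clock, so a clock adjustment during a run would
   corrupt that sample. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Takes `runs` measurements of `calls` back-to-back calls to fn and
   returns the median measurement divided by `calls` (ns per call). */
static double median_ns_per_call(void (*fn)(void), int runs, int calls)
{
    uint64_t t[MAX_RUNS];
    assert(runs > 0 && runs <= MAX_RUNS && calls > 0);
    for (int r = 0; r < runs; ++r) {
        uint64_t t0 = now_ns();
        for (int i = 0; i < calls; ++i)
            fn();
        t[r] = now_ns() - t0;
    }
    return (double)median_u64(t, (size_t)runs) / calls;
}
```

With a wrapper `work()` around the routine under test, `median_ns_per_call(work, 21, 100000)` would correspond to the 21 x 100K plan. The median is used rather than the minimum or the mean because a few slow runs (frequency ramp-up, migration, interrupts) do not move it.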
Each measurement did 1000 calls, then I ran 100 such measurements.
The 5.23 us value was the lowest seen among the 100, with the average a
bit higher:

Slowest: 9205200 ns
Fastest: 5247500 ns
Average: 5672529 ns/iter
Part1 = 3338

My own (old, but somewhat kept up to date) cputype program reported
that it is a "13th Gen Intel(R) Core(TM) i7-1365U" according to
CPUID.

Is that sufficient to judge the performance?

Terje
Not really.
i7-1365U is a complicated beast: 2 "big" cores, 8 "medium" cores.
Frequency varies a lot, 1.8 to 5.2 GHz on "big", 1.3 to 3.9 GHz on
"medium".
OK. It seems like the big cores are similar to what I've had
previously, i.e. each core supports hyperthreading, while the medium
ones don't. This results in 12 HW threads.
As I said above, on such a CPU I wouldn't believe the numbers unless the
total duration of the test is 10 seconds and the test run is locked to a
particular core. As to 5 msec per measurement, that's enough, but
why not do longer measurements if you have to run for 10 sec
anyway?
The Advent of Code task required exactly 250 keys and 250 locks to be
tested; this of course fits easily in a corner of $L1 (2000 bytes).

The input file to be parsed was 43*500 = 21500 bytes long, so this
should also fit in $L1 when I run repeated tests.
>
Under Windows I can set thread affinity to lock a process to a given
core, but how do I know which are "Big" and "Medium"?
I think big cores/threads tend to have the lower numbers, but I am not
sure that is universal.
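One way to answer the big-vs-medium question without guessing from core numbers is to ask the core itself. On Intel hybrid parts, CPUID leaf 07H sets EDX bit 15 ("hybrid"), and leaf 1AH returns the core type in EAX bits 31:24 (20H = Intel Atom, i.e. the "medium" cores here; 40H = Intel Core, i.e. the "big" ones). The sketch below assumes GCC or clang on x86 and their `<cpuid.h>` helper; the names `core_kind`, `decode_core_type` and `current_core_kind` are made up for this example. With MSVC one would use `__cpuidex` from `<intrin.h>` instead.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/clang CPUID helpers (assumption: not MSVC) */
#endif

enum core_kind { CORE_UNKNOWN = 0, CORE_ATOM, CORE_CORE };

/* Decodes the core type field, bits 31:24 of CPUID.1AH:EAX. */
static enum core_kind decode_core_type(uint32_t leaf1a_eax)
{
    switch (leaf1a_eax >> 24) {
    case 0x20: return CORE_ATOM;   /* "medium" (E) core */
    case 0x40: return CORE_CORE;   /* "big" (P) core */
    default:   return CORE_UNKNOWN;
    }
}

/* Reports the type of the core this thread is running on right now.
   Returns CORE_UNKNOWN on non-hybrid CPUs or non-x86 builds. */
static enum core_kind current_core_kind(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned a = 0, b = 0, c = 0, d = 0;
    /* CPUID.07H.0:EDX[15] is the hybrid flag. */
    if (!__get_cpuid_count(7, 0, &a, &b, &c, &d) || !(d & (1u << 15)))
        return CORE_UNKNOWN;
    if (!__get_cpuid_count(0x1A, 0, &a, &b, &c, &d))
        return CORE_UNKNOWN;
    return decode_core_type(a);
#else
    return CORE_UNKNOWN;
#endif
}

/* Small usage example for the decoder. */
static void decode_selfcheck(void)
{
    assert(decode_core_type(0x40000000u) == CORE_CORE);
    assert(decode_core_type(0x20000000u) == CORE_ATOM);
}
```

The answer is only valid for the core the thread happens to be on, so the thread would first be pinned to each logical CPU in turn (for example with `SetThreadAffinityMask` under Windows) and `current_core_kind()` called after each pin.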
Terje

In the meantime, I did a few measurements on a Xeon E3 1271 v3. That is
a rather old uArch: Haswell, the first core that supports AVX2. During
the tests it was running at 4.0 GHz.
1. Original code (rewritten in plain C) compiled with clang -O3
   -march=ivybridge (no AVX2)
2. Original code (rewritten in plain C) compiled with clang -O3
   -march=haswell (AVX2)
3. Manually vectorized AVX2 code compiled with clang -O3
   -march=skylake (AVX2)
Results were as follows (usec/call):
1 - 5.66
2 - 5.56
3 - 2.18
So my measurements, like yours, show that clang's autovectorized code
looks good but does not perform well: enabling AVX2 gains less than 2%
(5.66 vs 5.56), while the manual AVX2 version is about 2.5 times faster
(5.56 vs 2.18).
Here is my manual code. Handling of the tail is too clever. I did not
have time to simplify. Otherwise, for 250x250 it should perform about
the same as simpler code.
#include <stdint.h>
#include <immintrin.h>

int foo_tst(const uint32_t* keylocks, int len, int li)
{
  if (li >= len || li <= 0)
    return 0;

  const uint32_t* keyx = &keylocks[li];
  unsigned ni = len - li;
  __m256i res0 = _mm256_setzero_si256();
  __m256i res1 = _mm256_setzero_si256();
  __m256i res2 = _mm256_setzero_si256();
  __m256i res3 = _mm256_setzero_si256();
  const uint32_t* keyx_last = &keyx[ni & -32];
  for (; keyx != keyx_last; keyx += 32) {
    __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyx[0*8]);
    __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyx[1*8]);
    __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyx[2*8]);
    __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyx[3*8]);
    // for (int k = 0; k < li; ++k) {
    // for (int k = 0, nk = li; nk > 0; ++k, --nk) {
    for (const uint32_t* keyy = keylocks; keyy != &keylocks[li]; ++keyy) {
      // __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)&keylocks[k]));
      __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyy));
      res0 = _mm256_sub_epi32(res0,
        _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
      res1 = _mm256_sub_epi32(res1,
        _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
      res2 = _mm256_sub_epi32(res2,
        _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
      res3 = _mm256_sub_epi32(res3,
        _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
    }
  }

  int res = 0;
  if (ni % 32) {
    uint32_t tmp[32];
    const uint32_t* keyy_last = &keylocks[li & -32];
    if (li % 32) {
      for (int k = 0; k < li % 32; ++k)
        tmp[k] = keyy_last[k];
      for (int k = li % 32; k < 32; ++k)
        tmp[k] = (uint32_t)-1;
    }
    const uint32_t* keyx_last = &keyx[ni % 32];
    int nz = 0;
    for (; keyx != keyx_last; keyx += 1) {
      if (*keyx) {
        __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyx));
        for (const uint32_t* keyy = keylocks; keyy != keyy_last; keyy += 32) {
          __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyy[0*8]);
          __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyy[1*8]);
          __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyy[2*8]);
          __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyy[3*8]);
          res0 = _mm256_sub_epi32(res0,
            _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
          res1 = _mm256_sub_epi32(res1,
            _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
          res2 = _mm256_sub_epi32(res2,
            _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
          res3 = _mm256_sub_epi32(res3,
            _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
        }
        if (li % 32) {
          __m256i lock0 = _mm256_loadu_si256((const __m256i*)&tmp[0*8]);
          __m256i lock1 = _mm256_loadu_si256((const __m256i*)&tmp[1*8]);
          __m256i lock2 = _mm256_loadu_si256((const __m256i*)&tmp[2*8]);
          __m256i lock3 = _mm256_loadu_si256((const __m256i*)&tmp[3*8]);
          res0 = _mm256_sub_epi32(res0,
            _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
          res1 = _mm256_sub_epi32(res1,
            _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
          res2 = _mm256_sub_epi32(res2,
            _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
          res3 = _mm256_sub_epi32(res3,
            _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
        }
      } else {
        nz += 1;
      }
    }
    res = nz * li;
  }

  // fold accumulators
  res0 = _mm256_add_epi32(res0, res2);
  res1 = _mm256_add_epi32(res1, res3);
  res0 = _mm256_add_epi32(res0, res1);
  res0 = _mm256_hadd_epi32(res0, res0);
  res0 = _mm256_hadd_epi32(res0, res0);
  res += _mm256_extract_epi32(res0, 0);
  res += _mm256_extract_epi32(res0, 4);
  return res;
}
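
Reading the routine above, it appears to count the pairs (y, x), with y in [0, li) and x in [li, len), whose 32-bit masks have no set bit in common. Under that reading, a plain scalar version is handy for cross-checking the vectorized tail handling. The name `foo_ref` and the self-check are made up for this sketch and are not part of the posted code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference for foo_tst: counts pairs (y, x),
   y in [0, li), x in [li, len), with (keylocks[x] & keylocks[y]) == 0. */
int foo_ref(const uint32_t *keylocks, int len, int li)
{
    if (li >= len || li <= 0)
        return 0;
    int res = 0;
    for (int x = li; x < len; ++x)
        for (int y = 0; y < li; ++y)
            res += (keylocks[x] & keylocks[y]) == 0;
    return res;
}

/* Usage example: entries 0..1 form the first group, entries 2..3 the second. */
static void foo_ref_selfcheck(void)
{
    const uint32_t v[4] = { 0x1, 0x2, 0x1, 0x4 };
    /* 0x1 vs {0x1, 0x2}: one disjoint pair; 0x4 vs {0x1, 0x2}: two. */
    assert(foo_ref(v, 4, 2) == 3);
}
```

Comparing `foo_ref` with `foo_tst` for lengths and split points that are not multiples of 32 would exercise both tail paths, including the case of an all-zero entry that the vector code handles through `nz`.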