Michael S <already5chosen@yahoo.com> writes:
>On Sat, 08 Feb 2025 08:11:04 GMT
>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>Or by my own pasting mistake. I am still not sure whom to blame.
>The mistake was tiny - the absence of // at the beginning of one
>line, but enough to keep it from compiling. Trying it for a second
>time:
Now it's worse: it's quoted-printable. E.g.:
if (li >=3D len || li <=3D 0)
Some newsreaders can decode this, mine does not.
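For anyone whose software shows the escapes raw: quoted-printable
encodes '=' as '=3D' and marks soft line breaks with a trailing '=',
and decoding those two cases is all it takes here. A minimal sketch
(an illustration only, not a full RFC 2045 decoder):

#include <stdio.h>
#include <ctype.h>

/* Minimal quoted-printable decoder sketch: handles "=XX" hex escapes
   (e.g. "=3D" -> '=') and "=\n" soft line breaks; everything else is
   passed through unchanged. */
static int hexval(int c)
{
  return isdigit(c) ? c - '0' : toupper(c) - 'A' + 10;
}

int main(void)
{
  int c;
  while ((c = getchar()) != EOF) {
    if (c != '=') {
      putchar(c);
      continue;
    }
    int h1 = getchar();
    if (h1 == '\n' || h1 == EOF)      /* soft line break: drop it */
      continue;
    int h2 = getchar();
    if (isxdigit(h1) && h2 != EOF && isxdigit(h2))
      putchar(hexval(h1) * 16 + hexval(h2));
    else {                            /* not a valid escape: pass through */
      putchar('=');
      putchar(h1);
      if (h2 != EOF) putchar(h2);
    }
  }
  return 0;
}

Piping the garbled text through that recovers the line as
if (li >= len || li <= 0).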
>>First cycles (which eliminates worries about turbo modes) and
>>instructions, then usec/call.
>
>I don't understand that.
>For original code optimized by clang I'd expect 22,000 cycles and
>5.15 usec per call on Haswell. Your numbers don't even resemble
>anything like that.
My cycle numbers are for the whole program that calls keylocks()
100_000 times.
If you divide the cycles by 100000, you get 21954 for clang
keylocks1-256, which is what you expect.
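In case the shape of the measurement is unclear, here is a minimal
sketch of such a harness; keylocks() is replaced by a dummy stand-in,
and __rdtsc() only gives reference cycles, so the per-call core-cycle
numbers above would really come from the performance counters rather
than from this:

#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() */

#define CALLS 100000

/* Stand-in for the real keylocks() from earlier in the thread; any
   function with a stable result works for showing the harness. */
static long keylocks(void)
{
  volatile long x = 0;
  for (int i = 0; i < 1000; i++)
    x += i;
  return x;
}

int main(void)
{
  long result = 0;
  unsigned long long start = __rdtsc();
  for (int i = 0; i < CALLS; i++)
    result += keylocks();
  unsigned long long tsc = __rdtsc() - start;
  /* The TSC counts reference cycles, not core clock cycles; whatever
     counter you read, divide it by CALLS the same way to get the
     per-call figure (e.g. 2_195_400_000/100_000 = 21954). */
  printf("%llu reference cycles total, %llu per call (result %ld)\n",
         tsc, tsc / CALLS, result);
  return 0;
}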
>>             instructions
>>5_779_542_242  gcc   avx2  1
>>3_484_942_148  gcc   avx2  2   8
>>5_885_742_164  gcc   avx2  3   8
>>7_903_138_230  clang avx2  1
>>7_743_938_183  clang avx2  2   8?
>>3_625_338_104  clang avx2  3   8?
>>4_204_442_194  gcc   512   1
>>2_564_142_161  gcc   512   2   32
>>3_061_042_178  gcc   512   3   16
>>7_703_938_205  clang 512   1
>>3_402_238_102  clang 512   2   16?
>>3_320_455_741  clang 512   3   16?
>
>I don't understand these numbers either. For original clang, I'd
>expect 25,000 instructions per call.
clang keylocks1-256 performs 79031 instructions per call (divide the
number given by 100000 calls). If you want to see why that is, you
need to analyse the code produced by clang, which I did only for
select cases.
>Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
>Which could be due to differences in measurement methodology - I
>reported the median of 11 runs, you seem to report an average.
I just report one run with 100_000 calls and hope that the variation
is small:-) In my last refereed paper I used 30 runs and the median,
but I don't go to such lengths here; the cycles seem pretty
repeatable.
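Going from one run to a median would only take a few more lines; a
sketch with wall-clock time and 11 runs (the benchmark body here is a
placeholder, not the actual keylocks loop):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define RUNS 11

/* Placeholder benchmark body; in the real setup this would be the
   100_000-call loop around keylocks(). */
static void run_benchmark(void)
{
  volatile long x = 0;
  for (long i = 0; i < 10000000; i++)
    x += i;
}

static int cmp_double(const void *a, const void *b)
{
  double d = *(const double *)a - *(const double *)b;
  return (d > 0) - (d < 0);
}

int main(void)
{
  double t[RUNS];
  for (int r = 0; r < RUNS; r++) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    run_benchmark();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    t[r] = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  }
  qsort(t, RUNS, sizeof t[0], cmp_double);
  printf("median of %d runs: %.6f s (min %.6f, max %.6f)\n",
         RUNS, t[RUNS / 2], t[0], t[RUNS - 1]);
  return 0;
}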
>>On the Golden Cove of a Core i3-1315U (compared to the best result
>>by Terje Mathisen on a Core i7-1365U; the latter can run up to
>>5.2GHz according to Intel, whereas the former can supposedly run
>>up to 4.5GHz; I only ever measured at most 3.8GHz on our NUC, and
>>this time as well):
>
>I always thought that NUCs have better cooling than all but high-end
>laptops. Was I wrong? Such slowness is disappointing.
The cooling may be better or not; that does not come into play here,
as the CPU never reaches higher clocks even when it's cold. The
E-cores also stay 700MHz below their rated turbo speed, even when one
of them is the only loaded core. One theory I have is that an option
we set in the BIOS has the effect of limiting turbo speed, but it has
not been important enough to test.
>>5.25us Terje Mathisen's Rust code compiled by clang (best on the
>>       1365U)
>>4.93us clang keylocks1-256 on a 3.8GHz 1315U
>>4.17us gcc keylocks1-256 on a 3.8GHz 1315U
>>3.16us gcc keylocks2-256 on a 3.8GHz 1315U
>>2.38us clang keylocks2-512 on a 3.8GHz 1315U
>
>So, for the best-performing variant, the IPC of Golden Cove is
>identical to ancient Haswell?
Actually worse:
For clang keylocks2-512 Haswell has 3.73 IPC, Golden Cove 3.63.
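(IPC here is simply instructions divided by cycles from the same run;
e.g., taking the per-call numbers quoted above for clang
keylocks1-256, 79031 instructions over 21954 cycles is about 3.6,
assuming both counts come from the same machine.)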
>That's very disappointing. Haswell has a 4-wide front end, and the
>majority of AVX2 integer instructions are limited to a throughput of
>two per clock. Golden Cove has a 5+ wide front end, and nearly all
>AVX2 integer instructions have a throughput of three per clock.
>Could it be that clang introduced some sort of latency bottleneck?
As far as I looked into the code, I did not see such a bottleneck.
Also, Zen4 has significantly higher IPC on this variant (5.36 IPC for
clang keylocks2-256), and I expect that it would suffer from a general
latency bottleneck, too. Rocket Lake is also faster on this program
than Haswell and Golden Cove. It seems to be just that this program
rubs Golden Cove the wrong way.
>>I would have expected the clang keylocks1-256 to run slower,
>>because the compiler back-end is the same and the 1315U is slower.
>>Measuring cycles looks more relevant for this benchmark to me
>>than measuring time, especially on this core where AVX-512 is
>>disabled and there is no AVX slowdown.
>
>I prefer time, because in the end it's the only thing that matters.
True, and certainly, when stuff like AVX-512 license-based
downclocking or thermal or power limits come into play (and are
relevant for the measurement at hand), one has to go there. But then
you can only compare code running on the same kind of machine,
configured the same way. Or maybe just running on the same
machine:-). But then, the generality of the results is questionable.
- anton