anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
For:
>
unsigned long inner(unsigned long li, unsigned lock, unsigned keylocks[], unsigned long part1)
{
  unsigned long k;
  for (k=0; k<li; k++)
    part1 += (lock & keylocks[k])==0;
  return part1;
}
>
gcc -Wall -O3 -mavx2 -c x.c && objdump -d x.o
>
produces 109 lines of disassembly output (which I will spare you),
with a total length of 394 bytes.
...
clang is somewhat better:
>
For the avx2 case, 70 lines and 250 bytes.
I have now taken a closer look at the generated code. Let's first
consider the clang code:
The inner loop works on 16 elements of keylocks[] per iteration, 4 per
SIMD instruction. It uses AVX-128 instructions (with 32-bit elements)
for the (lock & keylocks[k])==0 part, then converts that to 64-bit
elements and continues with AVX-256 instructions for the part1+= part;
it does that by zero-extending the elements, anding them with 1, and
adding them to the accumulators. It would have saved instructions to
sign-extend the elements and subtract them from the accumulators.
Before the inner loop there is just setup, with no attempt to align
the input. After the inner loop, the results in the 4 ymm registers
are summed up (with maximal latency), then the 4 results inside the
ymm register are summed up, and finally a scalar loop counts the
remaining items.
Now gcc:
The inner loop works on 8 elements of keylocks[] per iteration,
initially with 8 32-bit elements per SIMD instruction (AVX-256), later
with 4 64-bit elements per SIMD instruction. It ands, then
zero-extends (so it needs only one and instead of two for anding after
the extension), ands and adds; it then adds the two result ymm
registers to one ymm accumulator (in a latency-minimizing way).
Again, there is no alignment code before the inner loop, just setup.
The elements of the ymm accumulator are summed up into one result.
Then, if there are 4 elements or more remaining, it performs the setup
and SIMD code for working on 4 elements with AVX-128 code, including
after the width expansion; so it uses two AVX-128 instructions for
performing the zero-extension rather than one AVX-256 instruction.
It also sets up all the constants in the xmm registers (again, in
case control arrives from the inner loop); it finally sums up the result and adds
it to the result of the inner loop. Finally, the last 0-3 iterations
are performed in a scalar loop.
The gcc-12 code is more sophisticated in several respects than the
clang-14 code, but it also leaves several obvious improvements open: it
could use sign-extension and subtraction (clang, too); the handling of
the final 4-7 items in gcc could work with 4-element-SIMD throughout
and reuse the contents of the registers that hold constants, and the
result could be added to the SIMD accumulator, summing that up only
afterwards. Maybe the gcc maintainers have performed some of these
improvements in the meantime.
Given the description of Terje Mathisen, my guess is that part1 and li
have 32-bit types, so the C equivalent would be:
unsigned inner(unsigned li, unsigned lock, unsigned keylocks[], unsigned part1)
{
  unsigned k;
  for (k=0; k<li; k++)
    part1 += (lock & keylocks[k])==0;
  return part1;
}
With that I see shorter code from both gcc and clang, which is
unsurprising given the complications that fall away.
I looked also at the code that the compilers produce with
-march=x86-64-v4 (which includes AVX-512) for the latter function.
clang generates code that uses neither (512-bit) zmm registers nor the
k (predicate) registers. It is different from and longer than the
-mavx2 code, though.
gcc uses both zmm and k registers. The inner loop is quite small:
40: 62 f1 65 48 db 08 vpandd (%rax),%zmm3,%zmm1
46: 48 83 c0 40 add $0x40,%rax
4a: 62 f2 76 48 27 c9 vptestnmd %zmm1,%zmm1,%k1
50: 62 f1 7d c9 6f ca vmovdqa32 %zmm2,%zmm1{%k1}{z}
56: 62 f1 7d 48 fa c1 vpsubd %zmm1,%zmm0,%zmm0
5c: 48 39 c2 cmp %rax,%rdx
5f: 75 df jne 40 <inner+0x40>
It works on 16 elements per iteration. It uses vptestnmd and
vmovdqa32 to set zmm1 to -1 (zmm2 is set to -1 in the setup) or 0
(using the {z} option); it's unclear to me why it does not use
vpcmpeqd, which would have produced the same result in one
instruction. Without the intervening extension, the compiler is able
to use a subtraction of -1.
The inner loop is preceded by setup (again, no alignment). It is
followed by summing the results and adding them to a 32-bit
accumulator, then by an optional use of AVX-256 for 8 elements (again
summing the results and adding them to the accumulator), and finally
by a completely unrolled scalar loop for the last 0-7 elements.
What I would try is to use predication and AVX-512 for the last 1-15
(or maybe 1-16) elements, and only then sum up the results and add the
total to part1. I wonder why the gcc-12 developers did not make use of that
possibility (and looking at gcc-14.2 with godbolt, the code for that
apparently is still pretty similar; the clang-19 code has become
shorter, but still does not use zmm or k registers).
- anton
-- 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.' Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>