Re: Cost of handling misaligned access

Liste des GroupesRevenir à c arch 
Sujet : Re: Cost of handling misaligned access
De : anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Groupes : comp.arch
Date : 06. Feb 2025, 09:52:42
Autres entêtes
Organisation : Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID : <2025Feb6.095242@mips.complang.tuwien.ac.at>
References : 1 2 3 4 5 6 7
User-Agent : xrn 10.11
anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
For:
>
unsigned long inner(unsigned long li, unsigned lock, unsigned keylocks[], unsigned long part1)
{
 unsigned long k;
 for (k=0; k<li; k++)
   part1 += (lock & keylocks[k])==0;
 return part1;
}
>
gcc -Wall -O3 -mavx2 -c x.c && objdump -d x.o
>
produces 109 lines of disassembly output (which I will spare you),
with a total length of 394 bytes.
...
clang is somewhat better:
>
For the avx2 case, 70 lines and 250 bytes.

I have now taken a closer look at the generated code.  Let's first
consider the clang code:

The inner loop works on 16 elements of keylocks[] per iteration, 4 per
SIMD instruction.  It uses AVX-128 instructions (with 32-bit elements)
for the (lock & keylocks[k])==0 part, then converts that to 64-bit
elements and continues with AVX-256 instructions for the part1+= part;
it does that by zero-extending the elements, anding them with 1, and
adding them to the accumulators.  It would have saved instructions to
sign-extend the elements and subtracting them from the accumulators.

Before the inner loop there is just setup, with no attempt to align
the input.  After the inner loop, the results in the 4 ymm registers
are summed up (with maximal latency), then the 4 results inside the
ymm register are summed up, and finally a scalar loop counts the
remaining items.

Now gcc:

The inner loop works on 8 elements of keylocks[] per iteration,
initially with 8 32-bit elements per SIMD instruction (AVX-256), later
with 4 64-bit elements per SIMD instruction.  It ands, then
zero-extends (so it needs only one and instead of two for anding after
the extension), ands and adds, it then adds the two result ymm
register to one ymm accumulator (in a latency-minimizing way).

Again, there is not alignment code before the inner loop, just setup.
The elements of the ymm accumulator are summed up into one result.
Then, if there are 4 elements or more remaining, it performs the setup
and SIMD code for working on 4 elements with AVX-128 code, including
after the width expansion; so it uses two AVX-128 instructions for
performing the zero-extension rather thanm one AVX-256 instruction.
It also sets up all the constants in the xmm registers (again, in case
of coming from the inner loop); it finally sums up the result and adds
it to the result of the inner loop.  Finally, the last 0-3 iterations
are performed in a scalar loop.


The gcc-12 code is more sophisticated in several respects than the
clang-14 code, but also has several obvious ways to improve it: it
could use sign-extension and subtraction (clang, too); the handling of
the final 4-7 items in gcc could work with 4-element-SIMD throughout
and reuse the contents of the registers that hold constants, and the
result could be added to the SIMD accumulator, summing that up only
afterwards.  Maybe the gcc maintainers have performed some of these
improvements in the meantime.


Given the description of Terje Mathisen, my guess is that part1 and li
have 32-bit types, so the C equivalent would be:

unsigned inner(unsigned li, unsigned lock, unsigned keylocks[], unsigned part1)
{
  unsigned k;
  for (k=0; k<li; k++)
    part1 += (lock & keylocks[k])==0;
  return part1;
}

With that I see shorter code from both gcc and clang, unsurprising
given the complications that fall away.


I looked also at the code that the compilers produce with
-march=x86-64-v4 (which includes AVX-512) for the latter function.
clang generates code that does not use (512-bit) zmm registers not the
k (predicate) registers.  It is different from and longer than the
-mavx2 code, though.

gcc uses both zmm and k registers.  The inner loop is quite small:

  40:   62 f1 65 48 db 08       vpandd (%rax),%zmm3,%zmm1
  46:   48 83 c0 40             add    $0x40,%rax
  4a:   62 f2 76 48 27 c9       vptestnmd %zmm1,%zmm1,%k1
  50:   62 f1 7d c9 6f ca       vmovdqa32 %zmm2,%zmm1{%k1}{z}
  56:   62 f1 7d 48 fa c1       vpsubd %zmm1,%zmm0,%zmm0
  5c:   48 39 c2                cmp    %rax,%rdx
  5f:   75 df                   jne    40 <inner+0x40>

It works on 16 elements per iteration.  It uses vptestnmd and
vmovdqa32 to set zmm1 to -1 (zmm2 is set to -1 in the setup) or 0
(using the {z} option); it's unclear to me why it does not use
vpcmpeqd, which would have produced the same result in one
instruction.  Without the intervening extension, the compiler is able
to use a subtration of -1.

The inner loop is preceded by setup (again, no alignment).  It is
followed by summing the results and adding them to a 32-bit
accumulator, than an optional use of AVX-256 for 8 elements (and
summing the results and adding them to the accumulator, and finally by
a completely unrolled scalar loop for the last 0-7 elements.

What I would try is to use predication and AVX-512 for the last 1-15
(or maybe 1-16) elements, and only then sum up the results and add it
to part1.  I wonder why the gcc-12 developers did not make use of that
possibility (and looking at gcc-14.2 with godbolt, the code for that
apparently is still pretty similar; the clang-19 code has become
shorter, but still does not use zmm or k registers).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Date Sujet#  Auteur
2 Feb 25 * Re: Cost of handling misaligned access112BGB
3 Feb 25 +* Re: Cost of handling misaligned access2MitchAlsup1
3 Feb 25 i`- Re: Cost of handling misaligned access1BGB
3 Feb 25 `* Re: Cost of handling misaligned access109Anton Ertl
3 Feb 25  +* Re: Cost of handling misaligned access11BGB
3 Feb 25  i`* Re: Cost of handling misaligned access10Anton Ertl
3 Feb 25  i +- Re: Cost of handling misaligned access1BGB
3 Feb 25  i `* Re: Cost of handling misaligned access8Thomas Koenig
4 Feb 25  i  `* Re: Cost of handling misaligned access7Anton Ertl
4 Feb 25  i   +* Re: Cost of handling misaligned access5Thomas Koenig
4 Feb 25  i   i`* Re: Cost of handling misaligned access4Anton Ertl
4 Feb 25  i   i +* Re: Cost of handling misaligned access2Thomas Koenig
10 Feb 25  i   i i`- Re: Cost of handling misaligned access1Mike Stump
10 Feb 25  i   i `- Re: Cost of handling misaligned access1Mike Stump
4 Feb 25  i   `- Re: Cost of handling misaligned access1MitchAlsup1
3 Feb 25  +* Re: Cost of handling misaligned access3Thomas Koenig
3 Feb 25  i`* Re: Cost of handling misaligned access2BGB
3 Feb 25  i `- Re: Cost of handling misaligned access1MitchAlsup1
4 Feb 25  +* Re: Cost of handling misaligned access41Anton Ertl
5 Feb 25  i`* Re: Cost of handling misaligned access40Terje Mathisen
5 Feb 25  i +* Re: Cost of handling misaligned access4Anton Ertl
5 Feb 25  i i+* Re: Cost of handling misaligned access2Terje Mathisen
6 Feb 25  i ii`- Re: Cost of handling misaligned access1Anton Ertl
6 Feb 25  i i`- Re: Cost of handling misaligned access1Anton Ertl
5 Feb 25  i `* Re: Cost of handling misaligned access35Michael S
6 Feb 25  i  +* Re: Cost of handling misaligned access32Anton Ertl
6 Feb 25  i  i`* Re: Cost of handling misaligned access31Michael S
6 Feb 25  i  i +* Re: Cost of handling misaligned access2Anton Ertl
6 Feb 25  i  i i`- Re: Cost of handling misaligned access1Michael S
6 Feb 25  i  i `* Re: Cost of handling misaligned access28Terje Mathisen
6 Feb 25  i  i  `* Re: Cost of handling misaligned access27Terje Mathisen
6 Feb 25  i  i   `* Re: Cost of handling misaligned access26Michael S
6 Feb 25  i  i    `* Re: Cost of handling misaligned access25Terje Mathisen
6 Feb 25  i  i     +* Re: Cost of handling misaligned access19Michael S
7 Feb 25  i  i     i`* Re: Cost of handling misaligned access18Terje Mathisen
7 Feb 25  i  i     i `* Re: Cost of handling misaligned access17Michael S
7 Feb 25  i  i     i  `* Re: Cost of handling misaligned access16Terje Mathisen
7 Feb 25  i  i     i   `* Re: Cost of handling misaligned access15Michael S
7 Feb 25  i  i     i    +- Re: Cost of handling misaligned access1Terje Mathisen
7 Feb 25  i  i     i    +* Re: Cost of handling misaligned access3MitchAlsup1
8 Feb 25  i  i     i    i+- Re: Cost of handling misaligned access1Terje Mathisen
8 Feb 25  i  i     i    i`- Re: Cost of handling misaligned access1Michael S
8 Feb 25  i  i     i    `* Re: Cost of handling misaligned access10Anton Ertl
8 Feb 25  i  i     i     +- Re: Cost of handling misaligned access1Terje Mathisen
8 Feb 25  i  i     i     +* Re: Cost of handling misaligned access6Michael S
8 Feb 25  i  i     i     i`* Re: Cost of handling misaligned access5Anton Ertl
8 Feb 25  i  i     i     i +- Re: Cost of handling misaligned access1Michael S
9 Feb 25  i  i     i     i +* Re: Cost of handling misaligned access2Michael S
11 Feb 25  i  i     i     i i`- Re: Cost of handling misaligned access1Michael S
9 Feb 25  i  i     i     i `- Re: Cost of handling misaligned access1Michael S
9 Feb 25  i  i     i     +- Re: Cost of handling misaligned access1Michael S
10 Feb 25  i  i     i     `- Re: Cost of handling misaligned access1Michael S
7 Feb 25  i  i     `* Re: Cost of handling misaligned access5BGB
7 Feb 25  i  i      `* Re: Cost of handling misaligned access4MitchAlsup1
7 Feb 25  i  i       `* Re: Cost of handling misaligned access3BGB
8 Feb 25  i  i        `* Re: Cost of handling misaligned access2Anssi Saari
8 Feb 25  i  i         `- Re: Cost of handling misaligned access1BGB
6 Feb 25  i  `* Re: Cost of handling misaligned access2Terje Mathisen
6 Feb 25  i   `- Re: Cost of handling misaligned access1Michael S
6 Feb 25  +* Re: Cost of handling misaligned access5Waldek Hebisch
6 Feb 25  i+* Re: Cost of handling misaligned access3Anton Ertl
6 Feb 25  ii`* Re: Cost of handling misaligned access2Waldek Hebisch
6 Feb 25  ii `- Re: Cost of handling misaligned access1Anton Ertl
6 Feb 25  i`- Re: Cost of handling misaligned access1Terje Mathisen
13 Feb 25  `* Re: Cost of handling misaligned access48Marcus
13 Feb 25   +- Re: Cost of handling misaligned access1Thomas Koenig
14 Feb 25   +* Re: Cost of handling misaligned access41BGB
14 Feb 25   i`* Re: Cost of handling misaligned access40MitchAlsup1
18 Feb 25   i `* Re: Cost of handling misaligned access39BGB
18 Feb 25   i  +* Re: Cost of handling misaligned access33MitchAlsup1
18 Feb 25   i  i+- Re: Cost of handling misaligned access1BGB
18 Feb 25   i  i`* Re: Cost of handling misaligned access31Michael S
18 Feb 25   i  i +- Re: Cost of handling misaligned access1Thomas Koenig
18 Feb 25   i  i +* Re: Cost of handling misaligned access26MitchAlsup1
18 Feb 25   i  i i`* Re: Cost of handling misaligned access25Terje Mathisen
18 Feb 25   i  i i `* Re: Cost of handling misaligned access24MitchAlsup1
19 Feb 25   i  i i  `* Re: Cost of handling misaligned access23Terje Mathisen
19 Feb 25   i  i i   `* Re: Cost of handling misaligned access22MitchAlsup1
19 Feb 25   i  i i    `* Re: Cost of handling misaligned access21BGB
20 Feb 25   i  i i     +- Re: Cost of handling misaligned access1Robert Finch
20 Feb 25   i  i i     +* Re: Cost of handling misaligned access5MitchAlsup1
20 Feb 25   i  i i     i+* Re: Cost of handling misaligned access2BGB
20 Feb 25   i  i i     ii`- Re: Cost of handling misaligned access1BGB
21 Feb 25   i  i i     i`* Re: Cost of handling misaligned access2Robert Finch
21 Feb 25   i  i i     i `- Re: Cost of handling misaligned access1BGB
21 Feb 25   i  i i     `* Re: Cost of handling misaligned access14BGB
22 Feb 25   i  i i      +- Re: Cost of handling misaligned access1Robert Finch
22 Feb 25   i  i i      `* Re: Cost of handling misaligned access12Robert Finch
23 Feb 25   i  i i       +* Re: Cost of handling misaligned access10BGB
23 Feb 25   i  i i       i`* Re: Cost of handling misaligned access9Michael S
24 Feb 25   i  i i       i +- Re: Cost of handling misaligned access1BGB
24 Feb 25   i  i i       i `* Re: Cost of handling misaligned access7Michael S
24 Feb 25   i  i i       i  +* Re: Cost of handling misaligned access4Robert Finch
24 Feb 25   i  i i       i  i+- Re: Cost of handling misaligned access1BGB
24 Feb 25   i  i i       i  i`* Re: Cost of handling misaligned access2MitchAlsup1
25 Feb 25   i  i i       i  i `- Re: Cost of handling misaligned access1BGB
25 Feb 25   i  i i       i  `* Re: Cost of handling misaligned access2MitchAlsup1
25 Feb 25   i  i i       i   `- Re: Cost of handling misaligned access1BGB
23 Feb 25   i  i i       `- Re: Cost of handling misaligned access1Robert Finch
18 Feb 25   i  i `* Re: Cost of handling misaligned access3BGB
19 Feb 25   i  i  `* Re: Cost of handling misaligned access2MitchAlsup1
18 Feb 25   i  `* Re: Cost of handling misaligned access5Robert Finch
17 Feb 25   `* Re: Cost of handling misaligned access5Terje Mathisen

Haut de la page

Les messages affichés proviennent d'usenet.

NewsPortal