Re: Cost of handling misaligned access

Liste des Groupes 
Sujet : Re: Cost of handling misaligned access
De : anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Groupes : comp.arch
Date : 06. Feb 2025, 11:30:49
Autres entêtes
Organisation : Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID : <2025Feb6.113049@mips.complang.tuwien.ac.at>
References : 1 2 3 4 5 6 7 8
User-Agent : xrn 10.11
Terje Mathisen <terje.mathisen@tmsw.no> writes:
Anton Ertl wrote:
If you have ever learned about vectorization, it's easy to see that
the inner loop can be vectorized.  And obviously auto-vectorization
has worked in this case, not particularly amazing to me.
>
I have some (30 years?) experience with auto-vectorization, usually I've
been (very?) disappointed.

I have often been disappointed, too.

As I wrote this was the best I have ever
seen, and the resulting code actually performed extremely close to
theoretical speed of light, i.e. 3 clock cycles for each 3 avx instruction.

What theory is behind that?  If we take the inner loop (from my
version that uses unsigned, not unsigned long) by clang-14.0.6 -O3
-mavx2:

  50:   c5 f5 db 34 0a          vpand  (%rdx,%rcx,1),%ymm1,%ymm6
  55:   c5 f5 db 7c 0a 20       vpand  0x20(%rdx,%rcx,1),%ymm1,%ymm7
  5b:   c5 75 db 44 0a 40       vpand  0x40(%rdx,%rcx,1),%ymm1,%ymm8
  61:   c5 75 db 4c 0a 60       vpand  0x60(%rdx,%rcx,1),%ymm1,%ymm9
  67:   c5 cd 76 f2             vpcmpeqd %ymm2,%ymm6,%ymm6
  6b:   c5 fd fa c6             vpsubd %ymm6,%ymm0,%ymm0
  6f:   c5 c5 76 f2             vpcmpeqd %ymm2,%ymm7,%ymm6
  73:   c5 e5 fa de             vpsubd %ymm6,%ymm3,%ymm3
  77:   c5 bd 76 f2             vpcmpeqd %ymm2,%ymm8,%ymm6
  7b:   c5 dd fa e6             vpsubd %ymm6,%ymm4,%ymm4
  7f:   c5 b5 76 f2             vpcmpeqd %ymm2,%ymm9,%ymm6
  83:   c5 d5 fa ee             vpsubd %ymm6,%ymm5,%ymm5
  87:   48 83 e9 80             sub    $0xffffffffffffff80,%rcx
  8b:   48 39 c8                cmp    %rcx,%rax
  8e:   75 c0                   jne    50 <inner+0x50>

I see that clang uses 4 ymm accumulators (ymm0, ymm3, ymm4, ymm5), so
the recurrences are only 1-cycle recurrences for the 4 vsubd
instructions and the sub instruction.  So, with enough resources, a
CPU core could perform 1 iteration per cycle (and with hardware
reassociation, even faster, but apart from adding constants in Alder
Lake ff., we are not there yet).  But current CPUs do not have that
many resources.  If we want to determine the maximum speed given
resource limits, we have to look at the concrete CPU model.

BTW, an alternative would be to do some summation already in each
iteration, but with still only a one-cycle recurrence.  This could
have looked like this:

vpand  (%rdx,%rcx,1),%ymm1,%ymm6
vpand  0x20(%rdx,%rcx,1),%ymm1,%ymm7
vpand  0x40(%rdx,%rcx,1),%ymm1,%ymm8
vpand  0x60(%rdx,%rcx,1),%ymm1,%ymm9
vpcmpeqd %ymm2,%ymm6,%ymm6
vpcmpeqd %ymm2,%ymm7,%ymm7
vpaddd %ymm6,%ymm7,%ymm7
vpcmpeqd %ymm2,%ymm8,%ymm8
vpcmpeqd %ymm2,%ymm9,%ymm9
vpaddd %ymm8,%ymm9,%ymm9
vpaddd %ymm7,%ymm9,%ymm9
vpsubd %ymm0,%ymm9,%ymm0
sub    $0xffffffffffffff80,%rcx
cmp    %rcx,%rax
jne    50 <inner+0x50>

Here the SIMD recurrence uses ymm0.  This saves having to perform the
summing up of SIMD accumulators after the inner loop.  gcc uses
something like that in one of the codes I looked at.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

Date Sujet#  Auteur
4 Feb 25 * Re: Cost of handling misaligned access40Anton Ertl
5 Feb 25 `* Re: Cost of handling misaligned access39Terje Mathisen
5 Feb 25  +* Re: Cost of handling misaligned access4Anton Ertl
5 Feb 25  i+* Re: Cost of handling misaligned access2Terje Mathisen
6 Feb 25  ii`- Re: Cost of handling misaligned access1Anton Ertl
6 Feb 25  i`- Re: Cost of handling misaligned access1Anton Ertl
5 Feb 25  `* Re: Cost of handling misaligned access34Michael S
6 Feb 25   +* Re: Cost of handling misaligned access32Anton Ertl
6 Feb 25   i`* Re: Cost of handling misaligned access31Michael S
6 Feb 25   i +* Re: Cost of handling misaligned access2Anton Ertl
6 Feb 25   i i`- Re: Cost of handling misaligned access1Michael S
6 Feb 25   i `* Re: Cost of handling misaligned access28Terje Mathisen
6 Feb 25   i  `* Re: Cost of handling misaligned access27Terje Mathisen
6 Feb 25   i   `* Re: Cost of handling misaligned access26Michael S
6 Feb 25   i    `* Re: Cost of handling misaligned access25Terje Mathisen
6 Feb 25   i     +* Re: Cost of handling misaligned access19Michael S
7 Feb 25   i     i`* Re: Cost of handling misaligned access18Terje Mathisen
7 Feb 25   i     i `* Re: Cost of handling misaligned access17Michael S
7 Feb 25   i     i  `* Re: Cost of handling misaligned access16Terje Mathisen
7 Feb 25   i     i   `* Re: Cost of handling misaligned access15Michael S
7 Feb 25   i     i    +- Re: Cost of handling misaligned access1Terje Mathisen
7 Feb 25   i     i    +* Re: Cost of handling misaligned access3MitchAlsup1
8 Feb 25   i     i    i+- Re: Cost of handling misaligned access1Terje Mathisen
8 Feb 25   i     i    i`- Re: Cost of handling misaligned access1Michael S
8 Feb 25   i     i    `* Re: Cost of handling misaligned access10Anton Ertl
8 Feb 25   i     i     +- Re: Cost of handling misaligned access1Terje Mathisen
8 Feb 25   i     i     +* Re: Cost of handling misaligned access6Michael S
8 Feb 25   i     i     i`* Re: Cost of handling misaligned access5Anton Ertl
8 Feb 25   i     i     i +- Re: Cost of handling misaligned access1Michael S
9 Feb 25   i     i     i +* Re: Cost of handling misaligned access2Michael S
11 Feb 25   i     i     i i`- Re: Cost of handling misaligned access1Michael S
9 Feb 25   i     i     i `- Re: Cost of handling misaligned access1Michael S
9 Feb 25   i     i     +- Re: Cost of handling misaligned access1Michael S
10 Feb 25   i     i     `- Re: Cost of handling misaligned access1Michael S
7 Feb 25   i     `* Re: Cost of handling misaligned access5BGB
7 Feb 25   i      `* Re: Cost of handling misaligned access4MitchAlsup1
7 Feb 25   i       `* Re: Cost of handling misaligned access3BGB
8 Feb 25   i        `* Re: Cost of handling misaligned access2Anssi Saari
8 Feb 25   i         `- Re: Cost of handling misaligned access1BGB
6 Feb 25   `- Re: Cost of handling misaligned access1Terje Mathisen

Haut de la page

Les messages affichés proviennent d'usenet.

NewsPortal