Re: Cost of handling misaligned access

Subject: Re: Cost of handling misaligned access
From: terje.mathisen (at) *nospam* tmsw.no (Terje Mathisen)
Newsgroups: comp.arch
Date: 06. Feb 2025, 13:57:12
Organization: A noiseless patient Spider
Message-ID: <vo2bj9$2v5vm$1@dont-email.me>
References: 1 2 3 4 5 6 7
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0 SeaMonkey/2.53.20
Michael S wrote:
> On Wed, 5 Feb 2025 18:10:03 +0100
> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>
>> Anton Ertl wrote:
>>> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>>
>>>> As SIMD no longer requires alignment, presumably code no longer
>>>> does so.
>>>
>>> Yes, if you use AVX/AVX2, you don't encounter this particular Intel
>>> stupidity.
>>
>> Recently, on the last day (Dec 25th) of Advent of Code, I had a
>> problem which lent itself to using 32-bit bitmaps: The task was to
>> check which locks were compatible with which keys, so I ended up with
>> code like this:
>>
>>       let mut part1 = 0;
>>       for l in li..keylocks.len() {
>>           let lock = keylocks[l];
>>           for k in 0..li {
>>               let sum = lock & keylocks[k];
>>               if sum == 0 {
>>                   part1 += 1;
>>               }
>>           }
>>       }
>>
>> Telling the Rust compiler to target my AVX2-capable laptop CPU (an
>> Intel i7), I got code that simply amazed me: The compiler unrolled
>> the inner loop by 32, ANDing 4 x 8 keys with 8 copies of the current
>> lock into 4 AVX registers (vpand), then comparing with a zeroed
>> register (vpcmpeqd) (generating -1/0 results) before subtracting
>> (vpsubd) those from 4 accumulators.
>>
>> This resulted in just 12 instructions to handle 32 tests.
>
> That sounds suboptimal.
> By unrolling the outer loop by 2 or 3 you can greatly reduce the number of
> memory accesses per comparison. The speedup would depend on the specific
> microarchitecture, but I would guess that at least a 1.2x speedup is
> available here. Especially so when the data is not aligned.
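
For illustration, here is a minimal sketch of the outer-loop unrolling suggested above. It is not code from the thread: the function names, and the assumption that keys occupy `keylocks[..li]` and locks occupy `keylocks[li..]` (read off the loop bounds in the quoted code), are mine. Each key is loaded once and tested against two locks, which halves the key loads per comparison.

```rust
/// Baseline: the loop from the quoted post, wrapped in a function.
fn count_baseline(keylocks: &[u32], li: usize) -> u32 {
    let mut part1 = 0;
    for l in li..keylocks.len() {
        let lock = keylocks[l];
        for k in 0..li {
            if lock & keylocks[k] == 0 {
                part1 += 1;
            }
        }
    }
    part1
}

/// Outer loop unrolled by 2 (hypothetical variant): every key that is
/// loaded is compared against two locks before moving on.
fn count_outer_unrolled2(keylocks: &[u32], li: usize) -> u32 {
    let (keys, locks) = keylocks.split_at(li);
    let mut count = 0u32;
    let mut pairs = locks.chunks_exact(2);
    for pair in &mut pairs {
        let (a, b) = (pair[0], pair[1]);
        for &key in keys {
            count += (a & key == 0) as u32;
            count += (b & key == 0) as u32;
        }
    }
    // With an odd number of locks, the last one is handled on its own.
    for &lock in pairs.remainder() {
        for &key in keys {
            count += (lock & key == 0) as u32;
        }
    }
    count
}
```

Whether the compiler vectorizes this form as well as the original, and whether the saved loads show up as a speedup, would have to be measured on the target machine.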
Anton already replied; as he wrote, the total loop overhead is just three instructions, all of which can (and will?) overlap with the AVX instructions.

Due to the combined AVX and 4x unroll, the original scalar code is already unrolled 32x, so the loop overhead can mostly be ignored.
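
To make the 32x figure concrete, here is a portable sketch of the structure described in the quoted text: 8 lanes of 32 bits per 256-bit register, times 4 registers per iteration. The constants and the function name are illustrative only, and there is no guarantee that a compiler turns this exact source into the vpand/vpcmpeqd/vpsubd sequence; it only mirrors the shape of that code in scalar form.

```rust
const LANES: usize = 8; // 32-bit elements in one 256-bit AVX2 register
const UNROLL: usize = 4; // registers processed per loop iteration
const BLOCK: usize = LANES * UNROLL; // 32 key tests per iteration

/// Counts the keys that share no set bit with `lock`,
/// handling 32 keys per main-loop iteration.
fn count_matches(lock: u32, keys: &[u32]) -> u32 {
    // Stands in for the 4 vector accumulators of 8 lanes each.
    let mut acc = [[0u32; LANES]; UNROLL];
    let mut blocks = keys.chunks_exact(BLOCK);
    for block in &mut blocks {
        for (u, group) in block.chunks_exact(LANES).enumerate() {
            for (lane, &key) in group.iter().enumerate() {
                // AND, compare with zero, accumulate: one lane at a time.
                acc[u][lane] += (lock & key == 0) as u32;
            }
        }
    }
    // Horizontal sum of all lanes, done once after the loop.
    let mut total: u32 = acc.iter().flatten().sum();
    // Scalar tail for keys that do not fill a whole block of 32.
    for &key in blocks.remainder() {
        total += (lock & key == 0) as u32;
    }
    total
}
```

The loop counter, compare and branch are paid once per block of 32 tests, which is why the overhead matters so little here.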
If the CPU has enough resources to run more than one 32-byte AVX instruction per cycle, then the same code will allow all four copies to run at the same time, but the timing I see on my laptop (93 ps) corresponds closely to one AVX op/cycle.
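
As a rough sanity check on that last figure, here is the arithmetic as a small sketch. It rests on assumptions that are not spelled out in the post: that 93 ps is the measured time per lock/key test, and that each iteration executes the 12 AVX instructions per 32 tests mentioned in the quoted text. The function name and parameters are hypothetical.

```rust
/// Clock frequency in GHz implied by a measured time per test, given the
/// work per loop iteration and the sustained instructions per cycle.
fn implied_clock_ghz(
    ps_per_test: f64,
    tests_per_iter: f64,
    instrs_per_iter: f64,
    instrs_per_cycle: f64,
) -> f64 {
    let ps_per_iter = ps_per_test * tests_per_iter;
    let cycles_per_iter = instrs_per_iter / instrs_per_cycle;
    let ps_per_cycle = ps_per_iter / cycles_per_iter;
    // 1000 ps per ns, so cycles per ns is the frequency in GHz.
    1000.0 / ps_per_cycle
}
```

Under those assumptions, `implied_clock_ghz(93.0, 32.0, 12.0, 1.0)` gives about 4.03 GHz, while two instructions per cycle would imply about 2.02 GHz. The one-op-per-cycle reading therefore fits a laptop core running near 4 GHz; the actual clock is not stated in the post.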
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
