Re: Cost of handling misaligned access

Liste des GroupesRevenir à c arch 
Sujet : Re: Cost of handling misaligned access
De : terje.mathisen (at) *nospam* tmsw.no (Terje Mathisen)
Groupes : comp.arch
Date : 06. Feb 2025, 13:57:12
Autres entêtes
Organisation : A noiseless patient Spider
Message-ID : <vo2bj9$2v5vm$1@dont-email.me>
References : 1 2 3 4 5 6 7
User-Agent : Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0 SeaMonkey/2.53.20
Michael S wrote:
On Wed, 5 Feb 2025 18:10:03 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
 
Anton Ertl wrote:
EricP <ThatWouldBeTelling@thevillage.com> writes:
  
As SIMD no longer requires alignment, presumably code no longer
does so.
>
Yes, if you use AVX/AVX2, you don't encounter this particular Intel
stupidity.
>
Recently, on the last day (Dec 25th) of Advent of Code, I had a
problem which lent itself to using 32-bit bitmaps: The task was to
check which locks were compatible with which keys, so I ended up with
code like this:
>
>
      let mut part1 = 0;
      for l in li..keylocks.len() {
          let lock = keylocks[l];
          for k in 0..li {
              let sum = lock & keylocks[k];
              if sum == 0 {
                  part1 += 1;
              }
          }
      }
>
Telling the rust compiler to target my AVX2-capable laptop CPU (an
Intel i7), I got code that simply amazed me: The compiler unrolled
the inner loop by 32, ANDing 4 x 8 keys by 8 copies of the current
lock into 4 AVX registers (vpand), then comparing with a zeroed
register (vpcmpeqd) (generating -1/0 results) before subtracting
(vpsubd) those from 4 accumulators.
>
This resulted in just 12 instructions to handle 32 tests.
>
 That sounds suboptimal.
By unrolling outer loop by 2 or 3 you can greatly reduce the number of
memory accesses per comparison. The speed up would depend on specific
microarchiture, but I would guess that at least 1.2x speedup is here.
Especially so when data is not aligned.
Anton already replied, as he wrote the total loop overhead is just three instructions, all of which can (& will?) overlap with the AVX instructions.
Due to the combined AVX and 4x unroll, the original scalar code is alreayd unrolled 32 x, so the loop overhead can mostly be ignored.
If the cpu has enough resources to run more than one 32-byte AVX instruction per cycle, then the same code will allow all four copies to run at the same time, but the timing I see on my laptop (93 ps) corresponds closely to one AVX op/cycle.
Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Date Sujet#  Auteur
3 Feb 25 o Re: Cost of handling misaligned access106Anton Ertl

Haut de la page

Les messages affichés proviennent d'usenet.

NewsPortal