Re: Cost of handling misaligned access

Subject: Re: Cost of handling misaligned access
From: already5chosen (at) *nospam* yahoo.com (Michael S)
Newsgroups: comp.arch
Date: 08 Feb 2025, 18:21:19
Organization: A noiseless patient Spider
Message-ID: <20250208192119.0000148e@yahoo.com>
References: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
User-Agent: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
On Sat, 08 Feb 2025 08:11:04 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

Michael S <already5chosen@yahoo.com> writes:
In the meantime, I did a few measurements on a Xeon E3 1271 v3. That
is a rather old uArch - Haswell, the first core that supports AVX2.
During the tests it was running at 4.0 GHz.

1. Original code (rewritten in plain C) compiled with clang -O3
   -march=ivybridge (no AVX2)
2. Original code (rewritten in plain C) compiled with clang -O3
   -march=haswell (AVX2)
3. Manually vectorized AVX2 code compiled with clang -O3
   -march=skylake (AVX2)

Results were as follows (usec/call):
1 - 5.66
2 - 5.56
3 - 2.18
 
In the meantime, I also wrote the original code in plain C
(keylocks1.c), then implemented your idea of unrolling the outer loop
and comparing a subarray of locks to each key (is this called strip
mining?) in plain C (with the hope that auto-vectorization works) as
keylocks2.c, and finally rewrote the latter version to use gcc vector
extensions (keylocks3.c).  I wrote a dummy main around that that calls
the routine 100_000 times; given that the original routine's
performance does not depend on the data, and I used non-0 keys (so
keylocks[23].c does not skip any keys), the actual data is not
important.
 

I used keys filled with pseudo-random bits, with the probability of a
0 bit being 0.705. The probability was chosen to get final results
similar to Terje's.
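For reference, generation along these lines matches that description
(a minimal sketch; the function name gen_keys, the RNG, and the
seeding are illustrative assumptions, not the actual test code):

#include <stdint.h>
#include <stdlib.h>

/* Fill keys[] with 32-bit words in which each bit is 0 with
   probability p0 (here p0 = 0.705).  rand() is only for
   illustration; any PRNG will do. */
static void gen_keys(uint32_t *keys, unsigned nkeys, double p0)
{
  for (unsigned i = 0; i < nkeys; i++) {
    uint32_t k = 0;
    for (int b = 0; b < 32; b++)
      if ((double)rand() / RAND_MAX >= p0)  /* bit is 1 with prob 1-p0 */
        k |= (uint32_t)1 << b;
    keys[i] = k;
  }
}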


You can find the source code and the binaries I measured at
<http://www.complang.tuwien.ac.at/anton/keylock/>.  The binaries were
compiled with gcc 12.2.0 and (in the clang subdirectory) clang-14.0.6;
the clang compilations sometimes used different UNROLL factors than
the gcc compilations (and I am unsure which; see below).
 
The original code is:
 
unsigned keylocks(unsigned keys[], unsigned nkeys,
                  unsigned locks[], unsigned nlocks) {
  unsigned i, j;
  unsigned part1 = 0;
  for (i=0; i<nlocks; i++) {
    unsigned lock = locks[i];
    for (j=0; j<nkeys; j++)
      part1 += (lock & keys[j])==0;
  }
  return part1;
}
 
For keylocks2.c the central loops are:
 
  for (i=0; i<UNROLL; i++)
    part0[i]=0;
  for (i=0; i<nlocks1; i+=UNROLL) {
    for (j=0; j<nkeys1; j++) {
      unsigned key = keys1[j];
      for (k=0; k<UNROLL; k++)
        part0[k] += (locks1[i+k] & key)==0;
    }
  }
 
For UNROLL I tried 8, 16, and 32 for AVX2 and 16, 32, or 64 for
AVX-512; the numbers below are for those factors that produce the
lowest cycles on the Rocket Lake machine.
 
The central loops are preceded by code to arrange the data such that
this code works: the locks are copied to the longer array locks1,
whose length is a multiple of UNROLL; the entries beyond nlocks are ~0
(which increases the count by 0, because keys1 contains no 0 keys);
and the keys are copied to keys1 with 0 keys removed, so that the
extra locks are not counted (which also may increase efficiency if
there is a key=0).  The central loops are followed by summing up the
elements of part0.
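A minimal sketch of that arrangement step (the function name arrange
and the exact copying strategy are assumptions; UNROLL is the
compile-time factor from above):

/* Pad locks into locks1 (length rounded up to a multiple of UNROLL,
   padded with ~0) and copy the non-zero keys into keys1.
   Returns the padded lock count; writes the new key count. */
static unsigned arrange(const unsigned *locks, unsigned nlocks,
                        const unsigned *keys, unsigned nkeys,
                        unsigned *locks1, unsigned *keys1,
                        unsigned *nkeys1)
{
  unsigned i, nlocks1 = (nlocks + UNROLL - 1) / UNROLL * UNROLL;
  for (i = 0; i < nlocks; i++)
    locks1[i] = locks[i];
  for (; i < nlocks1; i++)
    locks1[i] = ~0U;          /* ~0 & key != 0 for every non-zero key */
  unsigned n = 0;
  for (i = 0; i < nkeys; i++)
    if (keys[i] != 0)         /* a 0 key would match the ~0 padding */
      keys1[n++] = keys[i];
  *nkeys1 = n;
  return nlocks1;
}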
 
keylocks3.c, which uses the gcc vector extensions, just changes
keylocks2.c in a few places.  In particular, it adds a type vu:
 
typedef unsigned vu __attribute__ ((vector_size
(UNROLL*sizeof(unsigned))));
 
The central loops now look as follows:
 
  for (i=0; i<UNROLL; i++)
    part0[i]=0;
  for (i=0; i<nlocks1; i+=UNROLL) {
    vu lock = *(vu *)(locks1+i);
    for (j=0; j<nkeys1; j++) {
      part0 -= (lock & keys1[j])==0;
    }
  }
 
One interesting aspect of the gcc vector extensions is that the result
of comparing two vectors is 0 (false) or ~0 (true) (per element),
whereas for scalars the value for true is 1.  Therefore the code above
updates part0 with -=, whereas in keylocks2.c += is used.
 
While the use of ~0 is a good choice when designing a new programming
language, I would have gone for 1 in the case of a vector extension
for C, for consistency with the scalar case; in combination with
hardware that produces ~0 (e.g., Intel SSE and AVX SIMD stuff), that
means that the compiler will introduce a negation in its intermediate
representation at some point; I expect that compilers will usually be
able to optimize this negation away, but I would not be surprised at
cases where my expectation is disappointed.
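A minimal illustration of the two semantics (the helper demo and the
signed vector type vi are mine; vi is the element-wise signed integer
type gcc gives to vector comparison results):

typedef unsigned vu __attribute__ ((vector_size (8*sizeof(unsigned))));
typedef int      vi __attribute__ ((vector_size (8*sizeof(int))));

void demo(void)
{
  vu v = {0, 1, 2, 3, 4, 5, 6, 7};
  vi m = (v == 0);        /* per element: ~0 if true, 0 if false      */
  int s = (v[0] == 0);    /* scalar comparison: 1 if true, 0 if false */
  /* hence keylocks3.c counts with part0 -=, keylocks2.c with part0 += */
  (void)m; (void)s;
}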
 
keylocks3.c compiles without warning on clang, but the result usually
segfaults (but sometimes does not, e.g., in the timed run on Zen4; it
segfaults in other runs on Zen4).  I have not investigated why this
happens, I just did not include results from runs where it segfaulted;
and I tried additional runs for keylocks3-512 on Zen4 in order to have
one result there.
 
I would have liked to compare the performance of my code against your
code, but your code apparently was destroyed by arbitrary line
breaking in your news-posting software.

Or by my own pasting mistake. I am still not sure whom to blame.
The mistake was tiny - the absence of // at the beginning of one line,
but enough to keep it from compiling. Trying a second time:

#include <stdint.h>
#include <immintrin.h>

int foo_tst(const uint32_t* keylocks, int len, int li)
{
  if (li >= len || li <= 0)
    return 0;
  const uint32_t* keyx = &keylocks[li];
  unsigned ni = len - li;
  __m256i res0 = _mm256_setzero_si256();
  __m256i res1 = _mm256_setzero_si256();
  __m256i res2 = _mm256_setzero_si256();
  __m256i res3 = _mm256_setzero_si256();
  const uint32_t* keyx_last = &keyx[ni & -32];
  for (; keyx != keyx_last; keyx += 32) {
    __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyx[0*8]);
    __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyx[1*8]);
    __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyx[2*8]);
    __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyx[3*8]);
    for (const uint32_t* keyy = keylocks; keyy != &keylocks[li]; ++keyy) {
      __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyy));
      res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
      res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
      res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
      res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
    }
  }
  int res = 0;
  if (ni % 32) { // tail: fewer than 32 keyx entries remain
    uint32_t tmp[32];
    const uint32_t* keyy_last = &keylocks[li & -32];
    if (li % 32) { // pad the keyy tail with ~0, which matches nothing
      for (int k = 0; k < li % 32; ++k)
        tmp[k] = keyy_last[k];
      for (int k = li % 32; k < 32; ++k)
        tmp[k] = (uint32_t)-1;
    }
    const uint32_t* keyx_last = &keyx[ni % 32];
    int nz = 0;
    for (; keyx != keyx_last; keyx += 1) {
      if (*keyx) {
        __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyx));
        for (const uint32_t* keyy = keylocks; keyy != keyy_last; keyy += 32) {
          __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyy[0*8]);
          __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyy[1*8]);
          __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyy[2*8]);
          __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyy[3*8]);
          res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
          res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
          res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
          res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
        }
        if (li % 32) {
          __m256i lock0 = _mm256_loadu_si256((const __m256i*)&tmp[0*8]);
          __m256i lock1 = _mm256_loadu_si256((const __m256i*)&tmp[1*8]);
          __m256i lock2 = _mm256_loadu_si256((const __m256i*)&tmp[2*8]);
          __m256i lock3 = _mm256_loadu_si256((const __m256i*)&tmp[3*8]);
          res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
          res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
          res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
          res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
        }
      } else {
        nz += 1; // a zero keyx entry matches all li keyy entries
      }
    }
    res = nz * li;
  }
  // fold accumulators
  res0 = _mm256_add_epi32(res0, res2);
  res1 = _mm256_add_epi32(res1, res3);
  res0 = _mm256_add_epi32(res0, res1);
  res0 = _mm256_hadd_epi32(res0, res0);
  res0 = _mm256_hadd_epi32(res0, res0);

  res += _mm256_extract_epi32(res0, 0);
  res += _mm256_extract_epi32(res0, 4);
  return res;
}
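
For reference, a driver in the spirit of the dummy main described
above might look like this (NKEYS, NLOCKS, and the fill pattern are
assumptions; only the 100_000 call count is from Anton's description):

#include <stdint.h>
#include <stdio.h>

#define NKEYS  250
#define NLOCKS 250
#define CALLS  100000

int foo_tst(const uint32_t* keylocks, int len, int li);

int main(void)
{
  static uint32_t keylocks[NKEYS + NLOCKS];
  /* first NKEYS entries are keys, the rest are locks */
  for (int i = 0; i < NKEYS + NLOCKS; i++)
    keylocks[i] = 0x9E3779B9u * (i + 1);   /* arbitrary non-zero data */
  unsigned sum = 0;
  for (int r = 0; r < CALLS; r++)
    sum += foo_tst(keylocks, NKEYS + NLOCKS, NKEYS);
  printf("%u\n", sum);
  return 0;
}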


Anyway, here are my results.
First cycles (which eliminates worries about turbo modes) and
instructions, then usec/call.
 

I don't understand that.
For the original code optimized by clang I'd expect 22,000 cycles and
5.15 usec per call on Haswell. Your numbers don't even resemble
anything like that.

The cores are:
 

<snip>

The number of instructions executed is
(reported on the Zen4):
 
instructions   compiler ISA   keylocks UNROLL
5_779_542_242  gcc      avx2  1
3_484_942_148  gcc      avx2  2        8
5_885_742_164  gcc      avx2  3        8
7_903_138_230  clang    avx2  1
7_743_938_183  clang    avx2  2        8?
3_625_338_104  clang    avx2  3        8?
4_204_442_194  gcc      512   1
2_564_142_161  gcc      512   2        32
3_061_042_178  gcc      512   3        16
7_703_938_205  clang    512   1
3_402_238_102  clang    512   2        16?
3_320_455_741  clang    512   3        16?
 

I don't understand these numbers either. For the original clang
version, I'd expect 25,000 instructions per call, or 33,000 if, unlike
in Terje's Rust case, your clang generates RISC-style sequences. Your
number is somehow 240,000 times bigger.

For gcc -mavx2 on keylocks3.c on Zen4 an IPC of 6.44 is reported,
while microarchitecture descriptions report only a 6-wide renamer
<https://chipsandcheese.com/p/amds-zen-4-part-1-frontend-and-execution-engine>.
My guess is that the front end combined some instructions (maybe
compare and branch) into a macro-op, and the renamer then processed 6
macro-ops that represented more instructions.  The inner loop is
 
       │190:   vpbroadcastd (%rax),%ymm0
  1.90 │       add          $0x4,%rax
       │       vpand        %ymm2,%ymm0,%ymm0
  1.09 │       vpcmpeqd     %ymm3,%ymm0,%ymm0
  0.41 │       vpsubd       %ymm0,%ymm1,%ymm1
 78.30 │       cmp          %rdx,%rax
       │       jne          190
 
and if the cmp and jne are combined into one macro-op, that would be
perfect for executing one iteration per cycle.
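As a back-of-the-envelope check:

  7 instructions/iteration; cmp+jne fused -> 6 macro-ops/iteration
  1 iteration/cycle -> 6 macro-ops/cycle, matching the 6-wide renamer
  but 7 instructions/cycle as counted, so a measured IPC between 6 and
  7 (here 6.44, diluted by the code outside this loop) is consistent.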
 
It's interesting that gcc's keylocks2-256 results in far fewer
instructions (and eventually, cycles).  It unrolls the inner loop 8
times to process the keys in SIMD fashion, too, loading the keys one
ymm register at a time.  In order to do that it arranges the locks in
8 different ymm registers in the outer loop, so the inner loop
performs 8 sequences similar to
 
vpand        %ymm0,%ymm15,%ymm2
vpcmpeqd     %ymm1,%ymm2,%ymm2
vpsubd       %ymm2,%ymm4,%ymm4
 
surrounded by
 
300:   vmovdqu      (%rsi),%ymm0
       add          $0x20,%rsi
       [8 3-instruction sequences]
       cmp          %rsi,%rdx
       jne          300
 
It also uses 8 ymm accumulators, so not all of that fits into
registers, and three of the anded values are stored on the stack.  For
Zen4 this could be improved by using only 2 accumulators.  In any
case, the gcc people did something clever here, and I do not
understand how they got there from the source code, and why they did
not get there from keylocks1.c.
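In intrinsics, the transformation appears to be roughly the following
(a sketch reconstructed from the assembly, not gcc's actual output;
group8 and the parameter names are mine):

#include <immintrin.h>

/* What gcc's keylocks2-256 loops appear to compute for one group of
   8 locks; nkeys1 must be a multiple of 8 here. */
static void group8(const unsigned *locks8, const unsigned *keys1,
                   unsigned nkeys1, __m256i acc[8])
{
  __m256i lockv[8];
  for (int k = 0; k < 8; k++)
    lockv[k] = _mm256_set1_epi32(locks8[k]);   /* broadcast each lock */
  for (unsigned j = 0; j < nkeys1; j += 8) {
    __m256i keyv = _mm256_loadu_si256((const __m256i*)&keys1[j]); /* 8 keys */
    for (int k = 0; k < 8; k++) {
      __m256i z = _mm256_cmpeq_epi32(_mm256_and_si256(lockv[k], keyv),
                                     _mm256_setzero_si256());
      acc[k] = _mm256_sub_epi32(acc[k], z);    /* subtracting ~0 adds 1 */
    }
  }
}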
 
For clang's keylocks3-256 the inner loop and the outer loop are each
unrolled two times, resulting in an inner loop like:
 
190:   vpbroadcastd (%r12,%rbx,4),%ymm5   
       vpand        %ymm3,%ymm5,%ymm6     
       vpand        %ymm4,%ymm5,%ymm5     
       vpcmpeqd     %ymm1,%ymm5,%ymm5     
       vpsubd       %ymm5,%ymm2,%ymm2     
       vpcmpeqd     %ymm1,%ymm6,%ymm5     
       vpsubd       %ymm5,%ymm0,%ymm0     
       vpbroadcastd 0x4(%r12,%rbx,4),%ymm5
       vpand        %ymm4,%ymm5,%ymm6     
       vpand        %ymm3,%ymm5,%ymm5     
       vpcmpeqd     %ymm1,%ymm5,%ymm5     
       vpsubd       %ymm5,%ymm0,%ymm0     
       vpcmpeqd     %ymm1,%ymm6,%ymm5     
       vpsubd       %ymm5,%ymm2,%ymm2     
       add          $0x2,%rbx             
       cmp          %rbx,%rsi             
       jne          190                   
 
This results in the lowest AVX2 cycles, and I expect that one can use
that approach without the crash problems and without adding too many
cycles.
The clang -march=x86-64-v4 results have similar code (with twice as
much inner-loop unrolling in case of keylocks3-512), but they all only
use AVX2 instructions and there have been successful runs on a Zen2
(which does not support AVX-512).  It seems that clang does not
support AVX-512, or it does not understand -march=x86-64-v4 to allow
more than AVX2.
 
The lowest instruction count is with gcc's keylocks2-512, where the
inner loop is:
 
230:   vpbroadcastd  0x4(%rax),%zmm4     
       vpbroadcastd  (%rax),%zmm0        
       mov           %edx,%r10d          
       add           $0x8,%rax           
       add           $0x2,%edx           
       vpandd        %zmm4,%zmm8,%zmm5   
       vpandd        %zmm0,%zmm8,%zmm9   
       vpandd        %zmm4,%zmm6,%zmm4   
       vptestnmd     %zmm5,%zmm5,%k1     
       vpandd        %zmm0,%zmm6,%zmm0   
       vmovdqa32     %zmm7,%zmm5{%k1}{z} 
       vptestnmd     %zmm9,%zmm9,%k1     
       vmovdqa32     %zmm3,%zmm9{%k1}{z} 
       vptestnmd     %zmm4,%zmm4,%k1     
       vpsubd        %zmm9,%zmm5,%zmm5   
       vpaddd        %zmm5,%zmm2,%zmm2   
       vmovdqa32     %zmm7,%zmm4{%k1}{z} 
       vptestnmd     %zmm0,%zmm0,%k1     
       vmovdqa32     %zmm3,%zmm0{%k1}{z} 
       vpsubd        %zmm0,%zmm4,%zmm0   
       vpaddd        %zmm0,%zmm1,%zmm1   
       cmp           %r10d,%r8d          
       jne           230                 
 
Due to UNROLL=32, it deals with 2 zmm registers coming from the outer
loop at a time, and the inner loop is unrolled by a factor of 2, too.
It uses vptestnmd and a predicated vmovdqa32 instead of using vpcmpeqd
(why? perhaps because an AVX-512 vpcmpeqd writes a mask register
rather than a 0/~0 vector, so the 0/~0 vector has to be materialized
separately).  Anyway, the code seems to rub Zen4 the wrong way, and it
performs at only 2.84 IPC, worse than the AVX2 code.  Rocket Lake
performs slightly better, but still, the clang code for keylocks2-512
runs a bit faster without using AVX-512.
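For comparison, working with the mask register directly would look
like this in intrinsics (a sketch; the helper name step is mine, and
whether it would actually be faster on Zen4 is untested):

#include <immintrin.h>

/* One accumulation step: test for (lock & key)==0 into a mask, then
   add 1 in the selected lanes, avoiding the materialized 0/~0 vector. */
static inline __m512i step(__m512i acc, __m512i lock, __m512i key)
{
  __mmask16 m = _mm512_testn_epi32_mask(lock, key);   /* vptestnmd */
  return _mm512_mask_add_epi32(acc, m, acc, _mm512_set1_epi32(1));
}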
 
I also saw one case where the compiler botched it:
 
gcc -Wall -DUNROLL=16 -O3 -mavx2 -c keylocks3.c
 
[/tmp/keylock:155546] LC_NUMERIC=prog perf stat -e cycles -e
instructions keylocks3-256 603800000
 
 Performance counter stats for 'keylocks3-256':
 
    17_476_700_581      cycles
    39_480_242_683      instructions          #  2.26  insn per cycle
 
       3.506995312 seconds time elapsed
 
       3.507020000 seconds user
       0.000000000 seconds sys
 
(cycles and timings on the 8700G).  Here the compiler failed to
vectorize the comparisons, performing them with scalar instructions
(first extracting the data from the SIMD registers, and finally
inserting the result into SIMD registers, with additional overhead
from spilling registers).  The result requires about 10 times more
instructions than the UNROLL=8 variant and almost 20 times more
cycles.
 
On to timings per routine invocation:
 
On a 4.4GHz Haswell (whereas Michael S. measured a 4GHz Haswell):
5.47us clang keylocks1-256 (5.66us for Michael S.'s "original code")
4.26us gcc keylocks1-256 (5.66us for Michael S.'s "original code")
2.38us gcc keylocks2-256 (2.18us for Michael S.'s manually vectorized code)
2.08us clang keylocks2-512 (2.18us for Michael S.'s manually vectorized code)
 
Michael S.'s "original code" performs similarly to my keylocks1.c on
clang.  clang's keylocks2-512 code is quite competitive with his
manual code.
 

Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
That could be due to differences in measurement methodology - I
reported the median of 11 runs, while you seem to report the average.
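For reference, median-of-n is simple enough (a trivial sketch):

#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
  double x = *(const double*)a, y = *(const double*)b;
  return (x > y) - (x < y);
}

/* median of n timing samples (n odd, e.g. 11) */
static double median(double *t, int n)
{
  qsort(t, n, sizeof *t, cmp_double);
  return t[n/2];
}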

On the Golden Cove of a Core i3-1315U (compared to the best result by
Terje Mathisen on a Core i7-1365U; the latter can run up to 5.2GHz
according to Intel, whereas the former can supposedly run up to
4.5GHz; I only ever measured at most 3.8GHz on our NUC, and this time
as well):
 

I always thought that NUCs have better cooling than all but high-end
laptops. Was I wrong? Such slowness is disappointing.

5.25us Terje Mathisen's Rust code compiled by clang (best on the 1365U)
4.93us clang keylocks1-256 on a 3.8GHz 1315U
4.17us gcc keylocks1-256 on a 3.8GHz 1315U
3.16us gcc keylocks2-256 on a 3.8GHz 1315U
2.38us clang keylocks2-512 on a 3.8GHz 1315U
 

So, for the best-performing variant, the IPC of Golden Cove is
identical to that of ancient Haswell? That's very disappointing.
Haswell has a 4-wide front end, and the majority of AVX2 integer
instructions are limited to a throughput of two per clock. Golden Cove
has a 5+ wide front end, and nearly all AVX2 integer instructions have
a throughput of three per clock.
Could it be that clang introduced some sort of latency bottleneck?
Maybe a single accumulator? If that is the case, my code should run
about the same as clang's on the resource-starved Haswell, but
measurably faster on Golden Cove.
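
The suspected bottleneck is easy to sketch: with a single accumulator
every vpsubd depends on the previous one, so the loop runs at vpsubd
latency; splitting the accumulator breaks the chain (a schematic
sketch; the array v[] stands in for the per-key compare results):

#include <immintrin.h>

static __m256i sum1(const __m256i *v, int n)
{
  __m256i acc = _mm256_setzero_si256();
  for (int j = 0; j < n; j++)
    acc = _mm256_sub_epi32(acc, v[j]);       /* serialized acc chain */
  return acc;
}

static __m256i sum2(const __m256i *v, int n)  /* n even */
{
  __m256i a0 = _mm256_setzero_si256(), a1 = _mm256_setzero_si256();
  for (int j = 0; j < n; j += 2) {
    a0 = _mm256_sub_epi32(a0, v[j]);         /* two independent chains */
    a1 = _mm256_sub_epi32(a1, v[j+1]);       /* overlap in the pipeline */
  }
  return _mm256_add_epi32(a0, a1);
}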

I would have expected the clang keylocks1-256 to run slower, because
the compiler back-end is the same and the 1315U is slower.  Measuring
cycles looks more relevant for this benchmark to me than measuring
time, especially on this core where AVX-512 is disabled and there is
no AVX slowdown.
 

I prefer time, because in the end it's the only thing that matters.

- anton


