Re: Cost of handling misaligned access

Subject: Re: Cost of handling misaligned access
From: already5chosen (at) *nospam* yahoo.com (Michael S)
Newsgroups: comp.arch
Date: 08 Feb 2025, 18:21:19
Organization: A noiseless patient Spider
Message-ID: <20250208192119.0000148e@yahoo.com>
References: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
User-Agent: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
On Sat, 08 Feb 2025 08:11:04 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

Michael S <already5chosen@yahoo.com> writes:
In the meantime, I did a few measurements on a Xeon E3 1271 v3. That
is a rather old uArch - Haswell, the first core that supports AVX2.
During the tests it was running at 4.0 GHz.

1. Original code (rewritten in plain C) compiled with clang -O3
   -march=ivybridge (no AVX2)
2. Original code (rewritten in plain C) compiled with clang -O3
   -march=haswell (AVX2)
3. Manually vectorized AVX2 code compiled with clang -O3
   -march=skylake (AVX2)

Results were as follows (usec/call):
1 - 5.66
2 - 5.56
3 - 2.18
 
In the meantime, I also wrote the original code in plain C
(keylocks1.c), then implemented your idea of unrolling the outer loop
and comparing a subarray of locks to each key (is this called strip
mining?) in plain C (with the hope that auto-vectorization works) as
keylocks2.c, and finally rewrote the latter version to use gcc vector
extensions (keylocks3.c).  I wrote a dummy main around that that calls
the routine 100_000 times; given that the original routine's
performance does not depend on the data, and I used non-0 keys (so
keylocks[23].c does not skip any keys), the actual data is not
important.
 

I used keys filled with pseudo-random bits, with the probability of a
0 bit being 0.705. The probability was chosen to get final results
similar to Terje's.
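For reference, generation along these lines matches that description
(a minimal sketch; the function name gen_keys, the RNG, and the
seeding are illustrative assumptions, not the actual test code):

#include <stdint.h>
#include <stdlib.h>

/* Fill keys[] with 32-bit words in which each bit is 0 with
   probability p0 (here p0 = 0.705).  rand() is only for
   illustration; any PRNG will do. */
static void gen_keys(uint32_t *keys, unsigned nkeys, double p0)
{
  for (unsigned i = 0; i < nkeys; i++) {
    uint32_t k = 0;
    for (int b = 0; b < 32; b++)
      if ((double)rand() / RAND_MAX >= p0)  /* bit is 1 with prob 1-p0 */
        k |= (uint32_t)1 << b;
    keys[i] = k;
  }
}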


You can find the source code and the binaries I measured at
<http://www.complang.tuwien.ac.at/anton/keylock/>.  The binaries were
compiled with gcc 12.2.0 and (in the clang subdirectory) clang-14.0.6;
the clang compilations sometimes used different UNROLL factors than
the gcc compilations (and I am unsure which; see below).
 
The original code is:
 
unsigned keylocks(unsigned keys[], unsigned nkeys,
                  unsigned locks[], unsigned nlocks) {
  unsigned i, j;
  unsigned part1 = 0;
  for (i=0; i<nlocks; i++) {
    unsigned lock = locks[i];
    for (j=0; j<nkeys; j++)
      part1 += (lock & keys[j])==0;
  }
  return part1;
}
 
For keylocks2.c the central loops are:
 
  for (i=0; i<UNROLL; i++)
    part0[i]=0;
  for (i=0; i<nlocks1; i+=UNROLL) {
    for (j=0; j<nkeys1; j++) {
      unsigned key = keys1[j];
      for (k=0; k<UNROLL; k++)
        part0[k] += (locks1[i+k] & key)==0;
    }
  }
 
For UNROLL I tried 8, 16, and 32 for AVX2 and 16, 32, or 64 for
AVX-512; the numbers below are for those factors that produce the
lowest cycles on the Rocket Lake machine.
 
The central loops are preceded by code to arrange the data such that
this code works: the locks are copied to the longer array locks1,
whose length is a multiple of UNROLL; the entries beyond nlocks are ~0
(which increases the count by 0, because keys1 contains no 0 keys);
and the keys are copied to keys1 with 0 keys removed, so that the
extra locks are not counted (which also may increase efficiency if
there is a key=0).  The central loops are followed by summing up the
elements of part0.
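A minimal sketch of that arrangement step (the function name arrange
and the exact copying strategy are assumptions; UNROLL is the
compile-time factor from above):

/* Pad locks into locks1 (length rounded up to a multiple of UNROLL,
   padded with ~0) and copy the non-zero keys into keys1.
   Returns the padded lock count; writes the new key count. */
static unsigned arrange(const unsigned *locks, unsigned nlocks,
                        const unsigned *keys, unsigned nkeys,
                        unsigned *locks1, unsigned *keys1,
                        unsigned *nkeys1)
{
  unsigned i, nlocks1 = (nlocks + UNROLL - 1) / UNROLL * UNROLL;
  for (i = 0; i < nlocks; i++)
    locks1[i] = locks[i];
  for (; i < nlocks1; i++)
    locks1[i] = ~0U;          /* ~0 & key != 0 for every non-zero key */
  unsigned n = 0;
  for (i = 0; i < nkeys; i++)
    if (keys[i] != 0)         /* a 0 key would match the ~0 padding */
      keys1[n++] = keys[i];
  *nkeys1 = n;
  return nlocks1;
}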
 
keylocks3.c, which uses the gcc vector extensions, just changes
keylocks2.c in a few places.  In particular, it adds a type vu:
 
typedef unsigned vu __attribute__ ((vector_size
(UNROLL*sizeof(unsigned))));
 
The central loops now look as follows:
 
  for (i=0; i<UNROLL; i++)
    part0[i]=0;
  for (i=0; i<nlocks1; i+=UNROLL) {
    vu lock = *(vu *)(locks1+i);
    for (j=0; j<nkeys1; j++) {
      part0 -= (lock & keys1[j])==0;
    }
  }
 
One interesting aspect of the gcc vector extensions is that the result
of comparing two vectors is 0 (false) or ~0 (true) (per element),
whereas for scalars the value for true is 1.  Therefore the code above
updates part0 with -=, whereas in keylocks2.c += is used.
 
While the use of ~0 is a good choice when designing a new programming
language, I would have gone for 1 in the case of a vector extension
for C, for consistency with the scalar case; in combination with
hardware that produces ~0 (e.g., Intel SSE and AVX SIMD stuff), that
means that the compiler will introduce a negation in its intermediate
representation at some point; I expect that compilers will usually be
able to optimize this negation away, but I would not be surprised at
cases where my expectation is disappointed.
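A minimal illustration of the two semantics (the helper demo and the
signed vector type vi are mine; vi is the element-wise signed integer
type gcc gives to vector comparison results):

typedef unsigned vu __attribute__ ((vector_size (8*sizeof(unsigned))));
typedef int      vi __attribute__ ((vector_size (8*sizeof(int))));

void demo(void)
{
  vu v = {0, 1, 2, 3, 4, 5, 6, 7};
  vi m = (v == 0);        /* per element: ~0 if true, 0 if false      */
  int s = (v[0] == 0);    /* scalar comparison: 1 if true, 0 if false */
  /* hence keylocks3.c counts with part0 -=, keylocks2.c with part0 += */
  (void)m; (void)s;
}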
 
keylocks3.c compiles without warning on clang, but the result usually
segfaults (but sometimes does not, e.g., in the timed run on Zen4; it
segfaults in other runs on Zen4).  I have not investigated why this
happens, I just did not include results from runs where it segfaulted;
and I tried additional runs for keylocks3-512 on Zen4 in order to have
one result there.
 
I would have liked to compare the performance of my code against your
code, but your code apparently was destroyed by arbitrary line
breaking in your news-posting software.

Or by my own pasting mistake. I am still not sure whom to blame.
The mistake was tiny - the absence of // at the beginning of one line,
but enough to keep it from compiling. Trying a second time:

#include <stdint.h>
#include <immintrin.h>

int foo_tst(const uint32_t* keylocks, int len, int li)
{
  if (li >= len || li <= 0)
    return 0;
  const uint32_t* keyx = &keylocks[li];
  unsigned ni = len - li;
  __m256i res0 = _mm256_setzero_si256();
  __m256i res1 = _mm256_setzero_si256();
  __m256i res2 = _mm256_setzero_si256();
  __m256i res3 = _mm256_setzero_si256();
  const uint32_t* keyx_last = &keyx[ni & -32];
  for (; keyx != keyx_last; keyx += 32) {
    __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyx[0*8]);
    __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyx[1*8]);
    __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyx[2*8]);
    __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyx[3*8]);
    for (const uint32_t* keyy = keylocks; keyy != &keylocks[li]; ++keyy) {
      __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyy));
      res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
      res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
      res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
      res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
    }
  }
  int res = 0;
  if (ni % 32) { // tail: fewer than 32 keyx entries remain
    uint32_t tmp[32];
    const uint32_t* keyy_last = &keylocks[li & -32];
    if (li % 32) { // pad the keyy tail with ~0, which matches nothing
      for (int k = 0; k < li % 32; ++k)
        tmp[k] = keyy_last[k];
      for (int k = li % 32; k < 32; ++k)
        tmp[k] = (uint32_t)-1;
    }
    const uint32_t* keyx_last = &keyx[ni % 32];
    int nz = 0;
    for (; keyx != keyx_last; keyx += 1) {
      if (*keyx) {
        __m256i lockk = _mm256_castps_si256(_mm256_broadcast_ss((const float*)keyx));
        for (const uint32_t* keyy = keylocks; keyy != keyy_last; keyy += 32) {
          __m256i lock0 = _mm256_loadu_si256((const __m256i*)&keyy[0*8]);
          __m256i lock1 = _mm256_loadu_si256((const __m256i*)&keyy[1*8]);
          __m256i lock2 = _mm256_loadu_si256((const __m256i*)&keyy[2*8]);
          __m256i lock3 = _mm256_loadu_si256((const __m256i*)&keyy[3*8]);
          res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
          res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
          res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
          res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
        }
        if (li % 32) {
          __m256i lock0 = _mm256_loadu_si256((const __m256i*)&tmp[0*8]);
          __m256i lock1 = _mm256_loadu_si256((const __m256i*)&tmp[1*8]);
          __m256i lock2 = _mm256_loadu_si256((const __m256i*)&tmp[2*8]);
          __m256i lock3 = _mm256_loadu_si256((const __m256i*)&tmp[3*8]);
          res0 = _mm256_sub_epi32(res0, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock0), _mm256_setzero_si256()));
          res1 = _mm256_sub_epi32(res1, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock1), _mm256_setzero_si256()));
          res2 = _mm256_sub_epi32(res2, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock2), _mm256_setzero_si256()));
          res3 = _mm256_sub_epi32(res3, _mm256_cmpeq_epi32(_mm256_and_si256(lockk, lock3), _mm256_setzero_si256()));
        }
      } else {
        nz += 1; // a zero keyx entry matches all li keyy entries
      }
    }
    res = nz * li;
  }
  // fold accumulators
  res0 = _mm256_add_epi32(res0, res2);
  res1 = _mm256_add_epi32(res1, res3);
  res0 = _mm256_add_epi32(res0, res1);
  res0 = _mm256_hadd_epi32(res0, res0);
  res0 = _mm256_hadd_epi32(res0, res0);

  res += _mm256_extract_epi32(res0, 0);
  res += _mm256_extract_epi32(res0, 4);
  return res;
}
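
For reference, a driver in the spirit of the dummy main described
above might look like this (NKEYS, NLOCKS, and the fill pattern are
assumptions; only the 100_000 call count is from Anton's description):

#include <stdint.h>
#include <stdio.h>

#define NKEYS  250
#define NLOCKS 250
#define CALLS  100000

int foo_tst(const uint32_t* keylocks, int len, int li);

int main(void)
{
  static uint32_t keylocks[NKEYS + NLOCKS];
  /* first NKEYS entries are keys, the rest are locks */
  for (int i = 0; i < NKEYS + NLOCKS; i++)
    keylocks[i] = 0x9E3779B9u * (i + 1);   /* arbitrary non-zero data */
  unsigned sum = 0;
  for (int r = 0; r < CALLS; r++)
    sum += foo_tst(keylocks, NKEYS + NLOCKS, NKEYS);
  printf("%u\n", sum);
  return 0;
}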


Anyway, here are my results.
First cycles (which eliminates worries about turbo modes) and
instructions, then usec/call.
 

I don't understand that.
For the original code optimized by clang I'd expect 22,000 cycles and
5.15 usec per call on Haswell. Your numbers don't even resemble
anything like that.

The cores are:
 

<snip>

The number of instructions executed is
(reported on the Zen4):
 
instructions   compiler ISA   keylocks UNROLL
5_779_542_242  gcc      avx2  1
3_484_942_148  gcc      avx2  2        8
5_885_742_164  gcc      avx2  3        8
7_903_138_230  clang    avx2  1
7_743_938_183  clang    avx2  2        8?
3_625_338_104  clang    avx2  3        8?
4_204_442_194  gcc      512   1
2_564_142_161  gcc      512   2        32
3_061_042_178  gcc      512   3        16
7_703_938_205  clang    512   1
3_402_238_102  clang    512   2        16?
3_320_455_741  clang    512   3        16?
 

I don't understand these numbers either. For the original clang
version, I'd expect 25,000 instructions per call, or 33,000 if, unlike
in Terje's Rust case, your clang generates RISC-style sequences. Your
number is somehow 240,000 times bigger.

For gcc -mavx2 on keylocks3.c on Zen4 an IPC of 6.44 is reported,
while microarchitecture descriptions report only a 6-wide renamer
<https://chipsandcheese.com/p/amds-zen-4-part-1-frontend-and-execution-engine>.
My guess is that the front end combined some instructions (maybe
compare and branch) into a macro-op, and the renamer then processed 6
macro-ops that represented more instructions.  The inner loop is
 
       │190:   vpbroadcastd (%rax),%ymm0
  1.90 │       add          $0x4,%rax
       │       vpand        %ymm2,%ymm0,%ymm0
  1.09 │       vpcmpeqd     %ymm3,%ymm0,%ymm0
  0.41 │       vpsubd       %ymm0,%ymm1,%ymm1
 78.30 │       cmp          %rdx,%rax
       │       jne          190
 
and if the cmp and jne are combined into one macro-op, that would be
perfect for executing one iteration per cycle.
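As a back-of-the-envelope check:

  7 instructions/iteration; cmp+jne fused -> 6 macro-ops/iteration
  1 iteration/cycle -> 6 macro-ops/cycle, matching the 6-wide renamer
  but 7 instructions/cycle as counted, so a measured IPC between 6 and
  7 (here 6.44, diluted by the code outside this loop) is consistent.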
 
It's interesting that gcc's keylocks2-256 results in far fewer
instructions (and eventually, cycles).  It unrolls the inner loop 8
times to process the keys in SIMD fashion, too, loading the keys one
ymm register at a time.  In order to do that it arranges the locks in
8 different ymm registers in the outer loop, so the inner loop
performs 8 sequences similar to
 
vpand        %ymm0,%ymm15,%ymm2
vpcmpeqd     %ymm1,%ymm2,%ymm2
vpsubd       %ymm2,%ymm4,%ymm4
 
surrounded by
 
300:   vmovdqu      (%rsi),%ymm0
       add          $0x20,%rsi
       [8 3-instruction sequences]
       cmp          %rsi,%rdx
       jne          300
 
It also uses 8 ymm accumulators, so not all of that fits into
registers, and three of the anded values are stored on the stack.  For
Zen4 this could be improved by using only 2 accumulators.  In any
case, the gcc people did something clever here, and I do not
understand how they got there from the source code, and why they did
not get there from keylocks1.c.
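In intrinsics, the transformation appears to be roughly the following
(a sketch reconstructed from the assembly, not gcc's actual output;
group8 and the parameter names are mine):

#include <immintrin.h>

/* What gcc's keylocks2-256 loops appear to compute for one group of
   8 locks; nkeys1 must be a multiple of 8 here. */
static void group8(const unsigned *locks8, const unsigned *keys1,
                   unsigned nkeys1, __m256i acc[8])
{
  __m256i lockv[8];
  for (int k = 0; k < 8; k++)
    lockv[k] = _mm256_set1_epi32(locks8[k]);   /* broadcast each lock */
  for (unsigned j = 0; j < nkeys1; j += 8) {
    __m256i keyv = _mm256_loadu_si256((const __m256i*)&keys1[j]); /* 8 keys */
    for (int k = 0; k < 8; k++) {
      __m256i z = _mm256_cmpeq_epi32(_mm256_and_si256(lockv[k], keyv),
                                     _mm256_setzero_si256());
      acc[k] = _mm256_sub_epi32(acc[k], z);    /* subtracting ~0 adds 1 */
    }
  }
}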
 
For clang's keylocks3-256 the inner loop and the outer loop are each
unrolled two times, resulting in an inner loop like:
 
190:   vpbroadcastd (%r12,%rbx,4),%ymm5   
       vpand        %ymm3,%ymm5,%ymm6     
       vpand        %ymm4,%ymm5,%ymm5     
       vpcmpeqd     %ymm1,%ymm5,%ymm5     
       vpsubd       %ymm5,%ymm2,%ymm2     
       vpcmpeqd     %ymm1,%ymm6,%ymm5     
       vpsubd       %ymm5,%ymm0,%ymm0     
       vpbroadcastd 0x4(%r12,%rbx,4),%ymm5
       vpand        %ymm4,%ymm5,%ymm6     
       vpand        %ymm3,%ymm5,%ymm5     
       vpcmpeqd     %ymm1,%ymm5,%ymm5     
       vpsubd       %ymm5,%ymm0,%ymm0     
       vpcmpeqd     %ymm1,%ymm6,%ymm5     
       vpsubd       %ymm5,%ymm2,%ymm2     
       add          $0x2,%rbx             
       cmp          %rbx,%rsi             
       jne          190                   
 
This results in the lowest AVX2 cycles, and I expect that one can use
that approach without the crash problems and without adding too many
cycles.
The clang -march=x86-64-v4 results have similar code (with twice as
much inner-loop unrolling in case of keylocks3-512), but they all only
use AVX2 instructions and there have been successful runs on a Zen2
(which does not support AVX-512).  It seems that clang does not
support AVX-512, or it does not understand -march=x86-64-v4 to allow
more than AVX2.
 
The lowest instruction count is with gcc's keylocks2-512, where the
inner loop is:
 
230:   vpbroadcastd  0x4(%rax),%zmm4     
       vpbroadcastd  (%rax),%zmm0        
       mov           %edx,%r10d          
       add           $0x8,%rax           
       add           $0x2,%edx           
       vpandd        %zmm4,%zmm8,%zmm5   
       vpandd        %zmm0,%zmm8,%zmm9   
       vpandd        %zmm4,%zmm6,%zmm4   
       vptestnmd     %zmm5,%zmm5,%k1     
       vpandd        %zmm0,%zmm6,%zmm0   
       vmovdqa32     %zmm7,%zmm5{%k1}{z} 
       vptestnmd     %zmm9,%zmm9,%k1     
       vmovdqa32     %zmm3,%zmm9{%k1}{z} 
       vptestnmd     %zmm4,%zmm4,%k1     
       vpsubd        %zmm9,%zmm5,%zmm5   
       vpaddd        %zmm5,%zmm2,%zmm2   
       vmovdqa32     %zmm7,%zmm4{%k1}{z} 
       vptestnmd     %zmm0,%zmm0,%k1     
       vmovdqa32     %zmm3,%zmm0{%k1}{z} 
       vpsubd        %zmm0,%zmm4,%zmm0   
       vpaddd        %zmm0,%zmm1,%zmm1   
       cmp           %r10d,%r8d          
       jne           230                 
 
Due to UNROLL=32, it deals with 2 zmm registers coming from the outer
loop at a time, and the inner loop is unrolled by a factor of 2, too.
It uses vptestnmd and a predicated vmovdqa32 instead of using vpcmpeqd
(why? perhaps because an AVX-512 vpcmpeqd writes a mask register
rather than a 0/~0 vector, so the 0/~0 vector has to be materialized
separately).  Anyway, the code seems to rub Zen4 the wrong way, and it
performs at only 2.84 IPC, worse than the AVX2 code.  Rocket Lake
performs slightly better, but still, the clang code for keylocks2-512
runs a bit faster without using AVX-512.
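For comparison, working with the mask register directly would look
like this in intrinsics (a sketch; the helper name step is mine, and
whether it would actually be faster on Zen4 is untested):

#include <immintrin.h>

/* One accumulation step: test for (lock & key)==0 into a mask, then
   add 1 in the selected lanes, avoiding the materialized 0/~0 vector. */
static inline __m512i step(__m512i acc, __m512i lock, __m512i key)
{
  __mmask16 m = _mm512_testn_epi32_mask(lock, key);   /* vptestnmd */
  return _mm512_mask_add_epi32(acc, m, acc, _mm512_set1_epi32(1));
}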
 
I also saw one case where the compiler botched it:
 
gcc -Wall -DUNROLL=16 -O3 -mavx2 -c keylocks3.c
 
[/tmp/keylock:155546] LC_NUMERIC=prog perf stat -e cycles -e
instructions keylocks3-256 603800000
 
 Performance counter stats for 'keylocks3-256':
 
    17_476_700_581      cycles
    39_480_242_683      instructions          #  2.26  insn per cycle
 
       3.506995312 seconds time elapsed
 
       3.507020000 seconds user
       0.000000000 seconds sys
 
(cycles and timings on the 8700G).  Here the compiler failed to
vectorize the comparisons, performing them with scalar instructions
(first extracting the data from the SIMD registers, and finally
inserting the result into SIMD registers, with additional overhead
from spilling registers).  The result requires about 10 times more
instructions than the UNROLL=8 variant and almost 20 times more
cycles.
 
On to timings per routine invocation:
 
On a 4.4GHz Haswell (whereas Michael S. measured a 4GHz Haswell):
5.47us clang keylocks1-256 (5.66us for Michael S.'s "original code")
4.26us gcc keylocks1-256 (5.66us for Michael S.'s "original code")
2.38us gcc keylocks2-256 (2.18us for Michael S.'s manually vectorized code)
2.08us clang keylocks2-512 (2.18us for Michael S.'s manually vectorized code)
 
Michael S.'s "original code" performs similarly to my keylocks1.c on
clang.  clang's keylocks2-512 code is quite competitive with his
manual code.
 

Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
That could be due to differences in measurement methodology - I
reported the median of 11 runs, while you seem to report the average.
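For reference, median-of-n is simple enough (a trivial sketch):

#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
  double x = *(const double*)a, y = *(const double*)b;
  return (x > y) - (x < y);
}

/* median of n timing samples (n odd, e.g. 11) */
static double median(double *t, int n)
{
  qsort(t, n, sizeof *t, cmp_double);
  return t[n/2];
}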

On the Golden Cove of a Core i3-1315U (compared to the best result by
Terje Mathisen on a Core i7-1365U; the latter can run up to 5.2GHz
according to Intel, whereas the former can supposedly run up to
4.5GHz; I only ever measured at most 3.8GHz on our NUC, and this time
as well):
 

I always thought that NUCs have better cooling than all but high-end
laptops. Was I wrong? Such slowness is disappointing.

5.25us Terje Mathisen's Rust code compiled by clang (best on the 1365U)
4.93us clang keylocks1-256 on a 3.8GHz 1315U
4.17us gcc keylocks1-256 on a 3.8GHz 1315U
3.16us gcc keylocks2-256 on a 3.8GHz 1315U
2.38us clang keylocks2-512 on a 3.8GHz 1315U
 

So, for the best-performing variant, the IPC of Golden Cove is
identical to that of ancient Haswell? That's very disappointing.
Haswell has a 4-wide front end, and the majority of AVX2 integer
instructions are limited to a throughput of two per clock. Golden Cove
has a 5+ wide front end, and nearly all AVX2 integer instructions have
a throughput of three per clock.
Could it be that clang introduced some sort of latency bottleneck?
Maybe a single accumulator? If that is the case, my code should run
about the same as clang's on the resource-starved Haswell, but
measurably faster on Golden Cove.
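
The suspected bottleneck is easy to sketch: with a single accumulator
every vpsubd depends on the previous one, so the loop runs at vpsubd
latency; splitting the accumulator breaks the chain (a schematic
sketch; the array v[] stands in for the per-key compare results):

#include <immintrin.h>

static __m256i sum1(const __m256i *v, int n)
{
  __m256i acc = _mm256_setzero_si256();
  for (int j = 0; j < n; j++)
    acc = _mm256_sub_epi32(acc, v[j]);       /* serialized acc chain */
  return acc;
}

static __m256i sum2(const __m256i *v, int n)  /* n even */
{
  __m256i a0 = _mm256_setzero_si256(), a1 = _mm256_setzero_si256();
  for (int j = 0; j < n; j += 2) {
    a0 = _mm256_sub_epi32(a0, v[j]);         /* two independent chains */
    a1 = _mm256_sub_epi32(a1, v[j+1]);       /* overlap in the pipeline */
  }
  return _mm256_add_epi32(a0, a1);
}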

I would have expected the clang keylocks1-256 to run slower, because
the compiler back-end is the same and the 1315U is slower.  Measuring
cycles looks more relevant for this benchmark to me than measuring
time, especially on this core where AVX-512 is disabled and there is
no AVX slowdown.
 

I prefer time, because in the end it's the only thing that matters.

- anton


