Subject: Re: Cost of handling misaligned access
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: 06. Feb 2025, 19:19:09
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Feb6.191909@mips.complang.tuwien.ac.at>
References: 1 2 3 4 5 6 7
User-Agent: xrn 10.11

antispam@fricas.org (Waldek Hebisch) writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>> But the computer architecture trend is clear: General-purpose
>> computers do not have alignment restrictions; all that had them have
>> been discontinued; the last one that had them was SPARC.
>
>The trend is clear, but there is a question: is it a good trend?
>You wrote about lazy hardware designers, but there are many
>more lazy programmers.

Lazy programmers use high-level languages which align everything
anyway.

>There are situations when unaligned
>access is needed, but significant proportion of unaligned
>accesses is not needed at all.

What evidence do you have for this claim?

>At best such unaligned
>accesses lead to small performance loss,

They may also lead to a small performance win.
I tried to turn on alignment checks:
First on IA-32: There I found that memcpy() etc. use unaligned
accesses, but I could replace these functions. But then I found that
8-byte FP numbers are aligned at 4-byte boundaries because the ABI
says so, but the alignment check faults in that case. So I gave up on
turning on alignment checks.
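
The ABI point can be illustrated with a minimal sketch in C. The struct and function names are made up for illustration, and the sketch assumes a 4-byte int; the offsets in the comments are what the i386 System V ABI and the AMD64 ABI specify for such a struct, not something measured here.

```c
#include <stddef.h>

/* Hypothetical struct: a 4-byte field followed by an 8-byte FP number. */
struct int_then_double {
    int    i;
    double d;
};

/* Offset at which the ABI places d.  Under the i386 System V ABI a
   double inside a struct only needs 4-byte alignment, so the offset is
   4 there; under the AMD64 ABI it is 8.  With the alignment check
   turned on, an 8-byte access to a field at offset 4 can fault. */
size_t double_offset(void)
{
    return offsetof(struct int_then_double, d);
}
```

Printing double_offset() from a program compiled once with -m32 and once with -m64 is one way to see the difference, assuming the compiler follows the respective System V ABI.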
Later on AMD64: The ABI does not have that bug there, and I worked
around memcpy() etc. However, I found that gcc produced unaligned
2-byte accesses (rather than two 1-byte accesses) for things like
strcpy(var,"w"). I did not find a way to suppress that code
generation feature of gcc, so I gave up on this attempt.
Did I find any cases on AMD64 where I think there will be a
performance loss? No, on the contrary, I expect that, on average, the
2-byte accesses will be faster than two one-byte accesses. And
unaligned accesses on memcpy() are clearly a win over accessing the
memory byte-by-byte.
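
To make the strcpy case concrete, here is a minimal sketch in C of the two ways to write the 2-byte string "w" (the character plus the terminating NUL). The function names are made up for illustration. The memcpy() of a uint16_t is the usual portable way to ask for a possibly unaligned 2-byte store; whether a given compiler turns it into a single store instruction is an assumption that depends on target and optimization level.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store "w" as two separate 1-byte stores. */
void store_w_bytewise(char *dst)
{
    dst[0] = 'w';
    dst[1] = '\0';
}

/* Hypothetical helper: store "w" as one 2-byte unit.  dst may be at any
   address; memcpy() avoids the undefined behaviour of casting dst to
   uint16_t *.  Compilers commonly turn the second memcpy() into one
   2-byte store on targets that allow unaligned accesses. */
void store_w_halfword(char *dst)
{
    uint16_t w;
    memcpy(&w, "w", sizeof w);   /* 'w' followed by NUL, in memory order */
    memcpy(dst, &w, sizeof w);
}
```

Both functions leave the same bytes in memory; the difference is only in how many store instructions are needed.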

>There are cases when unaligned accesses
>are better than aligned ones, for that architecture
>should have appropriate instructions.

SSE has MOVDQU and MOVDQA. MOVDQA is completely pointless, because it
checks for 16-byte alignment, rather than element alignment. If
designed properly, it would have MOVDQ2A, MOVDQ4A, MOVDQ8A. But do we
actually need it? The experience mentioned above indicates that we
don't.
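
The difference between the two kinds of check can be sketched in a few lines of C. The function names are hypothetical; the first models the 16-byte test that MOVDQA applies, the second models the element-granularity test that an instruction like the imagined MOVDQ4A would apply.

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero if p would pass the check MOVDQA performs:
   the address must be a multiple of 16. */
int passes_movdqa_check(const void *p)
{
    return ((uintptr_t)p % 16) == 0;
}

/* Nonzero if p is aligned to the element size (2, 4 or 8), which is
   what a hypothetical MOVDQ2A/MOVDQ4A/MOVDQ8A would check. */
int passes_element_check(const void *p, size_t elem_size)
{
    return ((uintptr_t)p % elem_size) == 0;
}
```

For an array of 4-byte elements that starts on a 16-byte boundary, the address of element 1 passes the element check but fails the MOVDQA check, although every element in the vector is naturally aligned.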

>You call doubling store time 'small penalty'. For me, in a
>performance-critical loop, 10% matters and it is worth
>aligning things to avoid such a loss.

The question is how much of the loop is spent in loads and stores, and
how do you avoid the unaligned accesses: E.g., for the case mentioned
earlier

  for (i=0; i<n; i++)
    a[i] = b[i] + c[i];

For performance reasons, you want to use SIMD instructions for that
and align each SIMD memory access to SIMD granularity. But what if a,
b, c have different starting points modulo the SIMD granularity?
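
A minimal sketch of what one can do in that case, assuming 32-bit unsigned elements (the element type is not given above) and SSE2 with a scalar fallback; the function name is made up. Peeling scalar iterations can align only one of the three streams, here the store to a; the loads from b and c then have to use the unaligned form.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* a[i] = b[i] + c[i]; a, b and c are assumed to be element-aligned but
   may start at mutually different offsets modulo 16 bytes. */
void add_arrays(uint32_t *a, const uint32_t *b, const uint32_t *c, size_t n)
{
    size_t i = 0;
#if defined(__SSE2__)
    /* Peel scalar iterations until the store address is 16-byte
       aligned.  This fixes the alignment of a only; b+i and c+i end up
       wherever their starting offsets put them. */
    while (i < n && ((uintptr_t)(a + i) % 16) != 0) {
        a[i] = b[i] + c[i];
        i++;
    }
    for (; i + 4 <= n; i += 4) {
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i)); /* unaligned load */
        __m128i vc = _mm_loadu_si128((const __m128i *)(c + i)); /* unaligned load */
        _mm_store_si128((__m128i *)(a + i), _mm_add_epi32(vb, vc)); /* aligned store */
    }
#endif
    /* Scalar tail (or the whole loop when SSE2 is not available). */
    for (; i < n; i++)
        a[i] = b[i] + c[i];
}
```

Calling it as add_arrays(x + 1, y + 2, z + 3, 30) on three 16-byte-aligned arrays exercises exactly the situation in question: whatever is peeled for a, the accesses to b and c stay misaligned with respect to 16 bytes.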

>For me much more important are loads.

My data for loads is older (and for older hardware):
<http://al.howardknight.net/?ID=143135464800>. But the links to the
benchmarks are there, you can measure it on modern hardware. Maybe I
will find the time at some point and measure modern hardware.

>First, there is more of
>them. Second, stores can be buffered and latency of store itself
>is of little importance (latency from store to load matters).
>For loads extra things in load path increase latency and that
>may limit program speed.

I notice that the SiFive CPUs have no proper hardware support for
unaligned accesses and have much lower clock rate than the Intel, AMD,
ARM, Apple, and Qualcomm cores that support unaligned accesses. So
the evidence does not support your claim.
- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>