Subject : Re: Cost of handling misaligned access
From : antispam (at) *nospam* fricas.org (Waldek Hebisch)
Newsgroups : comp.arch
Date : 06. Feb 2025, 16:58:07
Organization : To protect and to server
Message-ID : <vo2m6d$20glj$1@paganini.bofh.team>
References : 1 2 3 4 5 6
User-Agent : tin/2.6.2-20221225 ("Pittyvaich") (Linux/6.1.0-9-amd64 (x86_64))
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
> antispam@fricas.org (Waldek Hebisch) writes:
>> Concerning SIMD: trouble here is increasing vector length and
>> consequently increasing alignment requirements.
>
> That is not a necessary consequence, on the contrary: alignment
> requirements based on SIMD granularity is hardware designer laziness,
> but means that SIMD cannot be used for many of the applications where
> SIMD without that limitation can be used.
>
> If you want to have alignment checks, then a SIMD instruction should
> check for element alignment, not for SIMD alignment.
>
> But the computer architecture trend is clear: General-purpose
> computers do not have alignment restrictions; all that had them have
> been discontinued; the last one that had them was SPARC.

The trend is clear, but there is a question: is it a good trend?
You wrote about lazy hardware designers, but there are many
more lazy programmers. There are situations where unaligned
access is needed, but a significant proportion of unaligned
accesses is not needed at all. At best such unaligned
accesses lead to a small performance loss, but they may also
be latent bugs. There are cases where unaligned accesses
are better than aligned ones; for those the architecture
should have appropriate instructions.
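
To make the distinction between element alignment and SIMD (vector)
alignment concrete, here is a minimal C sketch. The helper name
addr_is_aligned and the sizes used with it (4-byte elements, 32-byte
vectors) are assumptions chosen for illustration, not taken from any
particular ISA or library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns nonzero if addr is a multiple of align.
 * align must be a nonzero power of two. */
static int addr_is_aligned(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}
```

An address such as 68 passes the element check (multiple of 4) but
fails the vector check (not a multiple of 32). A debug build could use
the element check to catch accidental misalignment, i.e. the latent-bug
case, while still permitting vector accesses that start at any element.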

>> A lot of SIMD code is memory-bound and the current way of doing
>> misaligned access leads to worse performance. So really no good way
>> to solve this. In principle a set of buffers for 2 cache lines
>> each and appropriate shifters could give optimal throughput,
>> but probably would lead to increased latency.
>
> AFAIK that's what current microarchitectures do, and in many cases
> with small penalties for unaligned accesses; see
> https://www.complang.tuwien.ac.at/anton/unaligned-stores/

You call doubling store time a 'small penalty'. For me, in a
performance-critical loop 10% matters, and it is worth
aligning things to avoid such a loss. And what you present
does not look like what I wrote above: AFAICS what Intel
does works within a single cache line, and there is a penalty when
crossing lines (with buffers holding 2 cache lines there would be
no penalty for line crossing).
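
Whether a given access pays the line-crossing penalty can be checked
with a few lines of C. This is only a sketch: the helper name
crosses_line is hypothetical, and the line size is passed in as a
parameter (64 bytes is typical for current x86 parts, but that is an
assumption, not something the measurements above establish).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns nonzero if an access of `size` bytes
 * starting at `addr` touches more than one cache line.
 * `line` is the cache line size in bytes, a nonzero power of two. */
static int crosses_line(uintptr_t addr, size_t size, size_t line)
{
    assert(size != 0 && line != 0 && (line & (line - 1)) == 0);
    uintptr_t first = addr / line;              /* line of the first byte */
    uintptr_t last  = (addr + size - 1) / line; /* line of the last byte  */
    return first != last;
}
```

With 64-byte lines, an 8-byte access at address 56 stays inside one
line, while the same access at address 60 spans two lines. Counting
such accesses in a hot loop gives an idea of how much could be gained
by aligning the data.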

For me loads are much more important. First, there are more of
them. Second, stores can be buffered and the latency of the store itself
is of little importance (latency from store to load matters).
For loads, extra things in the load path increase latency and that
may limit program speed.
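
As a rough software model of the "two buffers plus shifter" idea, the
C sketch below assembles an unaligned 64-bit load from two aligned
64-bit words. It is an illustration only and says nothing about how
any real microarchitecture is built. The function name
load_u64_via_shifter is hypothetical, and the model assumes
little-endian byte numbering within each word.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: read 64 bits starting at byte offset `byte_off`
 * from an array of aligned 64-bit words. Byte k of the stream lives in
 * bits 8*(k%8) .. 8*(k%8)+7 of words[k/8] (little-endian numbering).
 * When byte_off is not a multiple of 8, words[byte_off/8 + 1] must
 * also be readable. */
static uint64_t load_u64_via_shifter(const uint64_t *words, size_t byte_off)
{
    assert(words != NULL);
    size_t idx = byte_off / 8;
    unsigned shift = (unsigned)(byte_off % 8) * 8;

    uint64_t lo = words[idx];
    if (shift == 0)
        return lo;              /* aligned case: one word, no shifting */

    uint64_t hi = words[idx + 1];
    /* Two reads, two shifts and a merge: the extra work in the load path. */
    return (lo >> shift) | (hi << (64 - shift));
}
```

The aligned case needs one word and no shifting, while the unaligned
case needs a second read plus the shift-and-merge step. In hardware
that step would sit on the load path for every access, which is where
the latency concern comes from.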
-- Waldek Hebisch