Subject: Re: Cost of handling misaligned access
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: 06. Feb 2025, 11:59:39
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Feb6.115939@mips.complang.tuwien.ac.at>
References: 1 2 3 4 5 6 7
User-Agent: xrn 10.11
Michael S <already5chosen@yahoo.com> writes:
>>This resulted in just 12 instructions to handle 32 tests.
>
>That sounds suboptimal.
>By unrolling outer loop by 2 or 3 you can greatly reduce the number of
>memory accesses per comparison.
Looking at the inner loop code shown in
<2025Feb6.113049@mips.complang.tuwien.ac.at>, the 12 instructions do
not include the loop overhead and are already unrolled by a factor of
4 (32 for the scalar code). The loop overhead is 3 instructions, for
a total of 15 instructions per iteration.
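
For illustration only, here is a minimal scalar sketch of the outer-loop
unrolling suggested in the quoted text. The functions count_pairs and
count_pairs_unroll2 and the test (a[i] & b[j]) == 0 are hypothetical
stand-ins and not the code under discussion. In the unrolled variant each
loaded b[j] serves two comparisons, so the inner loop performs one load
per two comparisons instead of one per comparison.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline (hypothetical stand-in): one load of b[j] per comparison. */
size_t count_pairs(const uint32_t *a, size_t na,
                   const uint32_t *b, size_t nb)
{
    size_t count = 0;
    for (size_t i = 0; i < na; i++) {
        uint32_t ai = a[i];
        for (size_t j = 0; j < nb; j++)
            count += (ai & b[j]) == 0;
    }
    return count;
}

/* Outer loop unrolled by 2: each b[j] is loaded once and reused for
   two comparisons, halving the loads per comparison. */
size_t count_pairs_unroll2(const uint32_t *a, size_t na,
                           const uint32_t *b, size_t nb)
{
    size_t count = 0;
    size_t i = 0;
    for (; i + 1 < na; i += 2) {
        uint32_t a0 = a[i], a1 = a[i + 1];
        for (size_t j = 0; j < nb; j++) {
            uint32_t bj = b[j];
            count += (a0 & bj) == 0;
            count += (a1 & bj) == 0;
        }
    }
    /* Remaining element when na is odd, handled by the plain loop. */
    if (i < na) {
        uint32_t ai = a[i];
        for (size_t j = 0; j < nb; j++)
            count += (ai & b[j]) == 0;
    }
    return count;
}
```

Whether the reduced number of loads translates into a speedup depends on
whether loads are the bottleneck on the given microarchitecture.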
>The speed up would depend on specific
>microarchitecture, but I would guess that at least 1.2x speedup is here.
Even if you completely eliminate the loop overhead, the number of
instructions is reduced by at most a factor of 1.25 (15/12), and I
expect that the speedup from further unrolling is a factor of at most 1
on most CPUs (a factor <1 can come from handling the remaining elements
slowly, which does not seem unlikely for code coming out of gcc and clang).
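
The instruction-count bound above can be sketched as a small calculation.
The function name instr_reduction and its parameters are made up for this
illustration. It assumes that additional unrolling by a factor u leaves the
body instructions per unit of work unchanged and only spreads the loop
overhead over u times as much work.

```c
/* Factor by which the instruction count per unit of work shrinks when a
   loop with `body` useful instructions and `overhead` loop-control
   instructions per iteration is unrolled further by a factor `u`. */
double instr_reduction(double body, double overhead, double u)
{
    return (body + overhead) / (body + overhead / u);
}
```

With body = 12 and overhead = 3, u = 2 gives 15/13.5, about 1.11, and the
value approaches 15/12 = 1.25 as u grows. This bounds the reduction in
executed instructions, not the speedup, which also depends on how many
instructions per cycle the CPU sustains.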
- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>