Subject : Re: Cost of handling misaligned access
From : already5chosen (at) *nospam* yahoo.com (Michael S)
Newsgroups : comp.arch
Date : 06. Feb 2025, 14:28:08
Organisation : A noiseless patient Spider
Message-ID : <20250206152808.0000058f@yahoo.com>
References : 1 2 3 4 5 6 7 8
User-Agent : Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
On Thu, 06 Feb 2025 10:59:39 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> Michael S <already5chosen@yahoo.com> writes:
>>> This resulted in just 12 instructions to handle 32 tests.
>>
>> That sounds suboptimal.
>> By unrolling outer loop by 2 or 3 you can greatly reduce the number
>> of memory accesses per comparison.
> Looking at the inner loop code shown in
> <2025Feb6.113049@mips.complang.tuwien.ac.at>, the 12 instructions do
> not include the loop overhead and are already unrolled by a factor of
> 4 (32 for the scalar code). The loop overhead is 3 instructions, for
> a total of 15 instructions per iteration.
>> The speedup would depend on the specific
>> microarchitecture, but I would guess that at least a 1.2x speedup is
>> attainable here.
> Even if you completely eliminate the loop overhead, the number of
> instructions is reduced by at most a factor 1.25, and I expect that
> the speedup from further unrolling is a factor of at most 1 on most
> CPUs (factor <1 can come from handling the remaining elements slowly,
> which does not seem unlikely for code coming out of gcc and clang).
> - anton
The point of my proposal is not a reduction of the loop overhead, and
not a reduction of the # of x86 instructions (in fact, with my proposal
the # of x86 instructions increases), but a reduction in the # of uOps
due to reuse of loaded values.
The theory behind it is that in code with very high IPC, like the code
above, the main bottleneck is most typically the # of uOps that flows
through the rename stage.
Not counting loop overhead, the original 1x4 inner loop consists of 12
instructions, 16 uOps. Suppose we replace it with a 2x2 inner loop that
does the same amount of work. The new inner loop contains only
RISC-like instructions - 14 instructions, 14 uOps.
With a 3x2 inner loop there are 20 instructions, 20 uOps, and 1.5x more
work done per iteration.
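To illustrate the load-reuse idea in scalar terms (the function names
and the pairwise-equality kernel below are my own sketch, not the
actual benchmark code): unrolling the outer loop by 2 lets a single
load of a[j] feed two comparisons, roughly halving the load uOps per
comparison.

```c
#include <stddef.h>

/* Baseline: count equal pairs over the lower triangle.
   One load of a[j] per comparison. */
static size_t count_pairs(const int *a, size_t n)
{
    size_t count = 0;
    for (size_t i = 1; i < n; i++)
        for (size_t j = 0; j < i; j++)
            count += (a[i] == a[j]);
    return count;
}

/* Outer loop unrolled by 2: rows i and i+1 are processed together,
   so each load of a[j] now feeds two comparisons. */
static size_t count_pairs_2x(const int *a, size_t n)
{
    size_t count = 0;
    size_t i = 1;
    for (; i + 1 < n; i += 2) {
        int x0 = a[i], x1 = a[i + 1];
        for (size_t j = 0; j < i; j++) {
            int y = a[j];            /* one load, two compares */
            count += (x0 == y);
            count += (x1 == y);
        }
        count += (x1 == x0);         /* the pair inside the strip */
    }
    if (i < n)                       /* leftover row when n is even */
        for (size_t j = 0; j < i; j++)
            count += (a[i] == a[j]);
    return count;
}
```

A 3x2 variant would carry three row values (x0, x1, x2) through the
same inner loop, amortizing each load over three comparisons at the
cost of more live registers.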
Another factor that can contribute to a speedup is the increased number
of iterations in the inner loop - from 1..7 iterations in the original
to 1..15 in both of my variants mentioned above.
Yet another possibility is to follow the "work harder, not smarter"
principle, i.e. process the whole square rather than just the relevant
triangle. The main gain is that the loop detector would be able to
predict the # of iterations in the inner loop, avoiding a mispredicted
branch at the end. If we follow this path then it probably makes sense
not to unroll the inner loop beyond a SIMD factor of 8 and instead to
unroll the outer loop by 4.
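A scalar sketch of that trade-off (hypothetical name, assuming a
pairwise-equality kernel as an illustration): the square version
executes roughly twice the comparisons, but the inner loop's trip
count is fixed at n, so a loop-count predictor can predict its exit
branch, whereas a triangular loop's trip count varies with i and
typically costs one misprediction per row.

```c
#include <stddef.h>

/* "Work harder, not smarter": scan the full n x n square instead of
   the lower triangle.  Roughly 2x the comparisons are executed, but
   the inner loop runs exactly n iterations every time, making its
   exit branch predictable by a loop-count predictor. */
static size_t count_pairs_square(const int *a, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)   /* fixed trip count */
            count += (j != i) && (a[i] == a[j]);
    return count / 2;   /* each unordered pair was counted twice */
}
```

Whether the saved mispredictions pay for the doubled comparison work
depends on n and on how cheap the SIMD compares are per iteration.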
Going by intuition, in this particular application "smarter" wins
over "harder", but we know that intuition sucks. Including mine :(