Michael S <already5chosen@yahoo.com> writes:
>On Thu, 06 Feb 2025 10:59:39 GMT
>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>>Michael S <already5chosen@yahoo.com> writes:
>>>>This resulted in just 12 instructions to handle 32 tests.
>>>That sounds suboptimal.
>>>By unrolling the outer loop by 2 or 3 you can greatly reduce the
>>>number of memory accesses per comparison.
>The point of my proposal is not reduction of loop overhead and not
>reduction of the # of x86 instructions (in fact, with my proposal
>the # of x86 instructions is increased), but reduction of the # of uOps
>due to reuse of loaded values.
>The theory behind it is that, most typically, in code with very high
>IPC like the one above, the main bottleneck is the # of uOps that
>flow through the rename stage.
>
>Not counting loop overhead, the original 1x4 inner loop consists of 12
>instructions, 16 uOps. Suppose we replace it by a 2x2 inner loop that
>does the same amount of work. The new inner loop contains only RISC-like
>instructions - 14 instructions, 14 uOps.
>With a 3x2 inner loop there are 20 instructions, 20 uOps, and 1.5x more
>work done per iteration.
I completely missed the "outer" in your response. Yes, looking at the
original loop again:
let mut part1 = 0;
for l in li..keylocks.len() {
let lock = keylocks[l];
for k in 0..li {
let sum = lock & keylocks[k];
if sum == 0 {
part1 += 1;
}
}
}
you can reuse the keylocks[k] value by unrolling the outer loop.
E.g., if you unroll the outer loop by a factor of 4 (and the inner
loop only as far as needed for the SIMD width), you can use almost the
same code as clang produces, but you load keylocks[0..li] a quarter as
often. If the bottleneck of the inner-loop-only-optimized variant is
bandwidth to the memory subsystem, which is likely (it seems that
Terje Mathisen worked with up to 62500 values, i.e., 250KB, i.e., L2),
there may be quite a bit of speedup.
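That 4x outer unrolling could be sketched like this (hypothetical Rust,
not code from the thread; the function name count_fits and the remainder
handling are mine):

```rust
// Outer loop unrolled by 4: each keylocks[k] is loaded once and tested
// against four outer elements, so keylocks[0..li] is traversed a
// quarter as often as in the original 1-wide outer loop.
fn count_fits(keylocks: &[u32], li: usize) -> u32 {
    let mut part1 = 0;
    let n = keylocks.len();
    let mut l = li;
    while l + 4 <= n {
        let (a, b, c, d) =
            (keylocks[l], keylocks[l + 1], keylocks[l + 2], keylocks[l + 3]);
        for k in 0..li {
            let key = keylocks[k]; // loaded once, reused 4 times
            part1 += (a & key == 0) as u32;
            part1 += (b & key == 0) as u32;
            part1 += (c & key == 0) as u32;
            part1 += (d & key == 0) as u32;
        }
        l += 4;
    }
    // remaining 0..3 outer iterations, as in the original loop
    for ll in l..n {
        let lock = keylocks[ll];
        for k in 0..li {
            if lock & keylocks[k] == 0 {
                part1 += 1;
            }
        }
    }
    part1
}
```

The inner loop body is still what the compiler would vectorize; only the
traffic over keylocks[0..li] changes.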
E.g., for Zen5 the bandwidth from L2 is reported to be 32 bytes/cycle,
which would limit the original code to at least 4 cycles/iteration
(3.75 IPC), possibly more due to misalignment-handling overhead, and
using AVX-512 would not help. With reuse of the loaded values, the
limit would probably be execution resources instead, and AVX-512 would
see quite a bit of speedup over AVX2.
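The arithmetic behind those figures can be spelled out; note that the
3-instruction loop-overhead figure below is my assumption (to make the
12 test instructions come out at 3.75 IPC), not something stated in the
thread:

```rust
// Back-of-the-envelope check of the L2-bandwidth limit.
// Returns (load-bound cycles per inner-loop iteration, implied IPC).
fn l2_limit(tests_per_iter: u32, bytes_per_test: u32, l2_bytes_per_cycle: u32) -> (u32, f64) {
    // 32 tests * 4 bytes/key = 128 bytes streamed from L2 per iteration
    let cycles = tests_per_iter * bytes_per_test / l2_bytes_per_cycle;
    // 12 test instructions + an assumed 3 of loop overhead in those cycles
    let ipc = (12.0 + 3.0) / cycles as f64;
    (cycles, ipc)
}
```

With 32 tests/iteration, 4 bytes per loaded key, and 32 bytes/cycle from
L2, this gives 4 cycles/iteration and 3.75 IPC.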
>Another factor that can contribute to a speedup is the increased number
>of iterations in the inner loop - from 1..7 iterations in the original
>to 1..15 in both of my above-mentioned variants.
Yes. I actually don't see a reason to unroll the inner loop more than
needed for the SIMD instructions at hand, unless the number of
outer-loop iterations is too small. If you want more unrolling,
unroll the outer loop more.
>Yet another possibility is to follow the "work harder, not smarter"
>principle, i.e., process the whole square rather than just the relevant
>triangle.
I don't see a triangle in the code above. There may be an additional
outer loop involved that varies li from 0 to keylocks.len() or
something, but the code as presented processes a square.
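For contrast, a triangular traversal would look like this hypothetical
sketch, comparing every unordered pair within one array; the quoted code
instead walks the full li x (len - li) rectangle of key/lock pairs:

```rust
// Triangle: each unordered pair (i, j) with i < j is visited once,
// giving n*(n-1)/2 inner-body executions instead of a full n*n square.
fn count_pairs_triangle(vals: &[u32]) -> u32 {
    let mut count = 0;
    for i in 0..vals.len() {
        for j in (i + 1)..vals.len() {
            if vals[i] & vals[j] == 0 {
                count += 1;
            }
        }
    }
    count
}
```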
- anton