Michael S wrote:
> On Thu, 6 Feb 2025 17:47:30 +0100
> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>
>> Michael S wrote:
>>> The point of my proposal is not reduction of loop overhead and
>>> not reduction of the # of x86 instructions (in fact, with my
>>> proposal the # of x86 instructions is increased), but reduction
>>> in the # of uOps due to reuse of loaded values.
>>> The theory behind it is that, most typically, in code with very
>>> high IPC like the one above, the main bottleneck is the # of uOps
>>> that flow through the rename stage.
>>
>> Aha! I see what you mean: Yes, this would be better if the
>>
>>   VPAND reg,reg,[mem]
>>
>> instructions actually took more than one cycle each, but as the
>> size of the arrays was just 1000 bytes each (250 keys + 250
>> locks), everything fits easily in $L1. (BTW, I did try to add 6
>> dummy keys and locks just to avoid any loop end overhead, but that
>> actually ran slower.)
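
For reference, a minimal sketch of that inner loop in C with AVX2
intrinsics (build with -mavx2), assuming each lock/key has been packed
into a 32-bit mask so that a pair fits when (lock & key) == 0; the
layout and names here are illustrative, not necessarily what the Rust
version actually does:

#include <immintrin.h>
#include <stdint.h>

/* Count fitting lock/key pairs: broadcast one lock, then AND it against
   8 keys at a time straight from memory (the VPAND reg,reg,[mem] above). */
int count_fits(const uint32_t *locks, int nlocks,
               const uint32_t *keys, int nkeys)   /* nkeys: multiple of 8 */
{
    __m256i total = _mm256_setzero_si256();
    for (int l = 0; l < nlocks; l++) {
        __m256i lock = _mm256_set1_epi32((int)locks[l]);
        for (int k = 0; k < nkeys; k += 8) {
            __m256i key = _mm256_loadu_si256((const __m256i *)&keys[k]);
            __m256i overlap = _mm256_and_si256(lock, key);
            /* lanes with zero overlap compare equal, giving -1 per fit */
            __m256i fit = _mm256_cmpeq_epi32(overlap, _mm256_setzero_si256());
            total = _mm256_sub_epi32(total, fit);
        }
    }
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, total);
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (int)lanes[i];
    return sum;
}
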
>>
>> I've just tested it by running either 2 or 4 locks in parallel in
>> the inner loop: The fastest time I saw actually did drop a
>> smidgen, from 5800 ns to 5700 ns (for both 2 and 4 wide), with 100
>> ns being the timing resolution I get from the Rust run_benchmark()
>> function.
>>
>> So yes, it is slightly better to run a stripe instead of just a
>> single row in each outer loop.
>>
>> Terje
>
> Assuming that your CPU is new and runs at a decent frequency (4-4.5
> GHz), the results are 2-3 times slower than expected. I would guess
> that it happens because there are too few iterations in the inner
> loop. Turning the unrolling upside down, as I suggested in the
> previous post, should fix it.
> Very easy to do in C with intrinsics. Probably not easy in Rust.

I did mention that this is a (cheap) laptop? It is about 15 months
old, with a base frequency of 2.676 GHz. I guess that would explain
most of the difference between what I see and what you expected?
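
For what it's worth, a sketch of the "upside down" unrolling in C with
AVX2 intrinsics, under the same illustrative 32-bit-mask layout as the
sketch further up: each 8-key load is reused against four broadcast
locks, so one load uOp now feeds four VPANDs instead of one.

#include <immintrin.h>
#include <stdint.h>

int count_fits_stripe(const uint32_t *locks, int nlocks,  /* multiple of 4 */
                      const uint32_t *keys, int nkeys)    /* multiple of 8 */
{
    const __m256i zero = _mm256_setzero_si256();
    __m256i total = zero;
    for (int l = 0; l < nlocks; l += 4) {
        /* four locks stay resident in registers across the whole key array */
        __m256i l0 = _mm256_set1_epi32((int)locks[l + 0]);
        __m256i l1 = _mm256_set1_epi32((int)locks[l + 1]);
        __m256i l2 = _mm256_set1_epi32((int)locks[l + 2]);
        __m256i l3 = _mm256_set1_epi32((int)locks[l + 3]);
        for (int k = 0; k < nkeys; k += 8) {
            /* one load ... */
            __m256i key = _mm256_loadu_si256((const __m256i *)&keys[k]);
            /* ... reused by four ANDs, each followed by a fit test */
            total = _mm256_sub_epi32(total,
                        _mm256_cmpeq_epi32(_mm256_and_si256(l0, key), zero));
            total = _mm256_sub_epi32(total,
                        _mm256_cmpeq_epi32(_mm256_and_si256(l1, key), zero));
            total = _mm256_sub_epi32(total,
                        _mm256_cmpeq_epi32(_mm256_and_si256(l2, key), zero));
            total = _mm256_sub_epi32(total,
                        _mm256_cmpeq_epi32(_mm256_and_si256(l3, key), zero));
        }
    }
    uint32_t lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, total);
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += (int)lanes[i];
    return sum;
}
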
BTW, when I timed 1000 calls to that 5-6 us program, to get around
the 100 ns timer resolution, each iteration ran in 5.23 us.
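
In C, that amortization is nothing fancier than the sketch below;
solve() here is just a stand-in for the real lock/key routine, not the
actual Rust harness:

#include <stdio.h>
#include <time.h>

/* Stand-in for the ~5 us solve; any real work goes here. */
static int solve(void)
{
    volatile int x = 0;
    for (int i = 0; i < 100000; i++)
        x += i;
    return x;
}

int main(void)
{
    const int N = 1000;            /* enough calls to swamp the 100 ns tick */
    volatile int sink = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N; i++)
        sink += solve();           /* keep the result live */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("%.2f us per call (sink=%d)\n", ns / N / 1000.0, (int)sink);
    return 0;
}
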
Terje