Michael S wrote:
> On Thu, 6 Feb 2025 17:47:30 +0100
> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>
>> Terje Mathisen wrote:
>>> Michael S wrote:
>>>> The point of my proposal is not reduction of loop overhead and not
>>>> reduction of the # of x86 instructions (in fact, with my proposal
>>>> the # of x86 instructions is increased), but reduction of the # of
>>>> uOps, due to reuse of loaded values.
>>>> The theory behind it is that, most typically, in code with very
>>>> high IPC like the one above, the main bottleneck is the # of uOps
>>>> that flow through the rename stage.
>>>
>>> Aha! I see what you mean: Yes, this would be better if the
>>>
>>>   VPAND reg,reg,[mem]
>>>
>>> instructions actually took more than one cycle each, but as the
>>> size of the arrays was just 1000 bytes each (250 keys + 250
>>> locks), everything fits easily in $L1. (BTW, I did try to add 6
>>> dummy keys and locks just to avoid any loop-end overhead, but that
>>> actually ran slower.)
>>
>> I've just tested it by running either 2 or 4 locks in parallel in the
>> inner loop: The fastest time I saw actually did drop a smidgen, from
>> 5800 ns to 5700 ns (for both 2-wide and 4-wide), with 100 ns being
>> the timing resolution I get from the Rust run_benchmark() function.
>>
>> So yes, it is slightly better to run a stripe instead of just a
>> single row in each outer loop.
>>
>> Terje
>
> Assuming that your CPU is new and runs at a decent frequency (4-4.5
> GHz), the results are 2-3 times slower than expected. I would guess
> that it happens because there are too few iterations in the inner
> loop. Turning the unrolling upside down, as I suggested in the
> previous post, should fix it.
> Very easy to do in C with intrinsics. Probably not easy in Rust.

I did mention that this is a (cheap) laptop? It is about 15 months old,
and with a base frequency of 2.676 GHz. I guess that would explain most
of the difference between what I see and what you expected?
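To illustrate the "stripe" idea under discussion, here is a scalar C sketch; it is my own reconstruction, not code from the thread. It assumes the kernel is a pairwise compatibility count over bitmasks, where a key and a lock are compatible when `(key & lock) == 0`. The real code uses 256-bit `VPAND` over packed masks; plain `uint64_t` is used here to keep the sketch portable. The function names and the `nlocks % 4 == 0` restriction are my assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Baseline: one lock per outer iteration. Every inner iteration
 * loads keys[k], ANDs it with the lock, and counts: one load uOp
 * per AND. */
size_t count_pairs_1wide(const uint64_t *keys, size_t nkeys,
                         const uint64_t *locks, size_t nlocks)
{
    size_t count = 0;
    for (size_t l = 0; l < nlocks; l++) {
        uint64_t lock = locks[l];
        for (size_t k = 0; k < nkeys; k++)
            count += ((keys[k] & lock) == 0);
    }
    return count;
}

/* Striped version: four locks stay live in registers across the
 * inner loop, so each keys[k] load is reused for four ANDs,
 * cutting the load uOps per AND roughly 4x. Assumes nlocks is a
 * multiple of 4 to keep the sketch short. */
size_t count_pairs_4wide(const uint64_t *keys, size_t nkeys,
                         const uint64_t *locks, size_t nlocks)
{
    size_t count = 0;
    for (size_t l = 0; l < nlocks; l += 4) {
        uint64_t l0 = locks[l],     l1 = locks[l + 1];
        uint64_t l2 = locks[l + 2], l3 = locks[l + 3];
        for (size_t k = 0; k < nkeys; k++) {
            uint64_t key = keys[k];   /* loaded once, used 4 times */
            count += ((key & l0) == 0);
            count += ((key & l1) == 0);
            count += ((key & l2) == 0);
            count += ((key & l3) == 0);
        }
    }
    return count;
}
```

The same reuse argument carries over to the vector version: with `VPAND reg,reg,[mem]` the load is fused into the AND, so the win shows up as fewer total uOps through rename rather than fewer cache accesses, which matches the small (5800 ns to 5700 ns) improvement reported above when everything already fits in L1.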