Terje Mathisen wrote:
> Michael S wrote:
>> The point of my proposal is not reduction of loop overhead, and not
>> reduction of the # of x86 instructions (in fact, with my proposal
>> the # of x86 instructions is increased), but reduction in the # of
>> uOps due to reuse of loaded values.
>>
>> The theory behind it is that, most typically, in code with very high
>> IPC like the one above, the main bottleneck is the # of uOps that
>> flow through the rename stage.
>
> Aha! I see what you mean: yes, this would be better if the
>
>   VPAND reg,reg,[mem]
>
> instructions actually took more than one cycle each, but since the
> size of the arrays was just 1000 bytes each (250 keys + 250 locks),
> everything fits easily in $L1. (BTW, I did try to add 6 dummy keys
> and locks just to avoid any loop-end overhead, but that actually
> ran slower.)
I've just tested it by running either 2 or 4 locks in parallel in the
inner loop: the fastest time I saw did drop a smidgen, from 5800 ns
to 5700 ns (for both the 2-wide and 4-wide versions), with 100 ns
being the timing resolution I get from the Rust run_benchmark()
function.

So yes, it is slightly better to process a stripe of locks instead of
just a single row in each outer-loop iteration.
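For anyone following along, here is a minimal scalar sketch of the
stripe idea in Rust. It is an assumption-laden illustration, not the
actual SIMD/VPAND code under discussion: keys and locks are assumed to
be packed into u64 bitmasks, and the function name count_fits is made
up. The point it shows is the same, though: each loaded key stays in a
register and is reused across four locks per inner-loop iteration,
instead of being consumed by a single AND.

```rust
// Hypothetical sketch: reuse each loaded key across a 4-wide stripe
// of locks. The u64-bitmask representation and all names here are
// assumptions for illustration, not the code measured above.
fn count_fits(keys: &[u64], locks: &[u64]) -> u32 {
    let mut fits = 0;
    for &k in keys {
        // 4-wide inner loop: `k` is loaded once and reused for
        // four AND-and-test operations per iteration.
        let mut chunks = locks.chunks_exact(4);
        for c in &mut chunks {
            fits += (k & c[0] == 0) as u32
                  + (k & c[1] == 0) as u32
                  + (k & c[2] == 0) as u32
                  + (k & c[3] == 0) as u32;
        }
        // Handle any leftover locks one at a time.
        for &l in chunks.remainder() {
            fits += (k & l == 0) as u32;
        }
    }
    fits
}

fn main() {
    // Tiny smoke test: a key "fits" a lock when their bits don't overlap.
    let keys = [0b0011u64];
    let locks = [0b1100u64, 0b0100, 0b0010, 0b1000, 0b0001];
    println!("{}", count_fits(&keys, &locks)); // 3 of the 5 locks fit
}
```

With 250 keys and 250 locks the whole working set is tiny, which is
consistent with the observation above that the win from the stripe is
small once everything sits in $L1.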
Terje