Michael S wrote:
>> I've just tested it by running either 2 or 4 locks in parallel in the
>> inner loop: The fastest time I saw actually did drop a smidgen, from
>> 5800 ns to 5700 ns (for both 2 and 4 wide), with 100 ns being the
>> timing resolution I get from the Rust run_benchmark() function.
>
> The point of my proposal is not reduction of loop overhead and not
> reduction of the # of x86 instructions (in fact, with my proposal the #
> of x86 instructions is increased), but reduction in # of uOps due to
> reuse of loaded values.
> The theory behind it is that most typically, in code with very high
> IPC like the one above, the main bottleneck is the # of uOps that flows
> through the rename stage.

Aha! I see what you mean: Yes, this would be better if the

  VPAND reg,reg,[mem]

instructions actually took more than one cycle each, but as the size of
the arrays was just 1000 bytes each (250 keys + 250 locks), everything
fits easily in $L1. (BTW, I did try to add 6 dummy keys and locks just
to avoid any loop end overhead, but that actually ran slower.)
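For anyone following along, here is a minimal Rust sketch of the kind of inner loop being discussed. The names (count_fits, the masks) and the 2-wide unroll are my assumptions, not the posted benchmark: each key/lock is modeled as a bitmask, a key fits a lock when the masks share no set bits, and the unroll lets the compiler keep the key in a register while each lock load folds into the AND (the VPAND reg,reg,[mem] form when vectorized).

```rust
// Hypothetical reconstruction of the discussed inner loop; not the
// actual posted code. A key fits a lock when key & lock == 0.
fn count_fits(keys: &[u32], locks: &[u32]) -> u32 {
    let mut fits = 0;
    for &key in keys {
        // 2-wide unroll, mirroring the "2 locks in parallel" variant:
        // `key` stays in a register and is reused across both ANDs
        // (the loaded-value reuse Michael S describes).
        let mut chunks = locks.chunks_exact(2);
        for pair in &mut chunks {
            fits += (key & pair[0] == 0) as u32;
            fits += (key & pair[1] == 0) as u32;
        }
        // Handle an odd trailing lock.
        for &lock in chunks.remainder() {
            fits += (key & lock == 0) as u32;
        }
    }
    fits
}

fn main() {
    let keys = [0b00011u32, 0b00100];
    let locks = [0b11100u32, 0b00011, 0b00000];
    println!("{}", count_fits(&keys, &locks));
}
```

In the real benchmark the arrays would hold 250 keys and 250 locks (1000 bytes each as packed masks), which is why everything sits comfortably in L1.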