mitchalsup@aol.com (MitchAlsup1) writes:
We still have a 2 cycle loop recurrence, so even if we could perform
I do not see 2 LDDs being performed in parallel unless the execution
width is at least 14-wide. In any event loop recurrence restricts the
overall retirement to 0.5 LDDs per cycle--it is the recurrence that
feeds the iterations (i.e., retirement).
Yes. But with loads that take longer than two cycles (very common in
OoO microarchitectures even for L1 hits), the second load starts
before the first finishes. And in the case where the branchy version
is profitable (when the load latency is longer than the misprediction
penalty due to cache misses), many loads will start before the first
finishes (most of them will be canceled due to misprediction, but even
an average of two useful parallel loads produces a good speedup).
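To put rough numbers on that argument, here is a toy cost model in C. It is a sketch, not a simulator: the struct, the function names and all parameter values are assumptions made for illustration. The only figure taken from the thread is the 2-cycle recurrence. It assumes a correctly predicted iteration costs only the recurrence (its load overlaps with later ones), while a mispredicted iteration waits for its load and then pays the misprediction penalty.

```c
#include <stdint.h>

/* Toy cost model; every field is an illustrative assumption. */
typedef struct {
    uint64_t load_latency;       /* cycles from load issue to data available */
    uint64_t mispredict_penalty; /* cycles lost once a wrong branch resolves */
    uint64_t recurrence;         /* cycles per correctly predicted iteration */
    uint64_t select_latency;     /* cycles for the CMOV-style select */
} chain_params;

/* Branchless: each load depends on the previous load and select,
   so the full latency is paid on every iteration. */
static uint64_t branchless_cycles(chain_params p, uint64_t iterations)
{
    return iterations * (p.load_latency + p.select_latency);
}

/* Branchy: predicted iterations overlap and cost only the recurrence;
   a mispredicted one waits for its load, then pays the penalty. */
static uint64_t branchy_cycles(chain_params p, uint64_t iterations,
                               uint64_t mispredicts)
{
    if (mispredicts > iterations)
        mispredicts = iterations;   /* clamp nonsensical input */
    uint64_t predicted = iterations - mispredicts;
    return predicted * p.recurrence
         + mispredicts * (p.load_latency + p.mispredict_penalty);
}
```

With a hypothetical cache-miss latency of 200 cycles, a 20-cycle penalty and half the branches mispredicted, the branchy version comes out ahead in this model; with a 4-cycle L1 hit it does not.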
[EricP:]
That is difficult with renaming. In order for the later instructions
[*] I want to see the asm because Intel's CMOV always executes the
operand operation, then tosses the result if the predicate is false.
Use a less-stupid ISA
The ISA does not require that. It could just as well be implemented
as waiting for the condition, and only then perform the operation.
And with a more sophisticated implementation one could even do that
for operations that are not part of the CMOV instruction, but produce
one of the source operands of the CMOV instruction. However,
apparently such implementations have enough disadvantages (probably in
performance) that nobody has gone there AFAIK. AFAIK everyone,
including implementations of different ISAs, implements
CMOV/predication as performing the operation and then conditionally
squashing the result.
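The two implementation strategies can be modelled in C as follows. This is only a behavioural sketch: the function names and the load counter are hypothetical, a load stands in for "the operation", and nothing here says how any real core is built. Both functions return the same value; they differ only in whether the operation runs when the condition is false.

```c
#include <stdint.h>

/* Counts how often the modelled operand operation (a load) runs. */
static unsigned loads_performed = 0;

static int64_t do_load(const int64_t *addr)
{
    loads_performed++;
    return *addr;
}

/* Perform the operation, then conditionally squash the result. */
static int64_t cmov_execute_then_squash(int cond, int64_t old_value,
                                        const int64_t *addr)
{
    int64_t loaded = do_load(addr);     /* runs regardless of cond */
    return cond ? loaded : old_value;   /* result dropped if cond is false */
}

/* Wait for the condition, and only then perform the operation. */
static int64_t cmov_wait_for_condition(int cond, int64_t old_value,
                                       const int64_t *addr)
{
    if (!cond)
        return old_value;               /* operation never starts */
    return do_load(addr);
}
```

In this model the second variant saves the load when the condition is false, but cannot start it until the condition is known, which is the latency cost hinted at above.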
- anton