Subject : Re: Efficiency of in-order vs. OoO
From : paaronclayton (at) *nospam* gmail.com (Paul A. Clayton)
Newsgroups : comp.arch
Date : 09. Mar 2024, 12:27:33
Organization : A noiseless patient Spider
Message-ID : <ushh35$29t5q$1@dont-email.me>
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.0
On 3/8/24 11:01 PM, MitchAlsup1 wrote:
> Paul A. Clayton wrote:
[snip]
>> For memory reads, the late failure generated by an uncorrectable
>> ECC error would probably have to be handled differently or there
>> would probably be little opportunity to exploit out-of-order
>> retirement. It might not be entirely unreasonable to treat such as
>> a fatal thread error that is asynchronous.
>
> What about for memory stores where the ECC check on the delivered
> data fails ?? This seems to be just as fatal as a LD with an ECC fail.
Handling stores asynchronously is less artificial, since a store
error is often only detected later (e.g., on the next read of that
location). An ECC correction/parity check that fails when evicting
a line from L1 is similarly asynchronous with respect to the
store(s) that made that line dirty.
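
To make the asynchrony concrete, here is a minimal sketch of a toy write-back cache that only checks parity at eviction. Everything in it (`ToyL1`, `flip_bit`, the single parity bit per line) is a hypothetical model for illustration, not any real L1 design.

```python
def parity(value: int) -> int:
    """Return 1 if the number of set bits in value is odd, else 0."""
    return bin(value).count("1") & 1


class ToyL1:
    """Hypothetical write-back cache with one parity bit per line."""

    def __init__(self):
        self.lines = {}        # address -> [data, parity_bit, dirty]
        self.store_count = 0   # number of stores completed so far

    def store(self, addr: int, data: int) -> None:
        # The store completes (and could retire) once the line is written.
        self.lines[addr] = [data, parity(data), True]
        self.store_count += 1

    def flip_bit(self, addr: int, bit: int) -> None:
        # Models a fault that corrupts the data after the store completed.
        self.lines[addr][0] ^= 1 << bit

    def evict(self, addr: int) -> dict:
        # Parity is only checked here, after the dirtying store is long gone.
        data, stored_parity, dirty = self.lines.pop(addr)
        return {
            "addr": addr,
            "ok": parity(data) == stored_parity,
            "dirty": dirty,
            "stores_completed_before_detection": self.store_count,
        }


cache = ToyL1()
cache.store(0x40, 0b1010)
cache.store(0x80, 0b0111)   # unrelated later store, completes normally
cache.flip_bit(0x40, 0)     # fault in the already-dirty line
report = cache.evict(0x40)
print(report)
```

By the time the eviction reports the failure, other stores have already completed, so the error cannot be tied to a precise point in the instruction stream.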
Yes, it would be better to have more information about when and
even how the error occurred, and to have that information sooner.
This would seem to be particularly useful/important in
distinguishing fully transient and rare errors in the memory system
from more persistent errors in the store functionality of the core.
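
As a rough illustration of why earlier and richer error information helps, the sketch below classifies error reports that carry the originating core and the address. The heuristic, the threshold, and the name `classify_errors` are all assumptions made up for this example; real RAS policies are more involved.

```python
from collections import Counter


def classify_errors(reports, persistent_threshold=3):
    """Classify cores from (core_id, address) error reports.

    Assumed heuristic: repeated errors from one core spread over several
    distinct addresses point at that core's store path, while isolated
    errors are treated as transient memory-system faults.
    """
    per_core = Counter(core for core, _ in reports)
    addrs_per_core = {}
    for core, addr in reports:
        addrs_per_core.setdefault(core, set()).add(addr)

    verdict = {}
    for core, count in per_core.items():
        distinct = len(addrs_per_core[core])
        if count >= persistent_threshold and distinct > 1:
            verdict[core] = "persistent (suspect core store path)"
        else:
            verdict[core] = "transient (memory system)"
    return verdict


reports = [(0, 0x1000), (1, 0x2000), (1, 0x2040), (1, 0x9000)]
result = classify_errors(reports)
print(result)
```

Without knowing which core performed the dirtying store, the `core_id` field would be unavailable and this kind of distinction could not be made.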
With precise exceptions for such errors, the thread could be
transferred to another core. (Transferring a thread in out-of-
order state would be possible but would be less simple than
recovering at a checkpoint.)
Quasi-retirement that is out-of-order while maintaining precise
exceptions would be possible with coarse-grained checkpointing.
"Out-of-Order Commit Processors" (Adrian Cristal et al., 2004) even
called this "commit", which I view as inaccurate since a rollback
is still possible. There might also be a benefit in involving
software in the checkpointing (to moderate the state-preserving
overhead) and even in the replay process. Sometimes failing a task
makes more sense than retrying from the middle, e.g., when the
retry would deliver a late result, or when the result is already
good enough in the broader context and the resources could be
spent more profitably than on retrying to get a slightly better
result. Cloud-targeted software already offloads some RAS
management from hardware; such offloading might perhaps be
generalized further.
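
A minimal sketch of the idea, assuming a single coarse-grained checkpoint and a software hook that decides between replaying and failing the task. `CheckpointedCore`, `quasi_retire`, and `recovery_policy` are hypothetical names for illustration only and are not taken from the cited paper.

```python
import copy


class CheckpointedCore:
    """Toy model: register state plus one coarse-grained checkpoint."""

    def __init__(self, regs):
        self.regs = dict(regs)
        self.checkpoint = copy.deepcopy(self.regs)
        self.pending = []   # results written since the last checkpoint

    def take_checkpoint(self):
        # Only at this point does earlier work become irrevocable.
        self.checkpoint = copy.deepcopy(self.regs)
        self.pending.clear()

    def quasi_retire(self, dest, value, faulted=False):
        # Results may update state in any order; a late fault is only noted.
        self.regs[dest] = value
        self.pending.append((dest, value, faulted))

    def has_fault(self):
        return any(faulted for _, _, faulted in self.pending)

    def rollback(self):
        # Restore the precise state as of the last checkpoint.
        self.regs = copy.deepcopy(self.checkpoint)
        self.pending.clear()


def recovery_policy(time_left, replay_cost, good_enough):
    """Hypothetical software hook: is replaying from the checkpoint worthwhile?"""
    if good_enough or replay_cost > time_left:
        return "fail_task"
    return "replay"


core = CheckpointedCore({"r1": 0, "r2": 0})
action = "none"
core.quasi_retire("r2", 7)                 # younger instruction finishes first
core.quasi_retire("r1", 5, faulted=True)   # older instruction reports a late error
if core.has_fault():
    core.rollback()                        # state is again as of the checkpoint
    action = recovery_policy(time_left=10, replay_cost=40, good_enough=False)
print(core.regs, action)
```

Because the rollback discards the "retired" write to `r2`, calling this stage "commit" would be misleading. Here the policy chooses to fail the task because the replay would finish after the deadline.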