Subject: Re: MM instruction and the pipeline
From: paaronclayton (at) *nospam* gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Date: 20 Oct 2024, 16:57:59
Organization: A noiseless patient Spider
Message-ID: <vf39a8$fr1q$1@dont-email.me>
References: 1 2 3
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.0
On 10/16/24 5:14 PM, MitchAlsup1 wrote:
On Wed, 16 Oct 2024 20:48:39 +0000, Paul A. Clayton wrote:
>
Here is a question that I will leave to Mitch:
>
Can a MM that has confirmed permissions commit before it has been
performed such that uncorrectable errors would be recognized not
on read of the source but on later read of the destination?
A Memory Move does not <necessarily> read the destination. In
order to make the data transfers occur in cache line sizes,
the first and the last line may be read, but the intermediate
ones are not read (from DRAM) only to be re-written. An
implementation with byte write enables might not read any
of the destination lines.
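
To make the first-line/last-line point concrete, here is a minimal Python sketch (not from the thread) that lists which destination lines a copy only partially overwrites. It assumes 64-byte cache lines and no byte write enables; `LINE_BYTES` and `partial_dest_lines` are illustrative names, not part of any real ISA or library.

```python
LINE_BYTES = 64  # assumed cache line size; the thread does not specify one

def partial_dest_lines(dst, length, line_bytes=LINE_BYTES):
    """Return the line-aligned addresses of destination lines that the copy
    only partially overwrites. Without byte write enables, these are the
    lines that would have to be read before being written."""
    if length <= 0:
        return []
    end = dst + length                          # one past the last byte written
    first = dst - dst % line_bytes              # line holding the first byte
    last = (end - 1) - (end - 1) % line_bytes   # line holding the last byte
    partial = []
    # First line is partial if the copy starts mid-line, or if the whole
    # copy sits inside one line and stops short of that line's end.
    if dst != first or (first == last and end != first + line_bytes):
        partial.append(first)
    # Last line (when distinct) is partial if the copy stops short of its end.
    if last != first and end != last + line_bytes:
        partial.append(last)
    return partial
```

For example, `partial_dest_lines(10, 200)` reports only lines 0 and 192; the lines at 64 and 128 are fully overwritten and need no read.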
I was referring to a following instruction reading the
destination.
Then there is the issue with uncorrectable errors at the
receiving cache. The current protocol has the sender (core)
not release its write buffer until LLC has replied that
the data arrived without ECC trouble. Thus, the instruction
causing the latent uncorrectable error is not retired until
the data has arrived successfully at LLC.
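
The hold-until-acknowledged behavior described above can be sketched as a toy model in Python. This is only an illustration under assumptions: `WriteBuffer`, `send`, `on_reply`, and `can_retire` are hypothetical names, and returning the held data for a possible resend is my guess at what keeping the buffer makes possible, not something stated in the thread.

```python
class WriteBuffer:
    """Toy model: outgoing lines stay buffered until the LLC confirms
    they arrived without ECC trouble."""

    def __init__(self):
        self.pending = {}  # line address -> data still awaiting a reply

    def send(self, addr, data):
        # Keep a copy of the line; it is not released on send.
        self.pending[addr] = data

    def on_reply(self, addr, ecc_ok):
        if ecc_ok:
            del self.pending[addr]   # clean arrival: release the entry
            return None
        # Bad arrival: the entry is still held (assumption: for a resend).
        return self.pending[addr]

    def can_retire(self):
        # The instruction is not retired while any line is unacknowledged.
        return not self.pending
```

The point of the sketch is only the ordering: retirement is gated on the last clean reply, so a transfer error is still attributable to the instruction that caused it.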
I was thinking primarily about uncorrectable errors in the
source. It would be convenient for software if MM failed early on
an uncorrectable source error. It would be a little less
convenient (and possibly more complex in hardware) to generate an
ECC exception at the end of the MM instruction (or when it
pauses from a context switch).
With source-signaled errors, MM might be used to scrub memory
(assuming the microarchitecture did not optimize out copies
onto self as nops☺).
Not signaling the error until the destination is read some time
later prevents software from assuming the copy was correct at the
time of copying, but allows a copy to commit once all permissions
have been verified.
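
The two reporting options can be contrasted with a small Python model. It is a sketch under assumptions: memory is a list of lines, `POISON` stands in for a line whose ECC check failed, and `mm_fail_early`, `mm_deferred`, and `load` are hypothetical names rather than anything from an actual design.

```python
class UncorrectableError(Exception):
    pass

POISON = object()  # stand-in for a line with an uncorrectable error

def mm_fail_early(mem, src, dst, n):
    """Source-signaled policy: stop at the first bad source line.
    Lines before the fault have already been copied."""
    for i in range(n):
        if mem[src + i] is POISON:
            raise UncorrectableError(f"source line {src + i}")
        mem[dst + i] = mem[src + i]
    return n

def mm_deferred(mem, src, dst, n):
    """Deferred policy: the copy always completes and the poison
    travels with the data into the destination."""
    for i in range(n):
        mem[dst + i] = mem[src + i]
    return n

def load(mem, addr):
    """A later read is where the deferred policy reports the error."""
    if mem[addr] is POISON:
        raise UncorrectableError(f"line {addr}")
    return mem[addr]
```

With `mm_deferred`, software learns about the bad line only when `load` touches the corresponding destination line, which is exactly why it cannot assume the copy was correct at the time of copying.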
I could see some wanting to depend on the copy checking data
validity synchronously, but some might be okay with a
quasi-synchronous copy that allows the processor to continue doing work
outside of the MM.
As I mentioned before, yes, I intend to allow other instructions
to operate concurrently with MM, but I also expect MM to consume
all of L1 cache bandwidth. Just like LD L1-L2-miss operates
concurrently with FDIV.
For large copies, I could see having the copying done at L2 or
even L3 with distinct address generation (at least within a page,
possibly crossing page boundaries if associated with a prefetcher
that crosses page boundaries and so does address translation).
A stride-based prefetcher would have the address generation
capability to process the streaming access of MM.
If a translation map is provided for coherence, any MM could
commit once it is not speculative but before the actual copy has
been performed. Tracking what parts have been completed in the
presence of other stores would have significant overhead.
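
As a rough illustration of that tracking cost, here is a Python sketch of a copy that "commits" by recording a destination-to-source map and moves data later. It assumes whole-line stores and non-overlapping source and destination ranges; `LazyCopy` and its methods are hypothetical names, and a real translation map would look nothing like a dictionary.

```python
class LazyCopy:
    """Memory as a list of lines, plus a map from destination lines that
    have not been physically copied yet to the source lines holding
    their data."""

    def __init__(self, mem):
        self.mem = mem
        self.remap = {}  # pending dst line -> src line

    def _settle_readers(self, addr):
        # Before addr changes, finish every pending copy that reads it.
        # This scan is the overhead: every store has to pay for it.
        for d, s in list(self.remap.items()):
            if s == addr:
                self.mem[d] = self.mem[s]
                del self.remap[d]

    def mm(self, src, dst, n):
        # Commit without moving data (ranges assumed not to overlap).
        for i in range(n):
            s = self.remap.get(src + i, src + i)  # source may itself be pending
            self._settle_readers(dst + i)         # dst is about to change
            if s != dst + i:
                self.remap[dst + i] = s

    def load(self, addr):
        return self.mem[self.remap.get(addr, addr)]

    def store(self, addr, value):
        self._settle_readers(addr)
        self.remap.pop(addr, None)  # whole-line store supersedes a pending copy
        self.mem[addr] = value

    def step(self):
        # Background engine: physically copy one pending line, if any.
        if not self.remap:
            return False
        d, s = self.remap.popitem()
        self.mem[d] = self.mem[s]
        return True
```

Loads see the copied values immediately, but every store must first check whether it would clobber data that a pending line still depends on.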
In practice, one is not going to allow MM to get farther than
the miss buffer ahead of a mispredict shadow.
Once the MM itself is non-speculative (i.e., branches/exceptions
in the path to it have all resolved), it seems an MM could
progress as far as permissions have been confirmed.
With its synchronous interface, it seems that for sufficiently
large MM operations one might want to context switch the MM off of
a high performance core and onto a MM engine so that the core
could be used for other work.
Given the cache misses (including predictors) from a context
switch, this might not be worthwhile even if the MM engine was
substantially more energy efficient.
Forcing a thread to pause while a "large" MM operation is done
_feels_ wrong.
I guess one could architect a fork+terminate operation that
would allow a non-speculative but physically incomplete copy to
commit and then "end" that thread while continuing in a "new"
thread from the instruction after the fork. Such an interface
seems clunky but might be more general than just having a
quasi-synchronous MM with similar behavior.