Newsportal USENET - Re: MM instruction and the pipeline

On 10/16/24 1:56 AM, Stephen Fuld wrote:

Even though this is about the MM instruction, and the MM instruction is mentioned in other threads, they have lots of other stuff (thread drift), and this isn't related to C, standard or otherwise, so I thought it best to start a new thread.
My questions are about what happens to subsequent instructions that immediately follow the MM in the stream when an MM instruction is executing. Since an MM instruction may take quite a long time (in computer time) to complete I think it is useful to know what else can happen while the MM is executing.

This would seem to be very implementation dependent.
Architecturally, no following instructions can execute until after
the MM completes. With respect to microarchitecture, an arbitrary
amount of parallelism could be provided.

I will phrase this as a series of questions.

While Mitch Alsup can answer these more authoritatively, I will
take a stab at them.

1. I assume that subsequent non-memory reference instructions can proceed simultaneously with the MM. Is that correct?

This would probably be true even for the in-order scalar
implementation.

2. Can a load or store where the memory address is in neither the source nor the destination of the MM proceed simultaneously with the MM?

This is a little more complicated than just marking a register as
not-ready (for a load destination), so might not be supported in
a simple implementation. Memory accesses would have to check both
ranges rather than just one of 32 register names or eight store
buffer entries.
Mitch Alsup's description of the small quasi-scalar core implies
to me that the MM instruction would occupy the memory access
interface until it is finished.
I would guess that any out-of-order implementation would support
loads and stores outside of the MM regions to proceed
speculatively until the various OoO buffering structures are
filled.
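
A minimal sketch of the two-range check mentioned above, written as
software purely for illustration. The function names, the half-open
byte ranges, and the assumption that source and destination share a
single length are hypothetical and do not describe any actual
implementation.

```python
def overlaps(a_start, a_len, b_start, b_len):
    """Return True if the half-open byte ranges
    [a_start, a_start + a_len) and [b_start, b_start + b_len) intersect.
    Lengths are assumed to be positive."""
    return a_start < b_start + b_len and b_start < a_start + a_len


def may_bypass_mm(addr, size, mm_src, mm_dst, mm_len):
    """Hypothetical policy: a load or store may proceed alongside the MM
    only if it touches neither the MM source nor the MM destination."""
    hits_src = overlaps(addr, size, mm_src, mm_len)
    hits_dst = overlaps(addr, size, mm_dst, mm_len)
    return not (hits_src or hits_dst)


# An 8-byte access far from both ranges may proceed; one inside the
# source range may not (under this conservative policy).
print(may_bypass_mm(0x3000, 8, mm_src=0x1000, mm_dst=0x2000, mm_len=0x100))
print(may_bypass_mm(0x10F8, 8, mm_src=0x1000, mm_dst=0x2000, mm_len=0x100))
```

In hardware the two comparisons would presumably be done in parallel
for every memory access, which is the extra cost relative to matching
one register name or a handful of store buffer entries.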

3. Can a load where the memory address is within the source of the MM proceed?

My guess would be that any OoO implementation would support this.
If the implementation checks for a hit in both ranges, it would
seem to be little extra effort to allow a load to a 'clean'
address to proceed.
Supporting this and preventing reads of the destination and all
stores would only require one address range check; loads can
proceed as long as they are not within the destination.
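
The cheaper single-range variant could be sketched as follows. This
is an illustration under stated assumptions only: the function name
is hypothetical, ranges are half-open byte ranges, and every store is
simply held until the MM finishes.

```python
def may_proceed_single_check(is_store, addr, size, mm_dst, mm_len):
    """Hypothetical single-range policy:
    - every store waits for the MM to finish;
    - a load may proceed unless it overlaps the MM destination
      [mm_dst, mm_dst + mm_len).
    Loads from the MM source are allowed, since the MM only reads it."""
    if is_store:
        return False
    hits_dst = addr < mm_dst + mm_len and mm_dst < addr + size
    return not hits_dst


# A load from the source range proceeds; a load from the destination
# range and any store are held back.
print(may_proceed_single_check(False, 0x1000, 8, mm_dst=0x2000, mm_len=0x100))
print(may_proceed_single_check(False, 0x2010, 8, mm_dst=0x2000, mm_len=0x100))
print(may_proceed_single_check(True, 0x9000, 8, mm_dst=0x2000, mm_len=0x100))
```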

For the next questions, assume for exposition that the MM has proceeded to complete 1/3 of the move when the following instructions come up.
4. Can a load in the first third of the destination range proceed?

I would guess that an out-of-order implementation would forward
data from all stores performed speculatively by the MM (limited by
the store queue). MM stores that are no longer speculative — where
an interrupt would place the count — would seem to be naturally
handled as if singular committed stores, i.e., following
instructions could speculatively execute using those values.
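
To make the "first third" case concrete, here is a rough sketch of a
progress-aware check. It assumes, purely for illustration, that the
MM copies in ascending address order and that a byte count `done`
marks how much of the destination is no longer speculative; the
function name and parameters are hypothetical.

```python
def load_sees_final_data(addr, size, mm_dst, mm_len, done):
    """Return True if a load of `size` bytes at `addr` can use memory
    (or forwarded store data) as if the MM stores were ordinary
    committed stores.

    Assumptions (not from any real design): the MM writes
    [mm_dst, mm_dst + mm_len) in ascending order and the first `done`
    bytes are already non-speculative."""
    if not 0 <= done <= mm_len:
        raise ValueError("done must lie within the MM length")
    hits_dst = addr < mm_dst + mm_len and mm_dst < addr + size
    if not hits_dst:
        return True  # the load is unaffected by the MM destination
    # Conservative: the load must lie entirely within the finished part.
    return mm_dst <= addr and addr + size <= mm_dst + done


# MM of 0x300 bytes, one third (0x100 bytes) complete.
print(load_sees_final_data(0x2080, 8, mm_dst=0x2000, mm_len=0x300, done=0x100))
print(load_sees_final_data(0x2100, 8, mm_dst=0x2000, mm_len=0x300, done=0x100))
```

A load that straddles the boundary between the finished and
unfinished parts is treated as not ready, which is the simplest safe
choice.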

5. Can a store in the first third of the source range proceed?

In the non-speculative region of the MM, speculative stores could "execute", storing to the store queue. These stores, along with all other instructions after the MM, would be squashed if the MM does not fully complete. The MM is synchronous.
A large MM that is no longer speculative might be implemented as
avoiding the store queue to allow more stores after the MM to be
speculated. For very large MMs, a copy engine farther from the
core might be used.

6. Can a store in the first third of the destination range proceed?

Since the MM has architecturally completed to roughly that point (some stores might only have "completed" to the store queue), it
would not be difficult to support speculative stores in the
completed range for an out-of-order implementation. These stores
would be rolled back if the MM does not fully complete and commit.

Here is a question that I will leave to Mitch:
Can an MM that has confirmed permissions commit before it has been
performed such that uncorrectable errors would be recognized not
on read of the source but on later read of the destination?

I could see some wanting to depend on the copy checking data
validity synchronously, but some might be okay with a quasi-
synchronous copy that allows the processor to continue doing work
outside of the MM.

If a translation map is provided for coherence, any MM could
commit once it is not speculative but before the actual copy has
been performed. Tracking what parts have been completed in the
presence of other stores would have significant overhead.

For page-aligned copies, a copy-on-write mechanism might be used.
There are also cache designs which support deduplication; cache
block aligned copies might be faster than physical copying. With
lossy/truncated cache compression, unaligned fragments might be
deduplicated (and read-for-ownership might be avoided similar to
having fine-grained valid bits).

I rather suspect that what is physically possible is far broader
than what is possible with a finite engineering budget.
