On 1/3/25 12:24 PM, Scott Lurndal wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
> [snip]
>> For MMIO device registers I think having an explicit SPCB instruction
>> might be better than putting a "no-speculate" flag on the PTE for the
>> device register address as that flag would be difficult to propagate
>> backwards from address translate to all the parts of the core that
>> we might have to sync with.
> MMIO accesses are, by definition, non-cacheable, which is typically
> designated in either a translation table entry or associated
> attribute registers (MTRR, MAIR). Non-cacheable accesses
> are not speculatively executed, which provides the
> correct semantics for device registers which have side effects
> on read accesses.
It is not clear to me that Memory-Mapped I/O requires
non-cacheable accesses. Some addresses within I/O device
address areas do not have access side effects. I would **GUESS**
that most I/O addresses do not have read side effects.
(One obvious exception would be implicit buffers where a read
"pops" a value from a queue allowing the next value to be accessed
at the same address. _Theoretically_ one could buffer such reads
outside of the I/O device such that old values would not be lost
and incorrect speculation could be rolled back — this might be a
form of versioned memory. Along similar lines, values could be
prefetched and cached as long as all modifiers of the values use
cache coherency. There may well be other cases of read side
effects.)
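To make the buffering idea concrete, here is a minimal software sketch
in C. It is only a model: fifo_dev, spec_buf and all the function names
are made up for illustration, and it has no bounds handling beyond a
couple of asserts. Popped values are held outside the "device" until
the reads commit, so a squashed speculative path replays them instead
of losing them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_DEPTH 8

/* Hypothetical device: every read of the data register pops one entry. */
struct fifo_dev {
    uint32_t data[FIFO_DEPTH];
    size_t head;   /* next entry the device will hand out */
    size_t count;  /* entries the device still holds */
};

/* Hypothetical buffer outside the device: keeps every value popped by a
 * speculative read until that read commits or is squashed. */
struct spec_buf {
    uint32_t held[FIFO_DEPTH];
    size_t n_held;    /* values popped from the device, not yet committed */
    size_t n_served;  /* of those, how many the current path has consumed */
};

/* Destructive read as the device sees it. */
uint32_t dev_pop(struct fifo_dev *d)
{
    assert(d->count > 0);
    uint32_t v = d->data[d->head];
    d->head = (d->head + 1) % FIFO_DEPTH;
    d->count--;
    return v;
}

/* Speculative read: replay held values first, pop the device only
 * when the held values are exhausted. */
uint32_t spec_read(struct fifo_dev *d, struct spec_buf *b)
{
    if (b->n_served == b->n_held) {
        assert(b->n_held < FIFO_DEPTH);
        b->held[b->n_held++] = dev_pop(d);
    }
    return b->held[b->n_served++];
}

/* Misspeculation: nothing is lost, the held values will be replayed. */
void spec_squash(struct spec_buf *b)
{
    b->n_served = 0;
}

/* Commit: consumed values are now architecturally read, so drop them
 * and keep any that were popped but not consumed. */
void spec_commit(struct spec_buf *b)
{
    size_t rest = b->n_held - b->n_served;
    for (size_t i = 0; i < rest; i++)
        b->held[i] = b->held[b->n_served + i];
    b->n_held = rest;
    b->n_served = 0;
}
```

With a device holding 10, 20, 30: two spec_read calls followed by
spec_squash make the next spec_read see 10 again, even though the
device itself has already moved on.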
In general writes require hidden buffering for speculation, but
write side effects can affect later reads. One possibility would
be a write that changes which buffer is accessed at a given
address. Such a write followed by a read of such a buffer address
must have the read presented after the write, so caching the read
address would be problematic.
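A toy model of that hazard, in C. Everything here (banked_dev,
window_cache, the function names) is hypothetical and is not meant to
describe any real device. The last function shows one possible remedy,
which is my assumption rather than a claim about existing hardware:
the select write also invalidates the cached copy of the window.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical device: a write to the select register changes which
 * internal buffer the single window address exposes. */
struct banked_dev {
    uint32_t bank[2];
    unsigned select;
};

/* Hypothetical one-entry cache of the window address. */
struct window_cache {
    int valid;
    uint32_t value;
};

/* Plain select write: the device switches banks, the cache is not told. */
void write_select(struct banked_dev *d, unsigned s)
{
    assert(s < 2);
    d->select = s;
}

/* Uncached read: always reflects the currently selected bank. */
uint32_t read_window_uncached(const struct banked_dev *d)
{
    return d->bank[d->select];
}

/* Cached read: fills on first use, then keeps returning the old value. */
uint32_t read_window_cached(const struct banked_dev *d,
                            struct window_cache *c)
{
    if (!c->valid) {
        c->value = d->bank[d->select];
        c->valid = 1;
    }
    return c->value;
}

/* Assumed remedy: the select write also invalidates the cached window,
 * so the next cached read refills from the new bank. */
void write_select_invalidating(struct banked_dev *d,
                               struct window_cache *c, unsigned s)
{
    write_select(d, s);
    c->valid = 0;
}
```

After a cached read of bank 0, a plain write_select(1) leaves
read_window_cached returning the bank 0 value while
read_window_uncached returns the bank 1 value.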
One weak type of write side effect would be similar to releasing
a lock, where with a weaker memory order one needs to ensure that
previous writes are visible before the "lock is released". E.g.,
one might update a command buffer on an I/O device with multiple
writes and lastly update an I/O device pointer to indicate that
the buffer was added to. The ordering required for this is weaker
than sequential consistency.
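The command-buffer case looks much like a release store. Below is a
sketch using C11 atomics, with ordinary memory standing in for the
device so that it can run; cmd_ring and the function names are
invented for the example. Note the assumption: C11 ordering only
covers CPU threads, so real device memory would need whatever barrier
or memory attribute the platform defines, which this does not show.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 4

/* Hypothetical command ring. On real hardware these would be device
 * addresses; here they are ordinary memory. */
struct cmd_ring {
    uint32_t slot[RING_SLOTS];
    _Atomic uint32_t tail;   /* "doorbell": number of slots published */
};

/* Producer: a plain store fills the slot, then a single release store
 * publishes it. The slot write cannot become visible after the tail. */
void ring_submit(struct cmd_ring *r, uint32_t cmd)
{
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_relaxed);
    assert(t < RING_SLOTS);   /* no wraparound in this sketch */
    r->slot[t] = cmd;
    atomic_store_explicit(&r->tail, t + 1, memory_order_release);
}

/* Consumer, standing in for the device: the acquire load pairs with
 * the release store, so every slot below the observed tail is visible. */
uint32_t ring_sum_published(struct cmd_ring *r)
{
    uint32_t t = atomic_load_explicit(&r->tail, memory_order_acquire);
    uint32_t sum = 0;
    for (uint32_t i = 0; i < t; i++)
        sum += r->slot[i];
    return sum;
}
```

Only the tail update carries an ordering constraint; the slot writes
before it may be buffered or combined freely, which is the sense in
which this is weaker than sequential consistency.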
If certain kinds of side effects are limited to a single device,
then the ordering of accesses to different devices may allow
greater flexibility in ordering. (This seems conceptually similar
to cache coherence vs. consistency, where "single I/O device"
corresponds to "single address". Cache coherence provides strict
consistency for a single address.)
I seem to recall that StrongARM exploited a distinction between
"bufferable" and "cacheable" marked in PTEs to select the cache
to which an access would be allocated. This presumably means
that the two terms had different consistency/coherence
constraints.
I am very skeptical that an extremely complex system with best
possible performance would be worthwhile. However, I suspect that
some relaxation of ordering and cacheability would be practical
and worthwhile.
I do very much object to treating non-cacheability as inherent in
the concept of memory-mapped I/O, even if existing software (and
hardware) and the prevailing development mindset make any
relaxation impractical.
Since x86 allowed a different kind of consistency for non-temporal
stores, it may not be absurd for a new architecture to present
a more complex interface, presumably with the option not to deal
with that complexity. Of course, the most likely result would be
hardware having to support the complexity with no actual benefit
from use.