Subject : Re: Arguments for a sane ISA 6-years later
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.arch
Date : 27. Jul 2024, 00:01:43
Organisation : A noiseless patient Spider
Message-ID : <v819ss$31ob5$1@dont-email.me>
References : 1 2 3 4 5
User-Agent : Mozilla Thunderbird
On 7/26/2024 3:59 PM, MitchAlsup1 wrote:
On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
On 7/25/2024 1:09 PM, BGB wrote:
At least with a weak model, software knows that if it doesn't go through
the rituals, the memory will be stale.
>
There is no guarantee of staleness, only a lack of stronger ordering
guarantees.
>
The weak model is ideal for me. I know how to program for it
>
And the fact that this model is so hard to use that few others know
how to program for it makes it ideal for you.
>
and it's more efficient
>
That depends on the hardware.
>
Yes, the Alpha 21164 with its imprecise exceptions was "more
efficient" than other hardware for a while, then the Pentium Pro came
along and gave us precise exceptions and more efficiency. And
eventually the Alpha people learned the trick, too, and 21264 provided
precise exceptions (although they did not admit this) and more
efficiency.
>
Similarly, I expect that hardware that is designed for good TSO or
sequential consistency performance will run faster on code written for
this model than code written for weakly consistent hardware will run
on that hardware.
According to Lamport, only the ATOMIC stuff needs sequential
consistency.
So, it is completely possible to have a causally consistent processor
that switches to sequential consistency when doing ATOMIC stuff and gain
performance when not doing ATOMIC stuff, and gain programmability when
doing atomic stuff.
Probably true.
The main thing that matters for consistency is things like mutex locks and shared buffers.
For most everything else, consistency can often be glossed over.
In a traditional weak model, one would flush the caches whenever locking a mutex or similar (to make sure that everything is written out before taking the lock, and that one's view of memory is up to date after taking it).
Here, one assumes that the only time memory is necessarily up to date is after acquiring a mutex lock (though, for good measure, one can also flush when releasing the lock, so that any other thread that gains the lock will have an up-to-date view of anything that happened between acquiring and releasing the mutex).
If a person is clever, they might try to sidestep the need for the cache flushing here, but this may result in any shared memory not being up to date.
This works a little better, though, if one assumes that any actively shared buffers are essentially read-only during the time each thread is doing its work (potentially followed by a consolidation phase, where all the threads flush their caches such that their views of memory are brought back in sync).
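As a minimal sketch of the lock/flush pattern (in C; the __flush_dcache() primitive is a made-up placeholder for whatever cache-flush mechanism the ISA actually provides, and the spinlock is assumed to sit on a real atomic RMW):

  #include <stdatomic.h>

  /* Hypothetical primitive: write back and invalidate this
     core's data cache (placeholder name, not a real API). */
  extern void __flush_dcache(void);

  typedef struct { atomic_int locked; } mutex_t;

  void mutex_lock(mutex_t *m)
  {
      /* Spin until the lock is taken. */
      while (atomic_exchange(&m->locked, 1))
          ;
      /* Post-lock flush: discard stale lines so that reads
         inside the critical section see up-to-date memory. */
      __flush_dcache();
  }

  void mutex_unlock(mutex_t *m)
  {
      /* Pre-unlock flush: write back everything done in the
         critical section before another thread can acquire. */
      __flush_dcache();
      atomic_store(&m->locked, 0);
  }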
That's because software written for weakly
consistent hardware often has to insert barriers or atomic operations
just in case, and these operations are slow on hardware optimized for
weak consistency.
The operations themselves are not slow. What is slow is delaying the
pipeline until it catches up to the stronger memory model before
proceeding.
How I attempted to do no-cache/volatile/atomic operations was roughly:
  If the cache line seen is not marked as volatile:
    Flush it;
    Fetch the line from memory, marking it as volatile;
    Do the operation;
    Set up a mechanism to auto-flush the line.
  If a volatile line is seen and we are not doing a volatile operation:
    Flush it.
  Auto-flush: once the volatile memory operation is no longer in
  progress, look at the cache line again, see that it is volatile,
  and flush it.
It is vaguely similar for TLB misses, where a TLB miss will load "whatever" from memory, but the line is flagged so that the cache will auto-flush it at the nearest opportunity once the offending operation completes (though, with the slight difference that TLB-missed lines can't be marked as Dirty by a Store operation).
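A rough software model of that line handling might look like the following (purely illustrative; the helper functions and the cache-line layout are made up for the sketch, not the actual core logic):

  #include <stdint.h>

  struct cache_line {
      uint64_t tag;
      int      valid, dirty;
      int      is_volatile;   /* fetched for a volatile/atomic op */
      uint8_t  data[64];
  };

  /* Hypothetical helpers (assumed, not a real API). */
  extern uint64_t line_tag(uint64_t addr);
  extern void flush_line(struct cache_line *ln);
  extern void fetch_line(struct cache_line *ln, uint64_t addr);
  extern void do_memory_op(struct cache_line *ln, uint64_t addr);
  extern void schedule_auto_flush(struct cache_line *ln);

  /* A volatile/atomic access: force a fresh copy of the line,
     mark it volatile, and arrange for it to be flushed again
     once the operation completes. */
  void access_volatile(struct cache_line *ln, uint64_t addr)
  {
      if (!ln->valid || ln->tag != line_tag(addr) || !ln->is_volatile) {
          flush_line(ln);          /* write back the old contents */
          fetch_line(ln, addr);    /* reload from memory          */
          ln->is_volatile = 1;
      }
      do_memory_op(ln, addr);
      schedule_auto_flush(ln);     /* flushed once the op ends    */
  }

  /* A normal access that hits a volatile line: evict it first. */
  void access_normal(struct cache_line *ln, uint64_t addr)
  {
      if (ln->valid && ln->is_volatile)
          flush_line(ln);
      /* ... normal hit/miss handling ... */
      (void)addr;
  }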
>
By contrast, one can design hardware for strong ordering such that the
slowness occurs only in those cases when actual (not potential)
communication between the cores happens, i.e., much less frequently.
How would you do this for a 256-way banked memory system of the
NEC SX ?? I.E., the processor is not in charge of memory order--
the memory system is.
Personally, I have little idea how something like TSO could scale to manycore systems, or to systems where there is non-trivial communication latency (such as when threads are running across a LAN, or maybe the internet).
Meanwhile, weak consistency models are easier to scale up to high latency.
Say, as opposed to local cache flushing, taking the shared mutex lock effectively involves sending all of the dirty pages back to a server over a TCP socket or similar, followed by re-downloading any of the shared pages afterwards (though maybe with the server being able to signal which pages are still up-to-date from the client's POV, avoiding the need to re-download them).
Granted, in high-latency contexts message-passing generally tends to become preferable to shared memory; but the definitions here can get fuzzy.
Like, depending on how one looks at it, multiple players connected to a Minecraft server could be considered as a usage of a high-latency shared memory system (with the Minecraft terrain being essentially a sort of shared memory; albeit expressed via message passing over TCP/IP).
Though, for general use, it would make sense to limit the scope of "shared memory" to something more resembling a traditional "mmap()" style operation (with a mechanism to detect when pages are dirty and then re-synchronize them).
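FWIW: user-level DSM systems often detect dirty pages by write-protecting the mapping and catching the resulting fault. A minimal POSIX sketch of that trick (error handling omitted, the actual send-over-the-socket part elided, and the dsm_* names made up here):

  #include <signal.h>
  #include <stdint.h>
  #include <string.h>
  #include <sys/mman.h>

  #define NPAGES 256
  #define PAGESZ 4096

  static uint8_t *shared_base;        /* the mmap()'d region */
  static int page_dirty[NPAGES];      /* set on first write  */

  /* Fault handler: a write hit a read-only page; mark the page
     dirty and make it writable so the store can be retried. */
  static void on_fault(int sig, siginfo_t *si, void *ctx)
  {
      (void)sig; (void)ctx;
      uintptr_t off = (uintptr_t)si->si_addr - (uintptr_t)shared_base;
      size_t pg = off / PAGESZ;
      page_dirty[pg] = 1;
      mprotect(shared_base + pg * PAGESZ, PAGESZ,
               PROT_READ | PROT_WRITE);
  }

  void dsm_init(void)
  {
      struct sigaction sa;
      memset(&sa, 0, sizeof sa);
      sa.sa_sigaction = on_fault;
      sa.sa_flags = SA_SIGINFO;
      sigaction(SIGSEGV, &sa, NULL);

      shared_base = mmap(NULL, NPAGES * PAGESZ, PROT_READ,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  }

  /* At a sync point: ship the dirty pages upstream, then
     re-protect everything so the next writes are caught. */
  void dsm_sync(void)
  {
      for (size_t pg = 0; pg < NPAGES; pg++) {
          if (page_dirty[pg]) {
              /* ...send page pg over the socket here... */
              page_dirty[pg] = 0;
          }
      }
      mprotect(shared_base, NPAGES * PAGESZ, PROT_READ);
  }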
>
and sometimes use cases do not care if they encounter "stale" data.
>
Great. Unless these "sometimes" cases occur more often than the cases
where you perform some atomic operation or barrier because of
potential, but not actual, communication between cores, the weak model
is still slower than a well-implemented strong model.
>
- anton