On 11/15/2024 6:51 PM, MitchAlsup1 wrote:
On Fri, 15 Nov 2024 23:35:22 +0000, BGB wrote:
On 11/15/2024 4:05 PM, Chris M. Thomasson wrote:
On 11/15/2024 12:53 PM, BGB wrote:
On 11/15/2024 11:27 AM, Anton Ertl wrote:
jseigh <jseigh_es00@xemaps.com> writes:
Anybody doing that sort of programming, i.e. lock-free or distributed
algorithms, who can't handle weakly consistent memory models, shouldn't
be doing that sort of programming in the first place.
>
Do you have any argument that supports this claim?
>
Strongly consistent memory won't help incompetence.
>
Strong words to hide lack of arguments?
>
>
In my case, as I see it:
The tradeoff is more about implementation cost, performance, etc.
>
Weak model:
Cheaper (and simpler) to implement;
Performs better when there is no need to synchronize memory;
Performs worse when there is need to synchronize memory;
...
[...]
>
TSO layered on top of a weak memory model is what it is. It should not necessarily
perform "worse" than other systems that have TSO as the default. The
weaker models give us flexibility: any weak memory model should be able
to give sequential consistency by using the right membars in the right
places.
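Just to illustrate "the right membars in the right places" (a minimal C11 sketch, not from anyone's actual code): the classic store/load pattern needs a full fence on a weak model, otherwise both threads may read 0 from the other's flag.

  #include <stdatomic.h>

  atomic_int x, y;   /* both start at 0 */

  int thread_a(void)
  {
      atomic_store_explicit(&x, 1, memory_order_relaxed);
      atomic_thread_fence(memory_order_seq_cst);   /* the membar */
      return atomic_load_explicit(&y, memory_order_relaxed);
  }

  int thread_b(void)
  {
      atomic_store_explicit(&y, 1, memory_order_relaxed);
      atomic_thread_fence(memory_order_seq_cst);   /* the membar */
      return atomic_load_explicit(&x, memory_order_relaxed);
  }

  /* With the fences, thread_a() and thread_b() cannot both return 0;
     without them, a weak model allows exactly that outcome. */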
>
>
The speed difference is mostly that, in a weak model, the L1 cache
merely needs to fetch memory from the L2 or similar, may write to it
whenever, and need not proactively store back results.
>
As I understand it, a typical TSO like model will require, say:
Any L1 cache that wants to write to a cache line, needs to explicitly
request write ownership over that cache line;
The cache line may have been fetched from a core which modified the
data, and handed this line directly to this requesting core on a
typical read. So, it is possible for the line to show up with
write permission even if the requesting core did not ask for write
permission. So, not all lines being written have to request ownership.
OK.
I think the bigger distinction is more that a concept of write ownership exists in the first place...
In my current memory model, there is no concept of write ownership.
Ironically, this also means the RISC-V LR/SC instructions don't make sense in my memory model, but this hasn't been a huge loss (they just sort of behave as if they worked).
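(For reference, a rough sketch of the usual LR/SC usage pattern, written as a C11 CAS retry loop, which on RISC-V typically compiles down to an lr.w/sc.w loop; the function name here is made up. The point is that LR/SC assumes the core can detect that some other core touched the line between the LR and the SC, which more or less presumes ownership tracking.)

  #include <stdatomic.h>

  static int fetch_add_sketch(atomic_int *p, int v)
  {
      int old = atomic_load_explicit(p, memory_order_relaxed);
      /* compare_exchange_weak refreshes 'old' on failure, so the
         loop retries with the latest value until the SC succeeds. */
      while (!atomic_compare_exchange_weak_explicit(
                 p, &old, old + v,
                 memory_order_acq_rel, memory_order_relaxed))
          ;
      return old;
  }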
Any attempt by other cores to access this line,
You are being rather loose with your time analysis in this question::
Access this line before write permission has been requested,
or
Access this line after write permission has been requested but
before it has arrived,
or
Access this line after write permission has arrived.
Yeah. I didn't really distinguish these cases...
It may well be different in a cache system where events are processed sequentially, rather than circling around on a ring bus (and processed in whatever order the requests happen to hit the L2 cache or similar).
Say, a request comes in for address 123 from core B (a rough code sketch follows below):
Write ownership held by A?
Send request to A to Flush 123;
Flag 123 as the flush having been requested;
To avoid repeating the request.
Ignore B's request for now (it then circles the bus);
Write ownership not held?
If the request was for write privilege:
Mark as held by B;
Send response to B's request.
If A receives a flush request:
Flush the cache line in question;
Write back the modified data, or send a FLUSH_ACK response or similar.
When L2 receives response:
Write data back to L2 if needed;
Mark cache line as no longer held.
It is less obvious what happens if an L2 miss occurs and the line at that location is still held.
Presumably all cores would need to flush any dirty lines before the line could be safely evicted from the L2 cache (in my current design, this scenario is ignored).
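As a very rough sketch (not the actual implementation; the type and helper names here are made up), the L2-side handling described above might look something like:

  #include <stdint.h>

  typedef struct {
      uint32_t addr;           /* line address (e.g. "123")              */
      int      owner;          /* core holding write ownership, or -1    */
      int      flush_pending;  /* flush already requested, avoid repeats */
  } l2_line_t;

  extern void send_flush_request(int core, uint32_t addr);
  extern void send_response(int core, uint32_t addr);
  extern void l2_write_back(uint32_t addr, const void *data);

  void l2_handle_request(l2_line_t *ln, int req_core, int want_write)
  {
      if (ln->owner >= 0 && ln->owner != req_core) {
          /* Ownership held elsewhere: ask the holder to flush (once),
             then ignore the request for now; it circles the bus. */
          if (!ln->flush_pending) {
              send_flush_request(ln->owner, ln->addr);
              ln->flush_pending = 1;
          }
          return;
      }
      if (want_write)
          ln->owner = req_core;         /* grant write ownership */
      send_response(req_core, ln->addr);
  }

  void l2_handle_flush_ack(l2_line_t *ln, const void *data, int dirty)
  {
      if (dirty)
          l2_write_back(ln->addr, data);  /* write data back to L2 if needed */
      ln->owner = -1;                     /* line no longer held             */
      ln->flush_pending = 0;
  }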
may require the L2 cache
to send a message to the core currently holding the cache line for
writing to write back its contents, with the request unable to be
handled until after the second core has written back the dirty cache
line.
L2 has to know something about how L1 has the line, and likely which
core cache the data is in.
Yeah.
More bookkeeping needed here...
Possibly though, L2 may not need to track the specific core, if it can send out a general message:
"Whoever holds line 123 needs to flush it."
The message then has the special behavior that it circles the whole bus without taking any shortcut paths, and is removed once it gets back around to the L2 cache (after presumably every other node on the bus has seen it), or is replaced by the appropriate ACK (if it hits an L1 cache that is holding the line in question).
Specifics likely to differ here between a message-ring bus, and other types of bus.
Possibly the comparatively high latency of a message ring would not be ideal in this case.
One other possibility for a bus could be a star network, where messages can either be point-to-point or broadcast. Say, point-to-point being used if both locations have a known address, and broadcast messages sent to every node on the bus.
Unclear if "hubs" on this bus would either need to know which "ports" correspond to which node address ranges, or simply broadcast any incoming message on all ports. Broadcast with no buffering would be cheapest/simplest, but would have overhead, and a potential for "collision" (where two nodes send a message at the same time, but don't yet see the other's message).
Likely, each hub would need a FIFO and basic message routing, but this would add cost (per-node cost is likely to be higher than that of forwarding messages along a ring).
But, there could be merit, say, if messages could get anywhere on the bus within a relatively small number of clock cycles.
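If it helps make the star-network idea concrete, per-hub routing might look roughly like the following (hypothetical sketch; bus_msg_t, port_matches(), and port_send() are made-up names):

  #include <stdint.h>

  #define HUB_PORTS     4
  #define DST_BROADCAST 0xFFFFu

  typedef struct {
      uint16_t dst;       /* destination node address, or DST_BROADCAST */
      uint64_t payload;
  } bus_msg_t;

  extern int  port_matches(int port, uint16_t dst);  /* dst in this port's range? */
  extern void port_send(int port, const bus_msg_t *m);

  void hub_route(int in_port, const bus_msg_t *m)
  {
      for (int p = 0; p < HUB_PORTS; p++) {
          if (p == in_port)
              continue;                        /* don't reflect back to the sender */
          if (m->dst == DST_BROADCAST || port_matches(p, m->dst))
              port_send(p, m);                 /* matching port, or broadcast to all */
      }
  }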
This would create potential for significantly more latency in cases
where multiple cores touch the same part of memory; albeit the cores
will see each others' memory stores.
One can ARGUE that this is a good thing as it makes latency part
of the memory access model. More interfering accesses=higher
latency.
OK.
>
So, initially, weak model can be faster due to not needing any
additional handling.
>
>
But... Any synchronization points, such as a barrier or locking or
releasing a mutex, will require manually flushing the cache with a weak
model.
Not necessarily:: My 66000 uses causal memory consistency, yet when
an ATOMIC event begins it reverts to sequential consistency until
the end of the event where it reverts back to causal. Use of MMI/O
space reverts to sequential consistency, while access to config
space reverts all the way back to strongly ordered.
In my case, RAM-like and MMIO accesses use different messaging protocols...
There is not currently any scheme in place to support consistency modeling for RAM-like access.
MMIO is ordered mostly as the L1 cache will not let anything more happen until it gets a response (so, the L1 cache forces sequential operation on its end). On the other end, the bridge to the MMIO bus will become "busy" and not respond to any more requests until the currently active request has been completed (so, it is a serialized "first come, first serve" as far as message arrival on the ringbus).
Atomic operations on the bus could likely be formed as a special form of MMIO SWAP request (with a few bits somewhere used to encode which operator to perform). Well, unless the only supported atomic operator is SWAP.
Likely it would depend on the target device for whether or not atomic operators are allowed.
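Something like the following could work for encoding the operator in a SWAP-style request (purely a hypothetical sketch; the message layout and names are made up). The responding end applies the operator and sends the old value back in the response:

  #include <stdint.h>

  enum { ATOM_SWAP = 0, ATOM_ADD = 1, ATOM_AND = 2, ATOM_OR = 3 };

  typedef struct {
      uint32_t addr;   /* target address                 */
      uint32_t data;   /* operand                        */
      uint8_t  op;     /* which atomic operator to apply */
  } swap_req_t;

  uint32_t mmio_swap_handle(volatile uint32_t *target, const swap_req_t *req)
  {
      uint32_t old = *target;
      switch (req->op) {
      case ATOM_SWAP: *target = req->data;        break;
      case ATOM_ADD:  *target = old + req->data;  break;
      case ATOM_AND:  *target = old & req->data;  break;
      case ATOM_OR:   *target = old | req->data;  break;
      }
      return old;   /* returned in the response message */
  }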
And, locking/releasing the mutex itself will require a mechanism
that is consistent between cores (such as volatile atomic swaps or
similar, which may still be weak as a volatile-atomic-swap would still
not be atomic from the POV of the L2 cache; and an MMIO interface could
be stronger here).
>
>
Seems like there could possibly be some way to skip some of the cache
flushing if one could verify that a mutex is only being locked and
unlocked on a single core.
>
Issue then is how to deal with trying to lock a mutex which has thus far
been exclusive to a single core. One would need some way for the core
that last held the mutex to know that it needs to perform an L1 cache
flush.
This seems to be a job for Cache Consistency.
Possibly so...
Though, one possibility could be to leave this part to the OS
scheduler/syscall/...
The OS wants nothing to do with this.
Unclear how to best deal with it...
Status quo:
Lock/release using system calls;
System calls always perform L1 flush
( ... if there were more than 1 core ... ).
Faster:
Lock/Release handled purely in userland;
Delay or avoid cache flushes.
Hybrid:
Try to have a fast-path in userland ("local core only" mutexes);
Fall back to syscalls if not fast-path (a rough sketch follows after this list).
Lazy hybrid:
Lock/release continue using system calls;
Nothing changes as far as userland cares.
Try to delay the L1 flushes.
Say, to save the ~20k clock-cycles this process eats.
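A rough sketch of what the "hybrid" fast path could look like (hypothetical; kmutex_t, sys_mutex_lock_slow(), and the 'multicore' flag are made-up names, and the slow path is assumed to do the L1 flush / rescheduling via the OS):

  #include <stdatomic.h>

  typedef struct {
      atomic_int locked;      /* 0 = free, 1 = held                      */
      int        multicore;   /* set once the mutex is seen from >1 core */
  } kmutex_t;

  extern void sys_mutex_lock_slow(kmutex_t *m);   /* syscall: flush L1, maybe reschedule */

  void kmutex_lock(kmutex_t *m)
  {
      if (!m->multicore &&
          atomic_exchange_explicit(&m->locked, 1, memory_order_acquire) == 0)
          return;              /* uncontended, local-core fast path */
      sys_mutex_lock_slow(m);  /* contended or multi-core: let the OS handle it */
  }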
Lazy flushing on syscalls and scheduler events seems possible, as (assuming the core isn't frozen) this will happen eventually.
This does mean a scenario could occur (where a previously assumed local-only mutex is in fact non-local) that takes an unreasonably long time to deal with (one core needing to wait until the other core does a system call or similar).
Note that if a mutex lock happens and can't be handled immediately, the general behavior is to mark the task as waiting on a mutex and then switch to a different task (this is otherwise similar to how calls like "usleep()" are handled; the task can be resumed once the mutex is no longer held).
Though, for now, TestKern is still purely single-processor.
But, not much motivation to invest in multicore TestKern when I can still generally only fit a single core on the XC7A100T.
Can at least go dual core on the XC7A200T, but hadn't really made any use of it (so the second core sits around mostly doing nothing in this case).
Where, in the single-core case, there is no real way to handle mutexes other than to reschedule the task.
So, some of this is still kinda theoretical.
Admittedly, it wasn't until fairly recently that TestKern got preemptive task scheduling. And, even then, there were still a lot of race-condition type bugs early on (well, partly stemming from the general lack of mutexes in many cases; and pretty much entirely absent in the kernel because, as-is, there is no way to actually resolve a mutex conflict in the kernel should one occur...).
Well, and for userland, I ended up with generally using "reschedule on syscalls" rather than "reschedule on timer IRQ", as "reschedule on syscalls" was slightly less prone to result in the sorts of race conditions that caused stuff to break (only uses timer IRQ as a fallback if the task has managed to hold the CPU for an unreasonable amount of time).
But, yeah, a lot is still "in theory" for now, actual state of TestKern still kinda sucks on this front...
mechanism; so the core that wants to lock the
mutex signals its intention to do so via the OS, and the next time the
core that last held the mutex does a syscall (or tries to lock the mutex
again), the handler sees this, then performs the L1 flush and flags the
mutex as multi-core safe (at which point, the parties will flush L1s at
each mutex lock, though possibly with a timeout count so that, if the
mutex has been single-core for N locks, it reverts to single-core
behavior).
>
This could reduce the overhead of "frivolous mutex locking" in programs
that are otherwise single-threaded or single processor (leaving the
cache flushes for the ones that are in-fact being used for
synchronization purposes).
>
The cost of mutex locking could almost be ignored...
Until of course people are trying to use otherwise frivolous mutex locks to protect things that are only ever accessed by a single thread (as has sort of become the style in many codebases), etc.
Or, say, burning extra clock-cycles in the name of "malloc()" being thread-safe (even if, much of the time, the mutex hiding inside the malloc/free calls or similar isn't actually protecting anything).
....