In article <7a093bbb356e3bda3782c15ca27e98a7@www.novabbs.org>,
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Wed, 26 Mar 2025 4:08:57 +0000, Dan Cross wrote:
[snip]
1. `from` may point to a cache line.
2. `to` may point to a different cache line.
3. Does your architecture use a stack?
>
No
You make multiple references to a stack just below. I
understand stack references are excluded from the atomic event,
but I asked whether the architecture uses a stack. Does it, or
doesn't it?
What sort of alignment criteria do you impose?
>
On the participating data, none. On the cache line--cache-line alignment.
I meant for data on the stack. The point is moot, though, since
as you said stack data does not participate in the atomic event.
At first blush, it seems to me that
the pointers `from_next`, `from_prev`, and `to_next` could be
on the stack and if so, will be on at least one cache line,
and possibly 2, if the call frame for `MoveElement` spans
more than one.
>
The illustrated code is using 6 participating cache lines.
Where local variables are kept (stack, registers) does not
count against the tally of participating cache lines.
Huh. How would this handle something like an MCS lock, where
the lock node may be on a stack, though accessible globally in
some virtual address space?
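To make the question concrete, here's the usual MCS shape I have
in mind, sketched with C11 atomics rather than your ESM
primitives (so the names here are mine, not yours); the salient
point is that the node a waiter spins on is typically a stack
local, while other cores reach it through the lock's tail pointer:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct MCSNode MCSNode;
struct MCSNode {
	_Atomic(MCSNode *) next;
	atomic_bool locked;
};

typedef struct {
	_Atomic(MCSNode *) tail;
} MCSLock;

void
mcs_acquire(MCSLock *lk, MCSNode *self)	/* `self' often lives on the caller's stack */
{
	atomic_store(&self->next, NULL);
	atomic_store(&self->locked, true);
	MCSNode *prev = atomic_exchange(&lk->tail, self);
	if (prev != NULL) {
		/* `prev' may point into another thread's stack frame. */
		atomic_store(&prev->next, self);
		while (atomic_load(&self->locked))
			;	/* spin on our own, possibly stack-resident, node */
	}
}

(Release is symmetric: the owner writes its successor's `locked`
flag, again a location that may well be on some other thread's
stack.)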
Of course, it's impossible to say without knowing how this
is compiled, but maybe you require that the stack is always
aligned to a cache line boundary or something.
>
Data stack is DoubleWord aligned at all times.
Ok. So just for context, the point of this question was to
understand whether local data participating in one of these
atomic events could span multiple cache lines. Since they
don't participate, it doesn't matter vis-a-vis the current
discussion; presumably these are all in registers and it's fine.
>
Local variables are not participating in the event--they are just
around to guide the event towards conclusion.
Ok.
4. Depending on the size and alignment criteria of the
`Element` structure and the placement of the `next` and
`prev` elements in the struct (unknowable from this example),
`to->next` may be on another cache line.
5. Similarly, `from->next`
6. And `from->prev`
7. And `from_prev->next`
8. And `from_next->prev`
9. And `to_next->prev`.
>
So potentially this code _could_ touch 9 or 10 cache lines, more
than you said are supported for a single atomic event. What
happens in that case? Does the code just always return false?
The programmer had best make sure the optimizer is keeping those
temporary pointers off the stack; good luck with a debug build.
>
Anyway, let's assume that's not the case, and we don't hit the
pathological case with cache lines, and we're not dealing with
edge cases at the front or end of the lists (so various next or
prev pointers are not NULL). Why don't I try to explain what I
think this code is doing, and you tell me whether I'm right or
wrong.
>
If I understand this correctly, the `esmLOCKload`s are simply
loads, but they make sure that the compiler emits instructions
that tie into your atomic support framework.
>
So far so good.
>
Earlier you wrote,
"esmINTERFERENCE is a query on the state of th event and whether
a 2nd party has performed a memory action that kills the event."
Ok, I take this to mean _at the time of the store_, which
appears later in the program. If it were simply a query at the
time it was written in the code, then it seems like it opens a
TOCTOU bug (what if something interferes after the check, but
before the locked store at the end?). Or perhaps it signals
to the architecture that this is the point to which the
processor should rollback, if anything subsequently fails; sort
of a `setjmp` kind of thing that conceptually returns twice.
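The shape I have in mind is the familiar `setjmp` idiom; purely
an analogy on my part, not a claim about the actual mechanism:

#include <setjmp.h>

static jmp_buf rollback_point;

void
attempt_event(void)
{
	if (setjmp(rollback_point) == 0) {
		/* First return: proceed and attempt the stores. */
	} else {
		/* Second return: we were "rolled back" to here. */
	}
}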
>
The "if( esmINTERFERENCE() )" not only queries the state of the
participating cache lines, it sets up a guarding-shadow over
the STs to follow such that if killing-interference is detected
none of the STs are performed AND control is transferred to the
branch label (outside of the event). So not only does it query
the state at its execution, it monitors that over the rest of
the event.
>
BUT ALSO: it initializes the ability to NAK SNOOPs. The thought
is that if we have reached the manifestation phase of the event
(that is, ready to do all the stores), then we should be able to
drive the event forward by not sending the data in response to
the SNOOP: we send a NAK or the data, based on priority.
>
If the SNOOPing core receives a NAK and it is not attempting an
event, the request is simply reissued. If it is attempting, then
its event fails. So a core benignly accessing data that another
core is attempting an event upon is delayed by an interconnect
round trip--but if it is attempting, it knows its event will
fail and it will either retry from before the event started,
or it will goto the interference-label where other facilities
are at its disposal.
Ok.
Assuming nothing has poisoned the event, the various pointer
stores inside the conditional are conditionally performed; the
`esmLOCKstore` to `from->next` commits the event, at which
point all side effects are exposed and durable.
>
From a SW perspective, that is accurate enough.
>
From a HW perspective, the participating STs are performed and
held in a buffer, and when it is known the event has succeeded,
a bit is flipped and these writes to memory all appear to have
been performed in the same cycle: one cycle earlier, no
interested 2nd party has seen any of them; the subsequent cycle,
any interested 2nd party will see them all.
>
I should also note: ESM works even when there are no caches in
the system. ...
Ok, good to go.
Is that correct? Let's suppose that it is.
>
But later you wrote,
>
So, if you want the property whereby the lock disappears on any
control transfer out of the event {exception, interrupt, SVC, SVR, ...};
then you want to use my ATOMIC stuff; otherwise, you can use the
normal ATOMIC primitives everyone and his brother provide.
>
Well, what precisely do you mean here when you say, "if you want
the property whereby the lock disappears on any control transfer
out of the event"?
>
If you want that property--you use the tools at hand.
If you don't, just use them as primitive generators.
I wasn't clear enough here. I'm asking what, exactly, you mean
by this _property_, not what you mean when you write that one
can use these atomic events if one wants the property. That is,
I'm asking you to precisely define what it means for a lock to
"disappear".
It seems clear enough _now_ that once the event concludes
successfully the lock value is set to whatever it was set to
during the event, but that wasn't clear to me earlier and I
wanted confirmation.
In particular, it seems odd to me that one would bother with a
lock of some kind during an event if it didn't remain set after
the event. If you do all of your critical section stuff inside
of the event, and setting the lock in the event is not visible
until the event concludes, why bother? If, on the other hand, you
use the event to set the lock, why bother doing additional work
inside the event itself? And in that case I definitely don't
want something else to come along and just unset the lock on its
own, higher priority or not.
What I _hope_ you mean is that if you transfer "out of the
event" before the `esmLOCKstore` then anything that's been done
since the "event" started is rolled back, but NOT if control
transfer happens _after_ the `esmLOCKstore`. If rollback could
happen even after the commit, then this entire mechanism would
be useless.
>
One can use an esmLOCKstore( *SP, 0 ) to abandon an event. HW
detects that *SP is not participating in the event, and uses
the lock bit in the instruction to know it is time to leave
the event in the failed manner. No ST is performed to the
non-participating cache line. {{I guess I could wrap the
abandonment into a macro ...}}
That's not what I was asking. I'm not interested in what
happens if I transfer control out of the event myself, as it is
obviously my responsibility to ensure whatever contract the
hardware requires in that case is fulfilled. I'm interested in
what happens when control is usurped by some higher priority thing.
Perhaps you would clarify?
>
But for the time being, let's assume that `esmLOCKstore` is
conceptually a commit that ends the atomic event, and the store
is durable and observable after that. Let's see if we can
construct a simple spinlock from these primitives.
>
Notice that only the last ST has the esmLOCKstore() applied.
HW knows that STs to participating lines have the ATOMIC
property. The final ST uses the LOCK bit to denote the end
of the event.
Again, not what I was asking. I'm trying to understand what you
meant when you talked about disappearing locks.
Again, I think you meant that, if an atomic event that sets a
lock variable fails, whether it got far enough along to set the
variable inside the event or not is not observable if the event
fails. But if the event succeeds, then the variable is set and
that's that. IF that's the case, then great! But if not, then
disappearing locks are bad. This is especially murky to me
because there seem to be bits of OS primitives, like threads and
priority, that are known directly to the hardware.
typedef struct Spinlock Spinlock;
struct Spinlock {
	uint32_t lock;
};
>
uint32_t
xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
{
	uint32_t v = esmLOCKload(&lk->lock);
	if (v == expected && !esmINTERFERENCE()) {
		esmLOCKstore(&lk->lock, val);
	}
	return v;
}
>
If the interference check is desired at all (see below)::
I think you should move the interference check in front of the
expected check, but that is a first impression--I have to think
about it more.
Probably, given that it sets up a guard over the event, as you
wrote above.
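Something like this, I suppose (the same hypothetical intrinsics
as above, with the checks reordered per your suggestion):

uint32_t
xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
{
	uint32_t v = esmLOCKload(&lk->lock);
	if (!esmINTERFERENCE() && v == expected)
		esmLOCKstore(&lk->lock, val);
	return v;
}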
void
spin_acquire(volatile Spinlock *lk)
{
	splhi();
	while (xcas(lk, 0, 1) != 0)
		relax(); // Spin loop intrinsic.
}
>
void
spin_release(volatile Spinlock *lk)
{
	xcas(lk, 1, 0);
	// Maybe we should `spllo` here, maybe not. Either choice
	// reflects lots of assumptions about whether or not we might
	// take and hold multiple locks concurrently, etc.
	spllo();
}
>
Does this seem correct? It is not immediately clear to me
whether `esmINTERFERENCE` is actually needed here, but
avoiding the store seemed fine.
>
There are MANY cases where the interference query is not needed.
Without the interference query, deadly interference backs the
core up to the first instruction of the event. With the interference
check, if deadly interference happens before the query, control
reverts to before the event, and afterwards, control is transferred
to the interference-label.
>
One could imagine shortening
`xcas` to simply:
>
uint32_t
xcas(volatile Spinlock *lk, uint32_t expected, uint32_t val)
{
	uint32_t v = esmLOCKload(&lk->lock);
	if (v == expected) esmLOCKstore(&lk->lock, val);
	return v;
}
>
Yes, unless you want the alternate exit strategy, this works fine.
In fact test and set is written::
>
bool TestAndSet( char *p )
{
	bool t = esmLOCKload( *p );
	esmLOCKstore( *p, 1 );
	return t;
}
>
Should interference transpire between load and store, control
reverts back to the LD again, so the return value is the LD
which did not encounter interference.
Ok, so perhaps this is a reasonable implementation of spinlock
acquire, as I wrote to Stefan yesterday:
void
spinlock_acquire(volatile Spinlock *lk)
{
	while (esmLOCKload(&lk->lock) != 0)
		cpu_spin_hint();
	esmLOCKstore(&lk->lock, 1);
}
Yes? If this is preempted, or another CPU wins the race on the
lock, the hardware will back this up and start over in the while
loop, right?
For the sizes of CSs this covers--it prevents priority inversion.
>
Except you've admitted that this is not a general solution, as
it cannot "cover" arbitrary software-defined critical sections:
the window isn't big enough and we've seen how even a fairly
small problem may exceed the number of cache lines supported by
one of these events.
>
Remember only the lines participating in the event count. For
example, say one had sprintf() inside the event. None of the
lines touched by sprintf() are participating in the event,
and are not counted against the 8 cache lines available.
This doesn't change the point, though. Anything that touches
more than 8 cache lines is not suitable for an atomic event.
Insertion into a doubly linked list is already up to 6 lines;
recall that my original hypothetical also involved popping off a
queue, which adds another (the queue head). If I threw in
incrementing a couple of counters, I couldn't do it. And this seems
like a perfectly reasonable real-world scenario; consider a
work-stealing thread scheduler that moves a task from one core to
another and keeps some stats on how things move.
Or maybe even just swapping places between two elements in a
linked list. For example:
void
swap_places(Node **head, Node *a, Node *b)
{
	Node *an, *ap, *bn, *bp;
	assert(head != NULL);
	assert(a != NULL);
	assert(b != NULL);
	if (a == b)
		return;
	if (*head == a)
		*head = b;
	else if (*head == b)
		*head = a;
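	/*
	 * N.B.: this sketch assumes a and b are not adjacent; if,
	 * say, a->next == b, then `b->next = an' below would make
	 * b point at itself. The adjacent case needs special
	 * handling, which I've elided here.
	 */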
	an = a->next;
	ap = a->prev;
	bn = b->next;
	bp = b->prev;
	b->next = an;
	if (an != NULL)
		an->prev = b;
	b->prev = ap;
	if (ap != NULL)
		ap->next = b;
	a->next = bn;
	if (bn != NULL)
		bn->prev = a;
	a->prev = bp;
	if (bp != NULL)
		bp->next = a;
}
Perhaps we might wrap this in an atomic event by annotating it
like so:
void
swap_places(Node **head, Node *a, Node *b)
{
	Node *an, *ap, *bn, *bp;
	assert(head != NULL);
	assert(a != NULL);
	assert(b != NULL);
	if (a == b)
		return;
	esmLOCKprefetch(*head);
	if (*head == a)
		*head = b;
	else if (*head == b)
		*head = a;
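	/*
	 * As above, this assumes a and b are not adjacent in the
	 * list; the adjacent case would need separate handling.
	 */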
	an = esmLOCKload(a->next);
	ap = esmLOCKload(a->prev);
	bn = esmLOCKload(b->next);
	bp = esmLOCKload(b->prev);
	b->next = an;
	if (an != NULL) {
		esmLOCKprefetch(an->prev);
		an->prev = b;
	}
	b->prev = ap;
	if (ap != NULL) {
		esmLOCKprefetch(ap->next);
		ap->next = b;
	}
	a->next = bn;
	if (bn != NULL) {
		esmLOCKprefetch(bn->prev);
		bn->prev = a;
	}
	if (bp != NULL) {
		esmLOCKprefetch(bp->next);
		bp->next = a;
	}
	esmLOCKstore(a->prev, bp);
}
By my count, that could be touching 9 cache lines, even with
all of the various structures well-aligned; indeed, if the Node
type were annotated with a constraint that it be aligned on a
cacheline boundary, that would be maximally pessimistic in this
regard.
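To spell out the arithmetic: the nine potentially distinct lines
are *head, a->next, a->prev, b->next, b->prev, an->prev,
ap->next, bn->prev, and bp->next; that's 1 + 2 + 2 + 4 = 9 if
none of those lvalues happens to share a line with another, one
more than the eight you said an event supports.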
Anyway, I'd like to understand what the hardware does when a
program attempts an event where the conditions for success
cannot be met.
---------------------
So now I, as a systems software person, have to ask design
question around how I want to use these architectural features,
if at all, weighing various pros and cons. This is where I'm
struggling with your proposal: given that I'm going to have to
implement more general support for the things that the
architecture gives me anyway, what do I gain by also supporting
the specialized mechanisms?
>
All I can say about this is that when I presented ASF (ESM's
predecessor) to Microsoft and spent a couple of hours explaining
to them how it works, MS wanted it implemented as rapidly as
AMD could pull it off. Dave Cutler was particularly impressed.
I'm not sure that a Dave C. ref is the flex you are making it
out to be.
And here I thought you had an imagination .....
>
You admitted you're not an OS person and asked other people to
poke holes in your proposal. I'm sorry if you find the holes
I'm poking upsetting.
>
I enjoy your poking of holes--I really do
>
But you seem like the calculus student that keeps asking what
problem can I solve now that I can differentiate.
Huh. Well, I don't know what to tell ya about that, guy.
Look, here's the deal. This all sounds neat. And my point is
not to bore you with a bunch of impudent challenges. You've
obviously had a long and distinguished career, and I can't
imagine what purpose that would serve, other than being a jerk.
I'm not interested in that.
But there isn't a lot of publicly available documentation for
any of this, at least none that I could find, and from the way
it's been described thus far it seems like there are a number of
edge cases that it is not clear how one is supposed to address.
So I want to understand the boundaries of where these hardware
facilities are applicable, given architectural limitations. I
think this thread has shown that those boundaries exist. The
universe of interesting and useful software is big, and someone,
somewhere, is going to trip over those limitations sooner or
later. Once we accept that, we must ask, "Then what?" It seems
obvious that programmers are going to have to find techniques to
work around those limitations, and that the mechanisms chosen
are unknowable to hardware.
But now we're back in the same boat of software having to deal
with all the thorny problems that these mechanisms were designed
to address in the first place, like priority inversion. Take the
spinlock from earlier: the hardware has no semantic knowledge
that the lock is used as a mutex, and cannot know; it's just some
bit in memory somewhere that was set during an atomic event.
Maybe ESM helps in a race when trying to acquire the spinlock,
but once it's acquired, it's acquired, and because the spinlock
is not a primitive understood by the hardware, the hardware has
no way of incorporating the spinlock's semantics into its view
of priority, threading, etc.
So while an atomic event might help me in the implementation of
a generic mutex of some kind, so that I can protect critical
sections that can't be handled with the atomic event framework
directly because of their size, that's the extent of the help it
can give me, because once the mutex is acquired the hardware has
no insight into the fact that it's a mutex. So if some higher
priority thing subsequently comes along and preempts the thread
holding that lock and tries to take that lock itself, then the
hardware can't help me out anymore, and I'm back to a
priority-inversion deadlock that has to be dealt with by
software.
The point is, even if the hardware gives me a cool
mechanism for helping prevent priority inversion problems,
programmers are inevitably going to have to handle those
situations themselves anyway.
This is the sort of thing I'm trying to communicate; it's really
not me sitting here tossing out barbs like, "oh yeah?! Well,
how about THIS one, smart guy?"
To use your mathematics analogy, you are describing a technique
for differentiating real single-variable continuous functions.
I'm pointing out that this is not applicable to all functions,
because discontinuous functions and higher order spaces exist,
and giving examples of such functions that are important parts
of real-world problems, not just abstract distractions.
I can't help feel like you're telling me to just stick to
polynomials over R because calculus is useful. Calculus may be
useful, but sometimes you really do need analysis.
- Dan C.