In article <jwv4izfncov.fsf-monnier+comp.arch@gnu.org>,
Stefan Monnier <monnier@iro.umontreal.ca> wrote:
Leaving aside the unchecked boundary conditions I highlighted
above, there are other questions that this raises. You
mentioned that the ESM stuff can use up to 8 cache lines in a
single atomic "event"; let's count how many this code may
encounter:
>
1. `from` may point to a cache line
2. `to` may point to a different cache line.
3. Does your architecture use a stack? What sort of alignment
criteria do you impose? At first blush, it seems to me that
the pointers `from_next`, `from_prev`, and `to_next` could be
on the stack and if so, will be on at least one cache line,
and possibly 2, if the call frame for `MoveElement` spans
more than one. Of course, it's impossible to say without seeing
how this is compiled, but maybe you require that the stack is
always aligned to a cache line boundary or something. Or maybe
these are all in registers and it's fine.
>
AFAIK the stack accesses will use normal loads that won't be tracked for
conflicts, so they don't count towards the 8 cache-lines limit.
Ok, good to go.
4. Depending on the size and alignment criteria of the
`Element` structure and the placement of the `next` and
`prev` elements in the struct (unknowable from this example),
`to->next` may be on another cache line.
>
The coder may need to be careful with such slot placement and
alignment, indeed.
Yup. This could easily happen. Consider,
#include <stdalign.h>
#include <stdint.h>
#include <stdio.h>

struct elem {
	struct elem *next;
	struct elem *prev;
	uint32_t data;
};

struct elem arr[40];

int
main()
{
	printf("sizeof elem: %zu\n", sizeof(struct elem));
	printf("alignof elem: %zu\n", alignof(struct elem));
	printf("sizeof(arr): %zu\n", sizeof(arr));
	return 0;
}
On, say, x86_64 with the SVR4 ABI, this has alignment 8 and
sizeof 24; presumably cache line size is a power of two, so
some of the elements of that array are going to straddle a
cache line boundary.
So potentially this code _could_ touch 9 or 10 cache lines, more
than you said are supported for a single atomic event. What
happens in that case? Does the code just always return false?
>
I believe the code will indeed always return false in that case.
That doesn't seem great. Surely there must be some way to
detect that the code did not meet the criteria for an atomic
event? What is the software recourse for recovery in this case?
If I understand this correctly, the `esmLOCKload`s are simply
loads, but they make sure that the compiler emits instructions
that tie into your atomic support framework. Earlier you wrote,
"esmINTERFERENCE is a query on the state of the event and whether
a 2nd party has performed a memory action that kills the event."
Ok, I take this to mean _at the time of the store_, which
appears later in the program.
>
No, I believe it is at the time when `esmINTERFERENCE` is called.
>
If it were simply a query at the time it was written in the code, then
it seems like it opens a TOCTOU bug (what if something interferes
after the check, but before the locked store at the end?).
>
IIUC the way this works is that in case there's a conflict later, the
other core will get a NACK (i.e. we'll win and they'll lose, even if
they have a higher priority) so we're guaranteed that we'll finish our
transaction successfully. Of course, that can be guaranteed only if we
already have "locked" the cache lines we'll need, which is what the
`esmLOCKprefetch` calls are about.
>
IOW the `esmINTERFERENCE` call is the time where the CPU decides whether
to "commit", and then the code between `esmINTERFERENCE` and
`esmLOCKstore` is the code which actually stores the changes we
decided to commit.
Curious. So what happens if a higher priority interrupt comes
in after the `esmINTERFERENCE` call, but before `esmLOCKstore`?
Earlier, Mitch wrote:
|My architecture has a mechanism to perform ATOMIC stuff over multiple
|instruction time frames that has the property that a higher priority
|thread which interferers with a lower priority thread, the HPT wins
|and the LPT fails its ATOMIC event. It is for exactly this reason
|that I drag priority through the memory hierarchy--so that real-time
|things remain closer to real-time (no or limited priority inversions).
This implies to me that the higher-prio interrupt would take
precedence, preempting the lower-priority critical section, and
the hardware would reuse the branch-prediction repair facility
to undo the effects of the atomic event before the store. But
what you wrote suggests that instead, perhaps, the higher
priority interrupt would be deferred until the store, in which
case, what happens if the code does something goofy like,
	if (!esmINTERFERENCE())
		for (;;) { /* Priority inversion */ }
I suppose this hinges on what it means for a thread to "fail its
ATOMIC event".
The thing that makes the most sense to me, given what you said
about the hardware automatically retrying, is that the point of
failure would be at the store. If that fails, the entire thing
is rewound and restarted from the first `esmLOCKload` since the
last `esmLOCKstore`; if `esmINTERFERENCE` returns true, it's
because `esmLOCKstore` has already been attempted and failed.
That would address the "bonehead spins forever with interrupts
deferred" thing.
IIUC, the `esmINTERFERENCE` call is not needed, in which case the
operation should retry as many times as needed to succeed.
Ok. So,
void
spinlock_acquire(volatile Spinlock *lk)
{
	while (esmLOCKload(&lk->lock) != 0)
		relax();
	esmLOCKstore(&lk->lock, 1);
}
Should be sufficient?
Assuming nothing has poisoned the event, the various pointer
stores inside the conditional are conditionally performed; the
`esmLOCKstore` to `from->next` commits the event, at which
point all side effects are exposed and durable.
>
The `esmLOCKstore` marks the end of the transaction. IIUC it can also
mark the "commit" but only if `esmINTERFERENCE` isn't used.
>
So now I, as a systems software person, have to ask design
questions about how I want to use these architectural features,
if at all, weighing various pros and cons. This is where I'm
struggling with your proposal: given that I'm going to have to
implement more general support for the things that the
architecture gives me anyway, what do I gain by also supporting
the specialized mechanisms?
>
I think the expectation is that you'd *use* those specialized mechanisms
but you wouldn't *support* them.
Support in the sense that if I have software that uses them, I
need to support that software.
Right now, I'm not seeing that it is. "It solves priority
inversion for the size of problems that it's applicable to" is
not enough to justify the additional cost of adoption. I have
to solve that problem anyway, whether I use this or not.
>
The ESM stuff is what you'll use at the assembly-level instead of T&S,
C&S, LL/SC because that's what the CPU supports/provides. But if you
prefer to code against T&S, C&S, LL/SC, AFAIK it'll work as well,
because you can "simulate" them easily with ESM.
The properties you mentioned earlier vis-a-vis automatic retries and
code density sound attractive. The rest surrounding it, such as
the hardware's notion of a thread, priority, etc, less so. On
balance I remain skeptical that the tradeoff is worth it.
- Dan C.