Re: Memory ordering

Subject: Re: Memory ordering
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: 30 Jul 2024, 10:51:46
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Jul30.115146@mips.complang.tuwien.ac.at>
References: 1 2 3 4 5 6 7
User-Agent: xrn 10.11
mitchalsup@aol.com (MitchAlsup1) writes:
> On Mon, 29 Jul 2024 13:21:10 +0000, Anton Ertl wrote:
>
>> mitchalsup@aol.com (MitchAlsup1) writes:
>>> On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
>>>> Similarly, I expect that hardware that is designed for good TSO or
>>>> sequential consistency performance will run faster on code written for
>>>> this model than code written for weakly consistent hardware will run
>>>> on that hardware.
>>>
>>> According to Lamport, only the ATOMIC stuff needs sequential
>>> consistency.
>>> So, it is completely possible to have a causally consistent processor
>>> that switches to sequential consistency when doing ATOMIC stuff and gains
>>> performance when not doing ATOMIC stuff, and gains programmability when
>>> doing atomic stuff.
>>
>> That's not what I have in mind.  What I have in mind is hardware that,
>> e.g., speculatively performs loads, predicting that no other core will
>> store there with an earlier time stamp.  But if another core actually
>> performs such a store, the usual misprediction handling happens and
>> the code starting from that mispredicted load is reexecuted.  So as
>> long as two cores do not access the same memory, they can run at full
>> speed, and there is only a slowdown if there is actual (not potential)
>> communication between the cores.
>
> OK...
>
>> A problem with that approach is that this requires enough reorder
>> buffering (or something equivalent; there may be something cheaper for
>> this particular problem) to cover at least the shared-cache latency
>> (usually L3, more with multiple sockets).
>
> The depth of the execution window may be smaller than the time it takes
> to send the required information around and have this core recognize
> that it is out-of-order wrt memory.

So if we don't want to stall for memory accesses all the time, we need
a bigger execution window, either by making the reorder buffer larger,
or by using a different, cheaper mechanism.
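
To put rough numbers on it (ballpark figures assumed for illustration):
with a shared-L3 round trip of around 40 cycles and a 4-wide core,
covering just that latency already means keeping on the order of 160
instructions in flight, and a cross-socket round trip of several
hundred cycles pushes that well beyond what current reorder buffers
provide.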

Concerning the cheaper mechanism, what I am thinking of is hardware
checkpointing every, say, 200 cycles or so (subject to fine-tuning).
The idea here is that communication between cores is very rare, so
rolling back more cycles than the minimal necessary amount costs
little on average (except that it looks bad on cache ping-pong
microbenchmarks).  The cost of such a checkpoint is (at most) the
number of architectural registers, plus the aggregated stores between
the checkpoint and the next one.  Once the global time reaches the
timestamp of checkpoint N+1 of a core, checkpoint N of that core
can be released (i.e., all its instructions committed) and all its
stores can be committed (and checked against speculative loads in other
cores).  If it turns out that an uncommitted load's result has been
changed by a store committed by another core, the core rolls back to
the latest checkpoint before that load, and the program is re-executed
starting from that checkpoint.
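
To make the bookkeeping concrete, here is a minimal C sketch of the two
checks involved (releasing a checkpoint, and deciding whether a store
committed by another core forces a rollback).  All names, sizes and
structure layouts are assumptions for illustration only, not a
description of real hardware:

#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only; everything here is an assumption. */

enum { CHECKPOINT_INTERVAL = 200,   /* cycles between checkpoints (tunable) */
       MAX_ACCESSES = 64 };

struct access { uint64_t addr; uint64_t time; };  /* address + global timestamp */

struct checkpoint {
  uint64_t time;                        /* global time when the checkpoint was taken */
  struct access stores[MAX_ACCESSES];   /* stores buffered since this checkpoint */
  size_t nstores;
  struct access loads[MAX_ACCESSES];    /* speculative loads since this checkpoint */
  size_t nloads;
};

/* Checkpoint N of a core can be released (its instructions committed,
   its buffered stores made globally visible) once the global time has
   reached the timestamp of checkpoint N+1 of that core. */
static int can_release(const struct checkpoint *next_cp, uint64_t global_time)
{
  return global_time >= next_cp->time;
}

/* A store committed by another core forces a rollback if this core has
   an uncommitted load from the same address with a later timestamp; the
   core then re-executes from the latest checkpoint before that load. */
static int must_roll_back(const struct checkpoint *cp,
                          const struct access *remote_store)
{
  for (size_t i = 0; i < cp->nloads; i++)
    if (cp->loads[i].addr == remote_store->addr &&
        cp->loads[i].time > remote_store->time)
      return 1;
  return 0;
}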

Daya et al. [daya+14] have already implemented sequential consistency
in their 36-core research chip, with similar ideas (that inspired my
statement above) and much more detail (that makes it hard to see the
grand scheme of things IIRC).

@InProceedings{daya+14,
  author =       {Bhavya K. Daya and Chia-Hsin Owen Chen and Suvinay
                  Subramanian and Woo-Cheol Kwon and Sunghyun Park and
                  Tushar Krishna and Jim Holt and Anantha
                  P. Chandrakasan and Li-Shiuan Peh},
  title =        {{SCORPIO}: A 36-Core Research-Chip Demonstrating
                  Snoopy Coherence on a Scalable Mesh {NoC} with
                  In-Network Ordering},
  crossref =     {isca14},
  OPTpages =     {},
  url =          {http://projects.csail.mit.edu/wiki/pub/LSPgroup/PublicationList/scorpio_isca2014.pdf},
  annote =       {The cores on the chip described in this paper access
                  their shared memory in a sequentially consistent
                  manner; what's more, the chip provides a significant
                  speedup in comparison to the distributed directory
                  and HyperTransport coherence protocols.  The main
                  idea is to deal with the ordering separately from
                  the data, in a distributed way.  The ordering
                  messages are relatively small (one bit per core).
                  For details see the paper.}
}

@Proceedings{isca14,
  title =        {$41^\textit{st}$ Annual International Symposium on Computer Architecture},
  booktitle =    {$41^\textit{st}$ Annual International Symposium on Computer Architecture},
  year =         {2014},
  key =          {ISCA 2014},
}

>>> The operations themselves are not slow.
>>
>> Citation needed.
>
> A MEMBAR dropped into the pipeline, when nothing is speculative, takes
> no more time than an integer ADD. Only when there is speculation does
> it have to take time to relax the speculation.

Not sure what kind of speculation you mean here.  On in-order cores
like the non-Fujitsu SPARCs from before about 2010, memory barriers
are expensive AFAIK, even though there is essentially no branch
speculation on such cores.

Of course, if you mean speculation about the order of loads and
stores: yes, if you don't have such speculation, the memory barriers
are fast, but then loads are extremely slow.
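
To illustrate what such a barrier buys you in code, here is the
standard message-passing idiom written with C11 atomics (the fence
placement is mine, for illustration, and is not specific to any of the
machines discussed).  On TSO hardware these two fences typically
compile to nothing, while on a weakly ordered machine they become
exactly the MEMBAR/DMB-class instructions whose cost is at issue:

#include <stdatomic.h>

int payload;             /* ordinary data */
atomic_int flag;         /* 0 = not ready, 1 = payload is ready */

void producer(void)
{
  payload = 42;
  atomic_thread_fence(memory_order_release);   /* keep the payload store before the flag store */
  atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

int consumer(void)
{
  while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
    ;                                          /* spin until the flag is set */
  atomic_thread_fence(memory_order_acquire);   /* keep the payload load after the flag load */
  return payload;                              /* guaranteed to see 42 */
}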

>> Memory consistency is defined wrt what several processors do.  Some
>> processor performs some reads and writes and another performs some
>> reads and writes, and memory consistency defines what a processor sees
>> about what the other does, and what ends up in main memory.  But as
>> long as the processors, their caches, and their interconnect get the
>> memory ordering right, the main memory is just the backing store that
>> eventually gets a consistent result of what the other components did.
>> So it does not matter whether the main memory has one bank or 256.
>
> NEC SX is a multi-processor vector machine with the property that
> addresses are spewed out as fast as AGEN can perform. These addresses
> are routed to banks based on bus-segment and can arrive OoO wrt
> how they were spewed out.
>
> So two processors accessing the same memory using vector LDs will
> see a single vector having multiple memory orderings. P[0]V[0] ordered
> before P[1]V[0] but P[1]V[1] ordered before P[0]V[1], ...

As long as no stores happen, who cares about the order of the loads?
When stores happen, the loads are ordered wrt these stores (with
stronger memory orderings giving more guarantees).  So the number of
memory banks does not matter for implementing a strong ordering
efficiently.

The thinking about memory banks etc. comes when you approach the
problem from the other direction: You have some memory subsystem that
by itself gives you no consistency guarantees whatsoever, and then you
think about what's the minimum you can do to make it actually useful
for inter-core communication.  And then you write up a paper like

@TechReport{adve&gharachorloo95,
  author =       {Sarita V. Adve and Kourosh Gharachorloo},
  title =        {Shared Memory Consistency Models: A Tutorial},
  institution =  {Digital Western Research Lab},
  year =         {1995},
  type =         {WRL Research Report},
  number =       {95/7},
  annote =       {Gives an overview of architectural features of
                  shared-memory computers such as independent memory
                  banks and per-CPU caches, and how they make the (for
                  programmers) most natural consistency model hard to
                  implement, giving examples of programs that can fail
                  with weaker consistency models.  It then discusses
                  several categories of weaker consistency models and
                  actual consistency models in these categories, and
                  which ``safety net'' (e.g., memory barrier
                  instructions) programmers need to use to work around
                  the deficiencies of these models.  While the authors
                  recognize that programmers find it difficult to use
                  these safety nets correctly and efficiently, the
                  paper still advocates weaker consistency models, claiming
                  that sequential consistency is too inefficient, by
                  outlining an inefficient implementation (which is of
                  course no proof that no efficient implementation
                  exists).  Still the paper is a good introduction to
                  the issues involved.}
}

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
