Subject: Re: Memory ordering
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: 01. Aug 2024, 16:54:55
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Aug1.175455@mips.complang.tuwien.ac.at>
References: 1 2 3 4 5 6 7 8 9
User-Agent: xrn 10.11

mitchalsup@aol.com (MitchAlsup1) writes:
>On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:
>
>>mitchalsup@aol.com (MitchAlsup1) writes:
>>>The depth of the execution window may be smaller than the time it takes
>>>to send the required information around and have this core recognize
>>>that it is out-of-order wrt memory.
>>
>>So if we don't want to stall for memory accesses all the time, we need
>>a bigger execution window, either by making the reorder buffer larger,
>>or by using a different, cheaper mechanism.
>
>Mc 88120 had a 96-wide execution window, which could be filled up in
>16 cycles (optimistically) and was filled up in 32 cycles (average).
>Given that DRAM is not going to be less than 20 ns and a 5GHz core,
>the execution window is 1/3rd that which would be required to absorb
>a cache miss all the way to DRAM.

Relevant numbers for current cores are 400-600 instructions in the
reorder buffer, 6-8 instructions per cycle, and the core-to-core
latency is (for Bergamo) 30-40ns (90-120 cycles) within a CCX,
100-120ns (300-360 cycles) within a socket, with a 212ns (636 cycles)
worst case across sockets (data from
<https://chipsandcheese.com/2024/06/22/testing-amds-bergamo-zen-4c-spam/>);
I computed with a 3GHz clock rate, which fits Bergamo.  On the fast
desktop chips the clock rate is higher, but the latency is lower in
both ns and cycles (in particular, no dual-socket penalty); e.g., on
the Ryzen 7950X the core-to-core latency is <80ns (456 cycles)
<https://images.anandtech.com/doci/17585/AMD%20Ryzen%209%207950X%20Core%20to%20Core%20Latency%20Final.jpg>.

The reorder buffers (and integer register files and store buffers)
would need to be even larger to cover the 456 or 636 cycles (and maybe
even more latency than that is needed before you are sure that a load
or store is sequentially consistent), or alternatively one would need
a cheaper mechanism.
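
To put "even larger" into numbers, a back-of-the-envelope sketch of my
own, using the latencies and the 6-8 instructions per cycle quoted
above:

#include <stdio.h>

int main(void)
{
    /* core-to-core latencies from above, in cycles */
    int latency[] = {120, 360, 636, 456};
    const char *what[] = {"Bergamo, within CCX", "Bergamo, within socket",
                          "Bergamo, across sockets", "Ryzen 7950X"};
    for (int i = 0; i < 4; i++)
        printf("%-24s %3d cycles -> %4d..%4d instructions in flight\n",
               what[i], latency[i], 6 * latency[i], 8 * latency[i]);
    /* e.g. 636 cycles -> 3816..5088 instructions in flight, far beyond
       the 400-600 entries of current reorder buffers */
    return 0;
}
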
>>Concerning the cheaper mechanism, what I am thinking of is hardware
>>checkpointing every, say, 200 cycles or so (subject to fine-tuning).
>>The idea here is that communication between cores is very rare, so
>>rolling back more cycles than the minimal necessary amount costs
>>little on average (except that it looks bad on cache ping-pong
>>microbenchmarks).
>
>You lost me::
>
>Colloquially, there are 2 uses of the word checkpointing:: a) what
>HW does each time it inserts a branch into the EW, b) what an OS or
>application does to be able to recover from a crash (from any
>mechanism).

What is "EW"?

Anyway, here checkpointing would be a hardware mechanism that allows
rolling back to the state at the point when the checkpoint was made.
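
A very rough sketch in C of the mechanism I have in mind; the
structures, the 200-cycle interval, and the violation check are
placeholders for illustration, not a description of real hardware:

#include <stdbool.h>
#include <stdint.h>

#define CHECKPOINT_INTERVAL 200      /* cycles; subject to fine-tuning */

struct arch_state { uint64_t regs[32]; uint64_t pc; };

struct core {
    struct arch_state now;           /* current (speculative) state */
    struct arch_state saved;         /* state at the last checkpoint */
    uint64_t cycle, saved_cycle;
};

/* Stubs; a real model would drive these from the pipeline and the
   coherence fabric (the latter reporting when another core's accesses
   contradict the memory order this core speculated on). */
static void run_one_cycle(struct core *c) { c->now.pc += 4; }
static bool order_violation(const struct core *c) { (void)c; return false; }

static void step(struct core *c)
{
    if (c->cycle - c->saved_cycle >= CHECKPOINT_INTERVAL) {
        c->saved = c->now;           /* take a new checkpoint */
        c->saved_cycle = c->cycle;
    }
    run_one_cycle(c);
    if (order_violation(c)) {
        c->now = c->saved;           /* roll back and redo up to ~200
                                        cycles of work; cheap on average,
                                        because cross-core communication
                                        is rare */
        c->saved_cycle = c->cycle;
    }
    c->cycle++;
}

int main(void)
{
    struct core c = {0};
    for (int i = 0; i < 1000; i++)
        step(&c);
    return 0;
}
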
>A MEMBAR requires the memory order to catch up to the current point
>before adding new AGENs to the problem space. If the memory order
>is already SC then MEMBAR has nothing to do and is pushed through
>the pipeline without delay.

Yes, that's the slow implementation.  The fast implementation is to
implement sequential consistency all the time (by predicting and
speculating that memory accesses do not interfere with those of other
cores, and recovering when that speculation turns out to be wrong).
In such an implementation memory barriers are no-ops (and thus fast),
because the hardware already provides sequential consistency.
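
For illustration, the classic store-buffering pattern in C11 atomics
(a sketch of my own, with the seq_cst fences standing in for MEMBAR).
Sequential consistency forbids the outcome r1==0 && r2==0; a core that
already speculates SC gets that for free, so the fences have nothing
left to do:

#include <stdatomic.h>
#include <stdio.h>
#include <threads.h>

atomic_int x = 0, y = 0;
int r1, r2;

int t0(void *arg)
{
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* the MEMBAR */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
    return 0;
}

int t1(void *arg)
{
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);   /* the MEMBAR */
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
    return 0;
}

int main(void)
{
    thrd_t a, b;
    thrd_create(&a, t0, NULL);
    thrd_create(&b, t1, NULL);
    thrd_join(a, NULL);
    thrd_join(b, NULL);
    /* With the fences (or with seq_cst accesses) r1 == 0 && r2 == 0
       cannot happen; without them, a weakly ordered machine may produce
       it, because each store can still sit in a store buffer when the
       other core's load executes. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
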
>Then consider 2 Vector processors performing 2 STs (1 each) to
>non-overlapping addresses but with bank aliasing. Consider that
>the STs are scatter based and the bank conflicts random. There
>is no way to determine which store happened first or which
>element of each vector store happened first.

It's up to the architecture to define the order of stores and loads of
a given core. For sequential consistency you then interleave the
sequences coming from the cores in some convenient order. It does not
matter what happens earlier in some inertial system. It only matters
what your hardware decides should be treated as being earlier. The
hardware has a lot of freedom here, but the end result as visible to
the cores must be sequentially consistent (or, with a weaker memory
consistency model, consistent with that model).
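
To illustrate the freedom the hardware has: say core 0 performs stores
A then B (two elements of its scatter; the labels are purely for
illustration) and core 1 performs C then D.  Any global order that
keeps A before B and C before D is acceptable under SC; this little
program of mine enumerates the hardware's choices:

#include <stdio.h>
#include <string.h>

static const char ev[4] = {'A', 'B', 'C', 'D'};

static int before(const char *order, char first, char second)
{
    return strchr(order, first) < strchr(order, second);
}

static void emit(char *order, int depth, unsigned used)
{
    if (depth == 4) {
        order[4] = '\0';
        /* keep only global orders that respect each core's program
           order: A before B (core 0) and C before D (core 1) */
        if (before(order, 'A', 'B') && before(order, 'C', 'D'))
            printf("%s\n", order);
        return;
    }
    for (int i = 0; i < 4; i++)
        if (!(used & (1u << i))) {
            order[depth] = ev[i];
            emit(order, depth + 1, used | (1u << i));
        }
}

int main(void)
{
    char order[5];
    emit(order, 0, 0);   /* prints 6 of the 24 permutations */
    return 0;
}
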
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>