On Tue, 30 Jul 2024 9:51:46 +0000, Anton Ertl wrote:
>mitchalsup@aol.com (MitchAlsup1) writes:
>>On Mon, 29 Jul 2024 13:21:10 +0000, Anton Ertl wrote:
>>>A problem with that approach is that this requires enough reorder
>>>buffering (or something equivalent, there may be something cheaper for
>>>this particular problem) to cover at least the shared-cache latency
>>>(usually L3, more with multiple sockets).
>>The depth of the execution window may be smaller than the time it takes
>>to send the required information around and have this core recognize
>>that it is out-of-order wrt memory.
>So if we don't want to stall for memory accesses all the time, we need
>a bigger execution window, either by making the reorder buffer larger,
>or by using a different, cheaper mechanism.

The Mc 88120 had a 96-wide execution window, which could be filled up in
16 cycles (optimistically) and was filled up in 32 cycles (average).
Given that DRAM is not going to take less than 20 ns, and a 5 GHz core,
the execution window is 1/3rd of what would be required to absorb
a cache miss all the way to DRAM. Add in 12-ish cycles for L2, 10
cycles for transporting the L2 miss to the memory controller, 5 cycles
between the memory controller and the DRAM controller--and sooner or
later it gets hard. So nobody tries to make the execution window big
enough to do that. (Actually, the Mc 88120 was built in ECL bus
technology and ran at only 100 MHz, so its puny 16-cycle EW really was
big enough to absorb a cache miss all the way to DRAM.....but that is
for another day.)
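
For concreteness, here is that arithmetic as a trivial C program,
using the cycle counts above (the 1/3rd figure counts DRAM alone;
the extra hops push it toward 1/4):

#include <stdio.h>

int main(void)
{
    double ghz      = 5.0;   /* core clock, GHz                        */
    double dram_ns  = 20.0;  /* raw DRAM access time, optimistic       */
    int l2_cycles   = 12;    /* L2 lookup                              */
    int l2_to_mc    = 10;    /* L2 miss -> memory controller transport */
    int mc_to_dramc = 5;     /* memory controller -> DRAM controller   */
    int ew_fill     = 32;    /* average EW fill time, cycles           */

    double dram_cycles = dram_ns * ghz;  /* 20 ns * 5 GHz = 100 cycles */
    double total = dram_cycles + l2_cycles + l2_to_mc + mc_to_dramc;

    printf("DRAM alone: %.0f cycles; EW covers 1/%.1f of it\n",
           dram_cycles, dram_cycles / ew_fill);
    printf("full miss path: %.0f cycles; EW covers 1/%.1f of it\n",
           total, total / ew_fill);
    return 0;
}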
>Concerning the cheaper mechanism, what I am thinking of is hardware
>checkpointing every, say, 200 cycles or so (subject to fine-tuning).
>The idea here is that communication between cores is very rare, so
>rolling back more cycles than the minimal necessary amount costs
>little on average (except that it looks bad on cache ping-pong
>microbenchmarks).

You lost me::
Colloquially, there are 2 uses of the word checkpointing:: a) what
HW does each time it inserts a branch into the EW, b) what an OS or
application does to be able to recover from a crash (from any
mechanism).
Neither is used to describe interactions between cores.
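
As a rough software analogy of what Anton seems to be proposing (all
names here are invented for illustration, not any real design): take
a full snapshot every ~200 cycles, and on a cross-core conflict roll
back to the last snapshot rather than to the exact instruction:

#define CHECKPOINT_INTERVAL 200    /* cycles, "subject to fine-tuning" */

struct arch_state {
    unsigned long long regs[32];
    unsigned long long pc;
};

struct checkpoint {
    unsigned long long cycle;      /* when the snapshot was taken */
    struct arch_state  state;      /* full architectural state    */
};

static struct checkpoint last_cp;

/* Called every cycle: take a fresh snapshot every ~200 cycles. */
void cycle_tick(unsigned long long now, const struct arch_state *cur)
{
    if (now - last_cp.cycle >= CHECKPOINT_INTERVAL) {
        last_cp.cycle = now;
        last_cp.state = *cur;
    }
}

/* Called when another core's traffic shows this core is out of order
   wrt memory: roll back to the last snapshot.  Cheap on average if
   cross-core communication is rare; bad on cache ping-pong. */
void on_remote_conflict(struct arch_state *cur)
{
    *cur = last_cp.state;
}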
<snip>
>>>>The operations themselves are not slow.
>>>Citation needed.
>>A MEMBAR dropped into the pipeline, when nothing is speculative, takes
>>no more time than an integer ADD. Only when there is speculation does
>>it have to take time to relax the speculation.
>Not sure what kind of speculation you mean here. On in-order cores
>like the non-Fujitsu SPARCs from before about 2010, memory barriers
>are expensive AFAIK, even though there is essentially no branch
>speculation on in-order cores.

You dropped 64 instructions into the EW, and AGEN performs 15 address
generations in the order permitted by operand arrival. These addresses
are routed to the L1s to determine who hits and who misses--all OoO.
Thus, the addresses are only in operand order and they touched the
caches in operand order.
There could also be branch mispredictions in the EW, causing many of
the AGENs to get thrown away after the branch is discovered to be
poorly predicted.
And on top of all of this, several FP instructions may have raised
exceptions.
EW pipelines are generally designed to "sort all this stuff out at
retirement"; occasionally, memory ordering issues are sorted out
prior to retirement by replaying OoO memory references.
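
In toy form, that retirement-time check might look something like
this (invented structures, a sketch of the general idea rather than
any real machine's load/store queue):

#include <stdbool.h>
#include <stddef.h>

struct mem_op {
    bool          is_store;
    bool          executed;  /* already touched the cache?    */
    unsigned long addr;
    unsigned long seq;       /* program order within the EW   */
};

/* Before a store retires: did a younger load to the same address
   already execute ahead of it in operand order?  If so, that load
   read stale data and must be replayed. */
bool store_needs_load_replay(const struct mem_op *win, size_t n,
                             const struct mem_op *st)
{
    for (size_t i = 0; i < n; i++)
        if (!win[i].is_store && win[i].executed &&
            win[i].seq > st->seq && win[i].addr == st->addr)
            return true;     /* replay win[i] and younger memory ops */
    return false;
}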
>Of course, if you mean speculation about the order of loads and
>stores, yes, if you don't have such speculation, the memory barriers
>are fast, but then loads are extremely slow.
A MEMBAR requires the memory order to catch up to the current point
before adding new AGENs to the problem space. If the memory order
is already SC then MEMBAR has nothing to do and is pushed through
the pipeline without delay.
So, the delay has to do with catching up with memory order, not with
pushing the MEMBAR through the pipeline.
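
Seen from software, the same point can be illustrated with standard
C11 fences (this example is mine, not from the original posts): in
the message-passing idiom below, the producer's fence only has to
wait until earlier accesses are properly ordered, so when the memory
order has already caught up it costs next to nothing.

#include <stdatomic.h>

int data;
atomic_int flag;

void producer(void)
{
    data = 42;                                  /* plain store       */
    atomic_thread_fence(memory_order_release);  /* the "MEMBAR"      */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

int consumer(void)
{
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                       /* spin on the flag  */
    atomic_thread_fence(memory_order_acquire);  /* pairs with above  */
    return data;                                /* guaranteed 42     */
}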