Subject: Re: 88xxx or PPC
From: paaronclayton (at) *nospam* gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Date: 26 May 2024, 21:36:54
Organization: A noiseless patient Spider
Message-ID: <v306h9$3iqbj$1@dont-email.me>
References: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.0
On 5/25/24 11:14 PM, MitchAlsup1 wrote:
Paul A. Clayton wrote:
On 3/8/24 11:14 PM, MitchAlsup1 wrote:
Even with My 66000's variable length instructions, most (by
frequency of occurrence) 32-bit immediates would be illegal
instructions and more significant 32-bit words in 64-bit
immediates would usually be illegal instructions, so one could
probably have highly accurate speculative predecode-on-fill.
Since the variable length decoder is only 32 gates (equivalent in
size to 3 1-bit flip-flops) one can simply attach said decoder
to every word of storage in the instruction buffer. And arrange
a tree of "If I get picked, here are my follow on instructions"
Now, once one has a unary pointer into the IB, one gets 2 inst
in 1 gate of delay, 4 in 2 gates, 8 in 3 gates,...until you
get eaten alive with wire delay.
Thus, if length decoding is easy, predecoding (into some kind of
table) is unnecessary.
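A software analogy of the "here are my follow on instructions" tree described above, as a minimal sketch only. It assumes every word already has a length from its own per-word decoder (words that are really immediates get a meaningless length, which is harmless because they are never selected). The length values and the function name instruction_starts are hypothetical and are not My 66000's encoding. Pointer doubling finds 2 instruction starts after one round, 4 after two, 8 after three.

```python
def instruction_starts(lengths, entry=0):
    """Mark which words begin an instruction, starting from a known entry word.

    lengths[i] is the length in words that word i would have IF it were the
    first word of an instruction (garbage for words that are immediates).
    """
    n = len(lengths)
    # jump[i]: the word reached from word i; index n is a self-pointing sentinel.
    jump = [min(i + max(1, lengths[i]), n) for i in range(n)] + [n]
    is_start = [False] * (n + 1)
    is_start[entry] = True

    span = 1
    while span < n + 1:
        # Every word already known to be a start marks the word it points at.
        marked = is_start[:]
        for i in range(n + 1):
            if is_start[i]:
                marked[jump[i]] = True
        is_start = marked
        # Double the reach of every pointer (the "tree" step).
        jump = [jump[jump[i]] for i in range(n + 1)]
        span *= 2
    return is_start[:n]


# Words 2, 4 and 5 are immediates belonging to the instructions at 1 and 3.
print(instruction_starts([1, 2, 1, 3, 1, 1, 1, 1]))
```

The loop bound is a worst-case software bound; hardware would stop at the width of the instruction buffer, and wire delay rather than round count would be the practical limit.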
Predecoding is not solely about parsing the instruction stream.
The example I gave was for quickly special-casing loads from a
stack pointer or global pointer to allow such to have very low
latency by executing early. (Caches indexed by immediate offsets
do not require register read much less address generation.)
One type of predecode that has been commercially implemented (in a
POWER processor) was storing calculated branch insets rather than
offsets. This is very similar to an inlined BTB entry: if the base
virtual address was different in the lower 18 bits (16-bit
immediate with 4-byte word addressing for instructions), the inset
would be wrong, i.e., this is somewhat speculative.
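To make the speculative nature concrete, here is a small sketch of the idea, not the actual POWER implementation. It assumes a signed 16-bit word offset and 4-byte instructions, so the precomputed field covers the low 18 bits of the target. All names (predecode_inset, target_from_inset) are hypothetical, and carries out of the low 18 bits are deliberately ignored.

```python
INSET_BITS = 18                      # 16-bit immediate * 4-byte words
INSET_MASK = (1 << INSET_BITS) - 1


def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def true_target(branch_addr, imm16):
    """Architectural target: branch address plus scaled signed offset."""
    return branch_addr + sign_extend(imm16, 16) * 4


def predecode_inset(fill_addr, imm16):
    """At cache fill: precompute the low 18 bits of the target."""
    return true_target(fill_addr, imm16) & INSET_MASK


def target_from_inset(fetch_addr, inset):
    """At fetch: splice the stored low bits under the fetch address's high bits."""
    return (fetch_addr & ~INSET_MASK) | inset


inset = predecode_inset(0x41000, 0x0010)         # filled via address 0x41000
same_low = target_from_inset(0x841000, inset)    # alias, same low 18 bits
diff_low = target_from_inset(0x43000, inset)     # alias, different low 18 bits
print(hex(same_low), same_low == true_target(0x841000, 0x0010))
print(hex(diff_low), diff_low == true_target(0x43000, 0x0010))
```

The second case is the misprediction described above: the block was predecoded under one virtual address and fetched under another whose low 18 bits differ, so the stored bits no longer match the real target and a later stage has to catch it.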
If branch prediction fetch ahead used instruction addresses
(rather than cache block addresses), a valid target prediction
would provide accurate predecode for the following instructions
and constrain the possible decodings for preceding instructions.
Mistakes in predecode that mistook an immediate 32-bit word for an
opcode-containing word might not be particularly painful.
Now when these are masked out by the actual decode selection tree.
Mistakenly "finding" a branch in predecode might not be that
painful even if predicted taken — similar to a false BTB hit
corrected in decode. Wrongly "finding" an optimizable load
instruction might waste resources and introduce a minor glitch in
decode (where the "instruction" has to be retranslated into an
immediate component).
It *feels* attractive to me to have predecode fill a BTB-like
structure to reduce redundant data storage. Filling the "BTB" with
less critical instruction data when there are few (immediate-
based) branches seems less hurtful than losing some taken branch
targets, though a parallel ordinary BTB (redundant storage) might
compensate. The BTB-like structure might hold more diverse
information that could benefit from early availability; e.g.,
loads from something like a "Knapsack Cache". (Even loads from a
more variable base might be sped by having a future file of two or
three such base addresses — or even just the least significant
bits — which could be accessed more quickly and earlier than the
general register file. Bases that are changed frequently with
dynamic values [not immediate addition] would rarely update the
future file fast enough to be useful. I think some x86
implementations did something similar by adding segment base and
displacement early in the pipeline.) More generally, it seems that
the instruction stream could be parsed and stored into components
with different tradeoffs in latency, capacity, etc.
I do not know if such "aggressive" predecode would be worthwhile
nor what in-memory format would best manage the tradeoffs of
density, parallelism, criticality, etc. or what "L1 cache" format
would be best (with added specialization/utilization tradeoffs).
It is a trade-off: in a GBOoO design, adding a pipe stage costs
around 2% (in an LBIO design around 5%), so the predictor has to
buy more than 2% to "make the cut". It definitely would not make
the cut in the LBIO design; it may or may not make the cut in a
GBOoO design. What we can say is that the GBOoO design has to have
some kind of branch prediction, and not go so far as to assign it a
name or a class.
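The break-even reasoning above can be put into a few lines of arithmetic. The 2% and 5% stage costs are the figures quoted above; the 3% gain and the multiplicative model are assumptions made only for illustration.

```python
# Cost of one extra pipe stage, as quoted above.
STAGE_COST = {"GBOoO": 0.02, "LBIO": 0.05}


def net_speedup(design, gain):
    """Relative performance after paying for the extra stage.

    Assumes the gain and the stage cost combine multiplicatively.
    """
    return (1.0 + gain) * (1.0 - STAGE_COST[design])


def makes_the_cut(design, gain):
    """True if the feature more than pays for its pipe stage."""
    return net_speedup(design, gain) > 1.0


for design in STAGE_COST:
    # 0.03 is a hypothetical gain, not a measured one.
    print(design, round(net_speedup(design, 0.03), 4), makes_the_cut(design, 0.03))
```

Under these assumptions a feature worth 3% survives in the GBOoO design and loses in the LBIO design.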
Adding a pipeline stage to cache refill would presumably be less
expensive. Not only is an L2 access already somewhat slow, but the
expectation is that L1 hits will be the common case.
There might also be uses for storing additional information in a
loop buffer or in instruction scheduling hardware to simplify
processing. For such loop optimizations, part of the difficulty
seems to be in deciding when to spend a little more time/energy
that modestly helps each iteration; for high iteration counts
such might be substantially beneficial but an actual loss for low
count loops. At the level of design effort, time, and risk, there
is a similar utilization/fractional benefit aspect.
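The loop trade-off amounts to amortizing a one-time cost over the trip count. A minimal sketch, with cycle counts that are purely hypothetical and a simple linear cost model assumed:

```python
def net_cycles_saved(iterations, setup_cost, saving_per_iteration):
    """Cycles gained (positive) or lost (negative) by doing the extra work once."""
    return iterations * saving_per_iteration - setup_cost


def break_even_iterations(setup_cost, saving_per_iteration):
    """Smallest trip count at which the extra work is a strict net win."""
    return setup_cost // saving_per_iteration + 1


# Hypothetical numbers: 20 cycles of extra work up front, 2 cycles saved per iteration.
print(break_even_iterations(20, 2))
print(net_cycles_saved(4, 20, 2), net_cycles_saved(1000, 20, 2))
```

The hard part in hardware is that the trip count is usually not known when the decision has to be made, so the count would itself have to be predicted.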