Sujet : Re: 88xxx or PPC
De : mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Groupes : comp.archDate : 26. May 2024, 04:14:30
Autres entêtes
Organisation : Rocksolid Light
Message-ID : <fb85aae33e177f015053da6c6470aa23@www.novabbs.org>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
User-Agent : Rocksolid Light
Paul A. Clayton wrote:
On 3/8/24 11:14 PM, MitchAlsup1 wrote:
Even with My 66000's variable length instructions, most (by
frequency of occurrence) 32-bit immediates would be illegal
instructions and more significant 32-bit words in 64-bit
immediates would usually be illegal instructions, so one could
probably have highly accurate speculative predecode-on-fill.
Since the variable length decoder is only 32 gates (equivalent in
size to 3 1-bit flip-flops) one can simply attach said decoder
to every word of storage in the instruction buffer. And arrange
a tree of "If I get picked, here are my follow on instructions"
Now, once one has a unary pointer into the IB, one gets 2 inst
in 1 gate of delay, 4 in 2 gates, 8 in 3 gates,...until you
get eaten alive with wire delay.
Thus, if length decoding is easy, predecoding (into some kind of
able) is unnecessary.
If branch prediction fetch ahead used instruction addresses
(rather than cache block addresses), a valid target prediction
would provide accurate predecode for the following instructions
and constrain the possible decodings for preceding instructions.
Mistakes in predecode that mistook an immediate 32-bit word for an
opcode-containing word might not be particularly painful.
Now when these are mask out by the actual decode selection tree.
Mistakenly "finding" a branch in predecode might not be that
painful even if predicted taken — similar to a false BTB hit
corrected in decode. Wrongly "finding" an optimizable load
instruction might waste resources and introduce a minor glitch in
decode (where the "instruction" has to be retranslated into an
immediate component).
It *feels* attractive to me to have predecode fill a BTB-like
structure to reduce redundant data storage. Filling the "BTB" with
less critical instruction data when there are few (immediate-
based) branches seems less hurtful than losing some taken branch
targets, though a parallel ordinary BTB (redundant storage) might
compensate. The BTB-like structure might hold more diverse
information that could benefit from early availability; e.g.,
loads from something like a "Knapsack Cache". (Even loads from a
more variable base might be sped by having a future file of two or
three such base addresses — or even just the least significant
bits — which could be accessed more quickly and earlier than the
general register file. Bases that are changed frequently with
dynamic values [not immediate addition] would rarely update the
future file fast enough to be useful. I think some x86
implementations did something similar by adding segment base and
displacement early in the pipeline.) More generally, it seems that
the instruction stream could be parsed and stored into components
with different tradeoffs in latency, capacity, etc.
I do not know if such "aggressive" predecode would be worthwhile
nor what in-memory format would best manage the tradeoffs of
density, parallelism, criticality, etc. or what "L1 cache" format
would be best (with added specialization/utilization tradeoffs).
It is a trade-off:: in a GBOoO design, adding a pipe stage cost around 2% (in an LBIO design around 5%) so the predictor has to
buy more than 2% to "make the cut". It definitely would not make
cut in the LBIO design, it may or may not make the cut in a GBOoO design. What we can say is: that the GBOoO design has to have some
kind of branch prediction and not go so far as to assign is a name
or a class.