Re: 88xxx or PPC

Subject: Re: 88xxx or PPC
From: paaronclayton (at) *nospam* gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Date: 20 Apr 2024, 23:10:27
Organization: A noiseless patient Spider
Message-ID: <v038qn$bmtm$2@dont-email.me>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.0

On 3/8/24 11:14 PM, MitchAlsup1 wrote:
> Paul A. Clayton wrote:
[snip interesting physical design details of SRAM/register storage]

>> As noted later, memory accesses can also be indexed by a fixed bit
>> pattern in the instruction. Determining whether a register ID bit
>> field is actually used may well require less decoding than
>> determining if an operation is a load based on stack pointer or
>> global pointer with an immediate offset, but the difference would
>> not seem to be that great. The offset size would probably also
>> have to be checked; the special cache would be unlikely to
>> support all offsets.
>>
>> Predecoding on insertion into the instruction cache could cache
>> this usage information.
>
> You cannot predecode if the instruction is not of fixed size, (or
> if you do not add predecode bits ala Athlon, Opteron).

One can have variable length instructions and predecoding on fill
if one uses instruction bundles.

Heidi Pan's "Heads and Tails" ("High Performance, Variable-Length
Instruction Encodings", Master's Thesis, 2002) uses fixed length
instruction components ("heads") filling from one end of the
bundle toward the middle and variable length components ("tails")
filling from the other end. This design intentionally disallowed
instructions crossing a bundle boundary and was primarily intended
for code density with parallel decode.
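
A minimal sketch of why such a bundle parses in parallel, under a
made-up byte-level layout (the 16-byte bundle, the count byte, the
2-byte heads and the 2-bit tail-length field are all assumptions for
illustration, not the encoding from the thesis). Every head sits at
a fixed position, and each tail position is a running sum of tail
lengths taken from the heads:

```python
BUNDLE_BYTES = 16   # hypothetical bundle size
HEAD_BYTES = 2      # hypothetical fixed head size


def parse_bundle(bundle: bytes):
    """Split one bundle into (head, tail) pairs.

    Assumed layout (illustrative only):
      byte 0          : instruction count n
      bytes 1..2n     : n fixed-size heads, growing toward the middle
      end of bundle   : tails, growing backward toward the middle
      low 2 bits of the first head byte: tail length in bytes (0..3)
    """
    if len(bundle) != BUNDLE_BYTES:
        raise ValueError("wrong bundle size")
    n = bundle[0]
    if 1 + n * HEAD_BYTES > BUNDLE_BYTES:
        raise ValueError("too many heads for one bundle")
    # Head i is at a fixed offset, so all heads can be located at once.
    heads = [bundle[1 + i * HEAD_BYTES: 1 + (i + 1) * HEAD_BYTES]
             for i in range(n)]
    tail_lens = [h[0] & 0b11 for h in heads]
    if 1 + n * HEAD_BYTES + sum(tail_lens) > BUNDLE_BYTES:
        raise ValueError("heads and tails overlap")
    insts = []
    end = BUNDLE_BYTES
    for head, tlen in zip(heads, tail_lens):
        start = end - tlen          # tail i ends where tail i-1 began
        insts.append((head, bundle[start:end]))
        end = start
    return insts
```

Nothing crosses the bundle boundary here, which is the property that
makes predecode on fill straightforward in this kind of format.
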

A more complex arrangement of bits than in "Heads and Tails", with
support for splitting immediate bits across bundle boundaries,
could remove some of the code density penalty of "Heads and Tails"
while still supporting predecode on fill. The bundling only needs
to provide the ability to parse the bundle into instructions with
reasonable parallelism. For some uses, failing to special-case
some operations via predecode would not be problematic; those
instances might "merely" be unoptimized on the first execution
(after final decode in the first fetch, the predecoded form could
be updated, at some complexity cost).

Borrowing immediate bits across a bundle boundary seems to require
some additional information, perhaps a pseudo-instruction that
joins an immediate field with the immediate part from the previous
bundle. This would reduce code density. In a "Heads and
Tails"-like scheme, unused bits in the middle might be
automatically appended to the first immediate in the next
instruction.

(I seem to recall that there was an ISA that sacrificed half the
opcode space to provide variable-sized immediates. The first bit
of a parcel indicated whether it was an immediate to be patched
together or an operation and register operands. Such an encoding
is similar to the x86 instruction boundary marker bits.)
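
A rough sketch of that kind of parcel encoding, with every detail
assumed for illustration (16-bit parcels, the top bit as the
immediate flag, immediate parcels following their operation, most
significant fragment first). It is not a reconstruction of the
half-remembered ISA:

```python
IMM_FLAG = 0x8000   # hypothetical: top bit set marks an immediate parcel


def group_parcels(parcels):
    """Group 16-bit parcels into (operation, immediate) pairs.

    Assumption: an operation parcel (top bit clear) is followed by zero
    or more immediate parcels, each contributing 15 payload bits, most
    significant fragment first. The immediate is None when no immediate
    parcel follows the operation.
    """
    insts = []
    for p in parcels:
        if p & IMM_FLAG:
            if not insts:
                raise ValueError("immediate parcel with no operation")
            op, imm = insts[-1]
            insts[-1] = (op, ((imm or 0) << 15) | (p & 0x7FFF))
        else:
            insts.append((p, None))
    return insts


def boundaries(parcels):
    """Instruction starts, found from each parcel's top bit alone."""
    return [i for i, p in enumerate(parcels) if not p & IMM_FLAG]
```

`boundaries` looks at each parcel independently, which is where the
resemblance to instruction boundary marker bits comes from: the cost
is one bit per parcel, here half of the 16-bit encoding space.
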

Even with My 66000's variable length instructions, most (by
frequency of occurrence) 32-bit immediates would decode as illegal
instructions, and the more significant 32-bit words of 64-bit
immediates would usually decode as illegal instructions, so one
could probably have highly accurate speculative predecode-on-fill.

If branch prediction fetch ahead used instruction addresses
(rather than cache block addresses), a valid target prediction
would provide accurate predecode for the following instructions
and constrain the possible decodings for preceding instructions.
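
The two ideas above can be sketched together. The opcode field
position, the set of legal major opcodes and the table of trailing
immediate words below are invented for illustration and are not the
My 66000 encoding. `predecode_line` is the speculative guess made on
fill; `confirm_from` is the exact walk that becomes possible once a
known instruction start (such as a predicted branch target) is
available:

```python
LEGAL_MAJOR = {0x01, 0x02, 0x09, 0x1E}   # made-up legal major opcodes
IMM_WORDS = {0x09: 1, 0x1E: 2}           # made-up: trailing 32-bit immediates


def major(word):
    """Hypothetical major opcode: top 6 bits of a 32-bit word."""
    return (word >> 26) & 0x3F


def predecode_line(words):
    """Guess, per 32-bit word, whether it could start an instruction.

    An immediate word whose top bits happen to match a legal opcode is
    a false positive; the guess is speculative by construction.
    """
    return [major(w) in LEGAL_MAJOR for w in words]


def confirm_from(words, start):
    """Exact instruction starts from a known-good start index onward."""
    starts = []
    i = start
    while i < len(words):
        starts.append(i)
        i += 1 + IMM_WORDS.get(major(words[i]), 0)
    return starts
```

In this toy encoding the guess is wrong only when an immediate word
carries one of the four legal opcode patterns in its top six bits.
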

Mistakes in predecode that mistook an immediate 32-bit word for an
opcode-containing word might not be particularly painful.
Mistakenly "finding" a branch in predecode might not be that
painful even if predicted taken — similar to a false BTB hit
corrected in decode. Wrongly "finding" an optimizable load
instruction might waste resources and introduce a minor glitch in
decode (where the "instruction" has to be retranslated into an
immediate component).

It *feels* attractive to me to have predecode fill a BTB-like
structure to reduce redundant data storage. Filling the "BTB" with
less critical instruction data when there are few (immediate-
based) branches seems less hurtful than losing some taken branch
targets, though a parallel ordinary BTB (redundant storage) might
compensate. The BTB-like structure might hold more diverse
information that could benefit from early availability; e.g.,
loads from something like a "Knapsack Cache". (Even loads from a
more variable base might be sped up by having a future file of two
or three such base addresses, or even just their least significant
bits, which could be accessed more quickly and earlier than the
general register file. Bases that are changed frequently with
dynamic values [not immediate addition] would rarely update the
future file fast enough to be useful. I think some x86
implementations did something similar by adding segment base and
displacement early in the pipeline.) More generally, it seems that
the instruction stream could be parsed and stored into components
with different tradeoffs in latency, capacity, etc.
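
A small model of the future-file idea from the parenthetical above,
tracking only the low bits of a few base registers. The class name,
the 12-bit width and the register names are assumptions chosen for
illustration, not a description of any real implementation:

```python
LOW_BITS = 12                    # hypothetical number of bits tracked
MASK = (1 << LOW_BITS) - 1


class FutureFile:
    """Early copies of the low bits of a few base registers."""

    def __init__(self, tracked):
        # None means the early copy is not currently known.
        self.low = {reg: None for reg in tracked}

    def set_value(self, reg, value):
        """Refresh the early copy once the full value is available."""
        if reg in self.low:
            self.low[reg] = value & MASK

    def add_immediate(self, reg, imm):
        """Immediate updates are known at decode, so the copy stays valid."""
        if self.low.get(reg) is not None:
            self.low[reg] = (self.low[reg] + imm) & MASK

    def dynamic_write(self, reg):
        """A value produced at execute makes the early copy stale."""
        if reg in self.low:
            self.low[reg] = None

    def early_low_address(self, reg, disp):
        """Low address bits of base + displacement, or None if unknown."""
        base = self.low.get(reg)
        return None if base is None else (base + disp) & MASK
```

A base that is mostly adjusted by immediates keeps a usable early
copy, while a base that keeps receiving dynamic values spends most
of its time in the `None` state, matching the caveat in the text.
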

I do not know if such "aggressive" predecode would be worthwhile
nor what in-memory format would best manage the tradeoffs of
density, parallelism, criticality, etc. or what "L1 cache" format
would be best (with added specialization/utilization tradeoffs).
