Subject : Re: Instruction Tracing
From : mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups : comp.arch
Date : 11. Aug 2024, 22:09:02
Organization : Rocksolid Light
Message-ID : <84999afd1377326f1e5e96040c46b992@www.novabbs.org>
References : 1 2 3 4
User-Agent : Rocksolid Light
On Sun, 11 Aug 2024 14:44:38 +0000, Anton Ertl wrote:
John Levine <johnl@taugh.com> writes:
As far as the delayed branches and such, they made sense in the narrow
time window when it was too expensive to put a cache on a workstation
but that time came and went by the time the RT shipped.
>
Delayed branches were put in the first commercial generation of RISCs
(except ARM), which all shipped with caches (except ARM). Delayed
branches are a natural consequence of the 5-stage (or, in the 88100
case, four-stage) pipeline.
Delayed branches are wonderful for the pipeline, but much less so for
the architecture overall, as they make wide issue "all that much harder".
It was truly a pain in the ass on the Mc88120, a 6-wide machine.
Neither nullification nor inverse nullification helped much, and both
hurt at wide issue, too. At least the Mc88100 had a bit to indicate
that the delay slot was not being used.
Looking back, I wish we had not been forced to do them--I think many
of the 1st generation architects wish similarly. Delayed branches
were supposed to bring a 16% gain in performance. After looking at
the utility rates (slightly less than 50% useful instructions, with
a fill rate slightly over 70%), they only brought 8%-ish.
{{A useful instruction is one that is useful in both the taken and
non-taken paths.}}
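A back-of-the-envelope sketch of that arithmetic, assuming the gain scales linearly with the fraction of delay slots that hold a useful instruction. The function name and the linear model are illustrative assumptions, not a measured result; the roughly 70% fill rate is not used, since under this model a filled slot that is not useful on both paths contributes nothing.

```python
def realized_gain(projected_gain, useful_fraction):
    """Scale the projected speedup by the fraction of useful delay slots.

    Assumption: a linear model in which a slot that is empty, or filled
    with an instruction that is not useful on both paths, adds no gain.
    """
    return projected_gain * useful_fraction


projected = 0.16  # gain expected if every delay slot did useful work
useful = 0.50     # "slightly less than 50%" useful instructions
print(f"{realized_gain(projected, useful):.0%}")  # prints 8%
```

Under that assumption, half of the slots being useful accounts for the drop from 16% to about 8%.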
IIRC ARM used a 3-stage implementation for the ARM1/2, which may be a
consequence of them rejecting delayed branches; and they did not have
caches, so they could not have made use of the higher clock rate that
a longer pipeline could have afforded. So it seems that the connection
between cache and delayed branches, if there is any, is the opposite
of what you suggest.
>
Delayed branches provided a speedup on these early 5-stage
implementations. They also provided a big headache for more
sophisticated implementations, and therefore soon fell out of favour.
Much like virtual caches...
The only thing that has persisted is LDs being longer than 2 cycles.
Squashing {forward, ADD, SRAM, LDalign} into 2 cycles is proving
to be a frequency headache in the simpler RISC-V implementations
even now. With wires getting slower and gates getting faster, that
trade-off is getting worse. Many of the Intel x86s use 4-cycle LDs.
{the cost of frequency is efficiency}
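A minimal sketch of why the number of LD cycles constrains frequency, under an idealized model: the load path can be split evenly across its cycles, latch overhead is ignored, and the clock period is set by the slowest stage. All delay values are hypothetical numbers in arbitrary units, chosen only for illustration, and `min_clock_period` is a made-up helper name.

```python
def min_clock_period(load_path_delays, load_cycles, other_stage_delay):
    """Smallest clock period that fits the load path into `load_cycles`.

    Idealized assumption: the load path divides evenly across its cycles,
    and the period can never be shorter than the slowest other stage.
    """
    per_cycle = sum(load_path_delays) / load_cycles
    return max(per_cycle, other_stage_delay)


# Hypothetical delays (arbitrary units) for forward, ADD, SRAM, LDalign.
load_path = [1.0, 3.0, 5.0, 1.0]
other = 4.0  # hypothetical delay of the slowest non-load stage

for cycles in (2, 3, 4):
    print(cycles, min_clock_period(load_path, cycles, other))
```

With these made-up numbers the 2-cycle LD sets the clock period, while a 3- or 4-cycle LD lets the other stages set it, at the price of extra load-to-use latency.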
Power (IIRC) and Alpha don't have delayed branches.
None of the modern RISCs have them either.
- anton