Subject : Re: Efficiency of in-order vs. OoO
From : mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups : comp.arch
Date : 13. Mar 2024, 20:14:51
Organization : Rocksolid Light
Message-ID : <9d88513f17e6b6aef01eff6130409cbd@www.novabbs.org>
User-Agent : Rocksolid Light
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
>
My concern is that the circuit for doing this could be pretty complicated.
>
Essentially equal in complexity to an IO retirement µArchitecture.
For my uArch Retire should be quite straightforward to implement.
Retire reads the tail (oldest) entry in the Instruction Queue (IQ) and
checks if the Done flag is set. If it is and the entry's Exception flag
is clear:
- if instruction was not a taken branch Retire adds the instruction
length to the committed RIP register.
- else if it is a taken branch Retire pops the new committed RIP from
the tail of the branch queue in the Branch Control Unit.
- it clears the Architecture Reg flag on the old dest physical register
(which also frees it) and sets it on the new dest physical register
- updates the Committed-RAT with the new dest register for the Arch register
- increments IQ tail pointer, freeing the entry.
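
The retire steps listed above can be sketched in software roughly as follows. This is only a behavioural model under stated assumptions, not anyone's actual design: the names `IQEntry`, `RetireUnit`, `arch_flag` and `retire_one` are hypothetical, both queues are modelled as deques with the oldest entry at index 0, and the exception path is not modelled at all.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class IQEntry:
    length: int                      # instruction length in bytes
    done: bool = False               # set when the function unit finishes
    exception: bool = False
    taken_branch: bool = False
    arch_reg: Optional[int] = None   # architectural dest register, if any
    new_phys: Optional[int] = None   # physical register holding the result


class RetireUnit:
    def __init__(self, num_arch: int, num_phys: int):
        self.committed_rip = 0
        self.iq = deque()            # index 0 is the tail (oldest entry)
        self.branch_queue = deque()  # index 0 is the oldest branch target
        # Committed-RAT: arch register -> physical register (identity at reset,
        # which is an assumption of this sketch).
        self.committed_rat = list(range(num_arch))
        # Architecture Reg flag, one per physical register.
        self.arch_flag = [p < num_arch for p in range(num_phys)]

    def retire_one(self) -> bool:
        """Try to retire the oldest IQ entry; return True if it retired."""
        if not self.iq:
            return False
        entry = self.iq[0]
        if not entry.done or entry.exception:
            return False             # exception handling is not modelled here
        if entry.taken_branch:
            # Pop the new committed RIP from the tail of the branch queue.
            self.committed_rip = self.branch_queue.popleft()
        else:
            self.committed_rip += entry.length
        if entry.arch_reg is not None:
            old_phys = self.committed_rat[entry.arch_reg]
            self.arch_flag[old_phys] = False       # this also frees old_phys
            self.arch_flag[entry.new_phys] = True
            self.committed_rat[entry.arch_reg] = entry.new_phys
        self.iq.popleft()            # advance the tail, freeing the entry
        return True
```

Calling `retire_one()` once per cycle models a 1-wide retire. Nothing here says anything about gate delay; it only shows that the bookkeeping per retired instruction is small.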
All of these would have been completed when the instruction comes out of
its function unit, and then retire multiplexes this data onto the
current retired instruction state. {2-gates not 13-gates}
IIRC the Alpha 21064 was 16 carefully tuned gates per stage so if my
Retire unit could hit 13 gates I'd be extremely chuffed (delighted).
I would likely be targeting 20 gates per stage anyway.
For example, Athlon was a 16-gate machine and Opteron was a 17-gate
machine. The 64-bit* integer adder was 11-gates of delay which had
been carefully tuned so it was at least as fast as 8-random gates
of FO4.
(*) and the 56-bit fraction FADD adder was also 11-gates.
As to gates of delay per stage::
At 20-gates you can run 6-wide anything-goes-anywhere forwarding and hit
each cache port twice per cycle (generally 1 RD 1 WT). This µArchitecture
shortens the number of retire stages. One can also use register file ports
twice per cycle so a 6-port RF can do 6 RDs and 6 WTs per cycle.
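
The port arithmetic in the paragraph above, as a tiny sketch. The function name `accesses_per_cycle` and the assumption that a double-used port does exactly one read and one write per cycle are mine, taken from the "generally 1 RD 1 WT" remark; a port used once per cycle is modelled as a single access of either kind.

```python
def accesses_per_cycle(ports: int, twice_per_cycle: bool) -> dict:
    """Count accesses available per cycle for a given number of ports.

    Assumption of this sketch: a port used twice per cycle does one read
    and one write; a port used once per cycle does a single access.
    """
    if twice_per_cycle:
        return {"reads": ports, "writes": ports, "total": 2 * ports}
    return {"total": ports}


# 6-port register file used twice per cycle: 6 RDs and 6 WTs.
print(accesses_per_cycle(6, twice_per_cycle=True))
# One cache port used twice per cycle: 1 RD and 1 WT.
print(accesses_per_cycle(1, twice_per_cycle=True))
```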
At 16-gates 3-4-wide machines can perform everything goes everywhere forwarding
but cannot run an SRAM twice per cycle {either RD-RD or RD-WT}. {It is right
on the edge of doable to use your register ports twice per cycle--I would
recommend not trying.} 30 years ago, with circuit designers tuning gates, you
could; now, with gates-only-from-library, you cannot.
At 12-gates per stage you cannot perform anything goes anywhere forwarding
{for example an ADD-Byte (x86) could not be forwarded to a 32-bit or 64-bit
integer ADD. Part of the problem is x86 defines byte addition as insert.}
At 8-gates per stage, the integer adder and accessing SRAM both take an
entire cycle, so a LD cannot be shorter than 3-cycles and set associative
caches are often 4-cycles. {So DM caches may actually outperform SA caches.}
Decode is at least 2 cycles even on a 1-wide machine. Decode is at least
3-cycles on a GBOoO machine. Forwarding is approximately ½ cycle.
-----------------------------
Having done designs in each of these arenas:: I lean towards 16-gates on
narrow machines and 20-gates on GBOoO machines.