Subject : Re: Efficiency of in-order vs. OoO
From : mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups : comp.arch
Date : 13 Mar 2024, 16:34:47
Organisation : Rocksolid Light
Message-ID : <8eddf637ee73fe0e4b0f3e0f634e7234@www.novabbs.org>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
User-Agent : Rocksolid Light
EricP wrote:
MitchAlsup1 wrote:
EricP wrote:
My concern is that the circuit for doing this could be pretty complicated.
Essentially equal in complexity to an IO retirement µArchitecture.
For my uArch, Retire should be quite straightforward to implement.
Retire reads the tail (oldest) entry in the Instruction Queue (IQ) and
checks if the Done flag is set. If it is and the entry's Exception flag
is clear:
- if the instruction was not a taken branch, Retire adds the instruction
length to the committed RIP register.
- else, if it is a taken branch, Retire pops the new committed RIP from
the tail of the branch queue in the Branch Control Unit.
- it clears the Architecture Reg flag on the old dest physical register
(which also frees it) and sets it on the new dest physical register
- updates the Committed-RAT with the new dest register for the Arch register
- increments IQ tail pointer, freeing the entry.
All of these would have been completed by the time the instruction comes
out of its function unit; Retire then multiplexes this data onto the
current retired instruction state. {2 gates, not 13 gates}
If the entry's Exception flag is set then it is also straightforward:
flush all in-flight instructions, bulk copy the Committed-RAT into the
Future-RAT to restore renaming, and set a jump address in Fetch.
(Any in-flight cache miss operations are allowed to complete.)
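The in-order Retire steps described above can be sketched as a small behavioural model. This is only an illustration under stated assumptions, not the actual circuit: the names `IQEntry`, `Core`, `retire_one` and `handler_addr` are hypothetical, the IQ and branch queue are modelled as Python deques with the tail (oldest) entry at index 0, and clearing the branch queue on a flush is an assumption of the sketch.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class IQEntry:
    done: bool                      # set when the function unit has finished
    exception: bool                 # set if the instruction raised an exception
    length: int                     # instruction length in bytes
    taken_branch: bool
    arch_reg: Optional[int] = None  # destination architectural register, if any
    new_phys: Optional[int] = None  # newly allocated destination physical register


class Core:
    def __init__(self, num_arch: int, num_phys: int):
        self.iq = deque()           # index 0 is the tail (oldest) entry
        self.branch_q = deque()     # targets of taken branches, oldest first
        self.committed_rip = 0
        self.committed_rat = list(range(num_arch))
        self.future_rat = list(range(num_arch))
        # "Architecture Reg" flag per physical register
        self.is_arch = [p < num_arch for p in range(num_phys)]
        self.fetch_redirect = None

    def retire_one(self, handler_addr: int = 0) -> str:
        # Nothing to do until the oldest entry has its Done flag set.
        if not self.iq or not self.iq[0].done:
            return "stall"
        e = self.iq[0]
        if e.exception:
            # Flush everything in flight, restore renaming, redirect Fetch.
            self.iq.clear()
            self.branch_q.clear()   # assumption: branch queue is flushed too
            self.future_rat = list(self.committed_rat)
            self.fetch_redirect = handler_addr
            return "exception"
        if e.taken_branch:
            self.committed_rip = self.branch_q.popleft()
        else:
            self.committed_rip += e.length
        if e.arch_reg is not None:
            old = self.committed_rat[e.arch_reg]
            self.is_arch[old] = False        # clearing the flag also frees it
            self.is_arch[e.new_phys] = True
            self.committed_rat[e.arch_reg] = e.new_phys
        self.iq.popleft()                    # advance the tail, freeing the entry
        return "retired"
```

In hardware these updates would happen in parallel within one clock; the sequential Python only shows which state each step touches.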
It is also relatively straightforward to do multiple retires per clock;
each mostly costs an extra read port on the IQ and extra write ports on
the Committed-RAT and the Physical Register Status register.
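As a rough illustration of multiple retires per clock, the sketch below counts how many of the oldest IQ entries could retire together. It assumes the IQ is a circular buffer indexed from `tail`, with one read port per retire slot; the function name `retire_count` and its parameters are hypothetical. It also assumes that entries older than an excepting instruction still retire in the same clock, which the post does not specify.

```python
def retire_count(done, exception, tail, count, width):
    """Return (number retired this clock, whether an exception was hit).

    done, exception: per-entry flag lists of the circular IQ
    tail:  index of the oldest valid entry
    count: number of valid entries in the IQ
    width: retire slots per clock (one IQ read port each)
    """
    size = len(done)
    retired = 0
    for slot in range(min(width, count)):
        idx = (tail + slot) % size      # wrap around the circular buffer
        if not done[idx]:
            break                       # in-order: stop at the first not-done entry
        if exception[idx]:
            return retired, True        # older entries retire, then flush
        retired += 1
    return retired, False
```

For example, with `done = [True, True, False, True]`, no exceptions, `tail = 0`, `count = 4` and `width = 3`, the first two entries retire and the third blocks the rest.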
Many of the pieces that have to be checked are scattered around the core.
Also, many of the states are in circular buffers, so determining "older" starts
getting slightly hairy (the Load Store Queue has a similar problem for
disambiguation determining if all older loads and stores have "resolved").
And all this has to run in parallel so it takes less than 1 clock.
Adding the structures to support OoO Retire would greatly complicate this.
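To make the "older" problem in circular buffers concrete, here is a minimal sketch of one common approach: compare ages relative to the tail pointer rather than comparing raw indices. The helper names (`age`, `is_older`, `all_older_resolved`) and the `valid`/`resolved` flag lists are assumptions for illustration; a real Load Store Queue would do this with parallel comparators or wrap bits rather than a loop.

```python
def age(idx, tail, size):
    # Distance from the tail (oldest entry); 0 means oldest.
    return (idx - tail) % size


def is_older(a, b, tail, size):
    # True if entry a was allocated before entry b in the circular buffer.
    return age(a, tail, size) < age(b, tail, size)


def all_older_resolved(resolved, valid, idx, tail):
    # True if every valid entry older than idx has resolved.
    size = len(resolved)
    return all(
        resolved[i]
        for i in range(size)
        if valid[i] and is_older(i, idx, tail, size)
    )
```

With an 8-entry buffer and `tail = 6`, index 7 is older than index 1 even though 7 > 1, which is the wrap-around case that makes a raw index comparison wrong.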