Subject: Re: Reservation stations [was Continuations]
From: mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups: comp.arch
Date: 21 Jul 2024, 20:44:49
Organization: Rocksolid Light
Message-ID: <a0f443093d1a10de29650d34ac74a70e@www.novabbs.org>
References: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
User-Agent: Rocksolid Light
On Sun, 21 Jul 2024 16:28:43 +0000, EricP wrote:
MitchAlsup1 wrote:
On Thu, 18 Jul 2024 0:48:18 +0000, EricP wrote:
>
MitchAlsup1 wrote:
>
{Would be an interesting reservation station design, though}
>
In what way would the RS be interesting or different?
>
The instruction stream consists of 4 FMAC-bound instructions unrolled
as many times as will fit in register file.
>
Your typical reservation station can accept 1 new instruction per cycle
from the decoder. So, either the decoder has to spew the instructions
across the stations (and remember they are data dependent) or the
station has to fire more than one per cycle to the FMAC units.
>
So, instead of 1-in, 1-out per cycle, you need 4-in 4-out per cycle
and maybe some kind of exotic routing.
>
This is where I saw a benefit to using valued reservation stations vs
valueless ones - when a uArch has multiple similar FU each with its own
bank of RS that is scheduled for that FU.
>
Example of horizontal scaling of similar FU each with its own RS bank.
https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/07/cheese_oryon_diagram_revised.png
>
With valueless RS, each RS stores only the source register number of
its operands and each FU has to be able to read all its operands
when a uOp launches (begins execution). This means the number of
PRF read ports scales according to the total number of FU operands.
(One could do read port sharing but then you have to schedule that too
and could have contention.) Also if an FU is unused on any cycle then
all its (expensive) operand read ports are unused.
I always had RSs keep track of which FU was delivering the final
operand, so that these could be picked up by the forwarding logic
and not need a RF port. This gets rid of 50%-75% of the RF port
needs.
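
A minimal sketch of the bookkeeping described just above, under assumptions of my own: the `Operand` record, the `producer_fu` field, and `rf_reads_at_launch` are hypothetical names, and an operand whose producer is not forwarding on the launch cycle is assumed to have been written back already, so it is read from the RF.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Operand:
    reg: int                           # physical register number
    producer_fu: Optional[int] = None  # FU that will deliver it; None if already in the RF


def rf_reads_at_launch(operands: List[Operand], forwarding_fus: Set[int]) -> int:
    """Count operands that need an RF read port on the launch cycle.

    An operand whose recorded producer FU is driving the forwarding
    network this cycle is captured from the bus instead of the RF.
    """
    return sum(1 for op in operands if op.producer_fu not in forwarding_fus)


# Three-operand uOp: two operands already in the RF, the last one arrives from FU 5.
uop = [Operand(reg=3), Operand(reg=7), Operand(reg=12, producer_fu=5)]
print(rf_reads_at_launch(uop, forwarding_fus={5}))  # 2 RF reads instead of 3
```

This only shows the mechanism for one uOp; it does not try to reproduce the 50%-75% figure, which depends on the instruction mix.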
>
Using the above Oryon as an example, with valueless RS, to launch
all 14 FU with 3 operands all at once needs 42 read ports.
>
With valued RS the operand values are stored in each RS and, if ready,
are read at Dispatch (hand-off from the front end to the RS bank) or are
received from the forwarding network if in-flight at Dispatch time.
Delivering result at dispatch time.
The number of PRF read ports scales with the number of dispatched uOp
operands. Since the operand values are stored in each RS, each bank
can then schedule and launch independently.
The width of the decoder is narrower than the width of the data path.
We used to call this "catch up bandwidth".
>
With valued RS, to Dispatch 6 wide with 3 operands needs 18 read ports,
First, a 6-wide machine is not doing six 3-operand instructions;
it is more like 3 memory ops (2 registers + displacement), one 3-op,
one general 2-op, and one 1-op (branch), so you only need 12 ports
instead of 18 most of the time.
The penalty is that each RS entry is 5× the size of the value-free
RS designs. These work just fine when the execution window is
reasonable (say 96 instructions) but fail when the window is
larger than 150-ish.
and the read ports are potentially usable for all dispatches.
Then all 14 FU can launch at once independently.
One should also note that these machines deliver 1-2 I/c RMS
regardless of their Fetch-Decode-FU widths.
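
The read-port arithmetic in the exchange above can be checked with a short sketch. The function names are hypothetical; the counts (14 FUs, 3 operands, 6-wide Dispatch, and the "typical" 6-wide mix) are taken from the text, and the mix is only the one example given there.

```python
from typing import List, Tuple


def valueless_read_ports(num_fus: int, operands_per_fu: int) -> int:
    """Valueless RS: every FU reads all its operands from the PRF at launch."""
    return num_fus * operands_per_fu


def valued_read_ports_worst(dispatch_width: int, operands_per_uop: int) -> int:
    """Valued RS: operands are read at Dispatch, so ports scale with dispatch width."""
    return dispatch_width * operands_per_uop


def valued_read_ports_mix(mix: List[Tuple[str, int, int]]) -> int:
    """mix holds (label, register operands per uOp, number of such uOps)."""
    return sum(regs * count for _label, regs, count in mix)


# The "typical" 6-wide dispatch group described in the reply.
typical_mix = [
    ("memory op (2 regs + displacement)", 2, 3),
    ("3-operand op", 3, 1),
    ("general 2-operand op", 2, 1),
    ("branch (1 operand)", 1, 1),
]

print(valueless_read_ports(14, 3))         # 42
print(valued_read_ports_worst(6, 3))       # 18
print(valued_read_ports_mix(typical_mix))  # 12
```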
>
Each FU can also have two kinds of valued RS banks,
a simple one if all the operands are ready at Dispatch as this does
not need a wake-up matrix entry or need to receive forwarded values,
and a complex one that monitors the wake-up matrix and forwarding buses.
If all the operands are ready, the Dispatcher can choose either RS bank
for the FU, giving preference to the simpler. If not all operands are
ready then the Dispatcher selects the complex bank.
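
The bank-selection rule above could be sketched roughly as follows. `Bank` and `choose_bank` are hypothetical names, and stalling (returning `None`) when no suitable bank has a free entry is my assumption; the post does not say what happens in that case.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bank:
    capacity: int
    used: int = 0

    def has_room(self) -> bool:
        return self.used < self.capacity


def choose_bank(operands_ready: List[bool], simple: Bank, complex_bank: Bank) -> Optional[str]:
    """Pick the RS bank for a uOp at Dispatch; None means stall (assumed)."""
    if all(operands_ready):
        # Either bank works; prefer the simple one, which needs no
        # wake-up matrix entry and no forwarding-bus capture.
        if simple.has_room():
            return "simple"
        if complex_bank.has_room():
            return "complex"
        return None
    # At least one operand is still in flight: only the complex bank can wait for it.
    return "complex" if complex_bank.has_room() else None


print(choose_bank([True, True, True], Bank(4), Bank(8)))   # simple
print(choose_bank([True, False, True], Bank(4), Bank(8)))  # complex
```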