On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:
Michael S wrote:
On Mon, 17 Mar 2025 13:38:12 GMT
scott@slp53.sl.home (Scott Lurndal) wrote:
------------------
The problem Robert is talking about arises when there are many
interrupt sources and many target CPUs.
The required routing/prioritization/acknowledgment logic (at least the
naive logic I have in mind) would be either non-scalable or
relatively complicated. In the latter case, the selection process will
take multiple cycles (I am thinking of a ring).
Another problem is what does the core do with the in flight
instructions.
>
Method 1 is the simplest: it injects the interrupt request at Retire,
as that is where the state of everything is synchronized.
The consequence is that, like exceptions, the in-flight instructions all
get purged, and we save the committed RIP, RSP and interrupt control
word.
While that might be acceptable for a 5-stage in-order pipeline,
it could be pretty expensive for an OoO core with a 200+ entry
instruction queue, potentially tossing hundreds of cycles of nearly
finished work.
Lowest interrupt latency
Highest waste of power (i.e., work)
Method 2 pipelines the switch by injecting the interrupt request at
Fetch.
Decode converts the request to a special uOp that travels down the IQ
to Retire and allows all the older work to complete.
This is more complex, as it requires a two-phase hand-off from the
Interrupt Control Unit (ICU) to the core: a branch mispredict among the
in-flight instructions might cause a tentative interrupt acceptance to
later be withdrawn.
Interrupt latency is dependent on the executing instructions,
Lowest waste of power
But note: In most cases, it already took the interrupt ~150 nanoseconds
to arrive at the interrupt service port. 1 trip from device to DRAM
(possibly serviced by L3), 1 trip from DRAM back to device, 1 trip from
device to interrupt service port; and 4 DRAM (or L3) accesses to log
interrupt into table.
Also, in most cases, the 200-odd instructions in the window will finish
in 100 cycles, or as little as 20 ns--but if the FDIV unit is saturated,
interrupt latency could be as high as 640 cycles and as long as 640 ns.
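Back-of-the-envelope, the drain numbers above fall out of window size
over retire width. A sketch in Python (the retire width of 2 and the
FDIV decomposition are my own illustrative assumptions, not anything
specified upthread):

```python
# Drain time of the in-flight window before a Method-2 interrupt can be
# taken at Retire. Retire width and FDIV figures are illustrative
# assumptions, not My 66000 (or any specific core's) parameters.

def drain_cycles(window_insts: int, retire_width: int) -> int:
    """Best case: the window retires at full width every cycle."""
    return -(-window_insts // retire_width)   # ceiling division

# ~200 in-flight instructions retiring 2 per cycle -> ~100 cycles
assert drain_cycles(200, 2) == 100

# Worst case: a saturated, non-pipelined FDIV unit serializes the tail.
# E.g. 32 queued divides at ~20 cycles each -> 640 cycles.
fdiv_latency, queued_divides = 20, 32
assert fdiv_latency * queued_divides == 640
```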
The ICU believes the core is in a state to accept a higher priority
interrupt. It sends a request to the core, which checks its current
state and sends back either an immediate INT_ACK if it _might_ accept
(and stalls Fetch), or a NAK.
In My 66000, ICU knows nothing about the priority level (or state)
of any core in the system. Instead, when a new higher priority
interrupt is raised, the ISP broadcasts a 64-bit mask indicating
which priority levels in the interrupt table have pending interrupts,
with an MMI/O message to the address of the interrupt table.
All cores monitoring that interrupt table capture the broadcast,
and each core decides whether (or not) to negotiate for an interrupt
by requesting the highest priority interrupt from the table.
When the request returns, and it is still at a higher priority
than what the core is running, the core performs the interrupt control
transfer. If the interrupt is below the core's priority, it is returned
to the ISP as if NAKed.
Prior to the interrupt control transfer, the core remains running
whatever it was running--and all the interrupt stuff is done by state
machines at the edge of the core and the L3/DRAM controller.
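A toy model of that negotiation, just to pin down the sequencing. The
function names and the priority encoding are my own invention, not My
66000's actual encoding:

```python
# Sketch of the My 66000-style broadcast/negotiate flow described above.
# All names and the bit-per-priority-level encoding are assumptions.

def highest_pending(mask: int) -> int:
    """Highest set bit in the 64-bit pending-priority broadcast mask
    (-1 if no bits are set)."""
    return mask.bit_length() - 1

def core_should_negotiate(mask: int, core_priority: int) -> bool:
    """A core requests from the table only if some pending level beats
    the priority it is currently running at."""
    return highest_pending(mask) > core_priority

def on_request_return(granted_priority: int, core_priority: int) -> str:
    """By the time the request returns, the comparison is redone; a
    grant that no longer beats the core bounces back as if NAKed."""
    if granted_priority > core_priority:
        return "control-transfer"
    return "return-to-ISP"

# Priority levels 5 and 12 pending; a core running at level 7 negotiates.
mask = (1 << 5) | (1 << 12)
assert core_should_negotiate(mask, core_priority=7)
# If the core has meanwhile been raised to level 12, the grant bounces.
assert on_request_return(12, core_priority=12) == "return-to-ISP"
```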
When the special uOp reaches Retire, it sends a signal to Fetch, which
then sends an INT_ACCEPT signal to the ICU to complete the handoff.
If a branch mispredict occurs that causes interrupts to be disabled,
then Fetch sends an INT_REJECT to the ICU and unstalls its fetching.
(Yes, that is not optimal - make it work first, make it work well
second.)
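The two-phase hand-off above is small enough to model as a per-core
state machine. The signal names (INT_ACK, INT_ACCEPT, INT_REJECT, NAK)
are from the post; the state names are mine:

```python
# Toy model of the two-phase ICU<->core handoff: phase 1 is the
# tentative INT_ACK that stalls Fetch; phase 2 resolves to INT_ACCEPT
# when the special uOp retires, or INT_REJECT if a mispredicted branch
# disabled interrupts. State names are illustrative assumptions.

class CoreIntNegotiator:
    def __init__(self):
        self.state = "IDLE"

    def icu_request(self, can_accept: bool) -> str:
        if self.state == "IDLE" and can_accept:
            self.state = "TENTATIVE"     # Fetch stalled, uOp injected
            return "INT_ACK"
        return "NAK"

    def uop_retired(self) -> str:
        assert self.state == "TENTATIVE"
        self.state = "IDLE"
        return "INT_ACCEPT"              # handoff complete

    def mispredict_disabled_ints(self) -> str:
        assert self.state == "TENTATIVE"
        self.state = "IDLE"              # Fetch unstalled
        return "INT_REJECT"

neg = CoreIntNegotiator()
assert neg.icu_request(can_accept=True) == "INT_ACK"
assert neg.uop_retired() == "INT_ACCEPT"
assert neg.icu_request(can_accept=False) == "NAK"
```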
This also raises a question about what the ICU is doing during this
long-latency handoff. One wouldn't want the ICU to sit idle, so it
might have to manage the handoff of multiple interrupts to multiple
cores at the same time, each as its own little state machine.
One must assume that the ISP is capable of taking a new interrupt
from a device every 5-ish cycles, that an interrupt handoff is in the
range of 50 cycles, and that each interrupt could be to a different
interrupt table.
My 66000 ISP treats successive requests to any one table as strongly
ordered, and requests to different tables as completely unordered.
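One way to model that ordering rule is an independent FIFO per
interrupt table: strong order within a table, none across tables. A
sketch (the data structures are mine, not the ISP's):

```python
# Per-table request queues matching the stated ordering model:
# successive requests to one table are strongly ordered; requests to
# different tables are completely unordered. Structure is an assumption.
from collections import deque

class ISPQueues:
    def __init__(self):
        self.tables: dict[int, deque] = {}

    def enqueue(self, table: int, request: str) -> None:
        self.tables.setdefault(table, deque()).append(request)

    def service(self, table: int) -> str:
        return self.tables[table].popleft()   # oldest first, per table

isp = ISPQueues()
isp.enqueue(1, "msi-a"); isp.enqueue(1, "msi-b"); isp.enqueue(2, "msi-c")
assert isp.service(2) == "msi-c"   # table 2 need not wait behind table 1
assert isp.service(1) == "msi-a"   # within table 1, order is preserved
```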
One should see that this decision on how the core handles the
handoff has a large impact on the design complexity of the ICU.
I did not "see" that in My 66000's interrupt architecture. The ISP
complexity is fixed, and the core's interrupt negotiator is a small
state machine (~10 states).
The ISP essentially performs 4-5 64-bit memory accesses, and possibly
one 64-bit MMI/O broadcast, on arrival of an MSI-X interrupt. Then, if
a core negotiates, it performs 3 more memory accesses per negotiation.