Subject : Re: MSI interrupts
From : mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Groups : comp.arch
Date : 14. Mar 2025, 19:12:23
Organisation : Rocksolid Light
Message-ID : <aceeec2839b8824d52f0cbe709af51e1@www.novabbs.org>
User-Agent : Rocksolid Light
On Fri, 14 Mar 2025 17:35:23 +0000, Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
On Fri, 14 Mar 2025 14:52:47 +0000, Scott Lurndal wrote:
>
>
>
We CPU guys deal with dozens of cores, each having 2×64KB L1 caches
and a 256KB-1024KB L2 cache, with a dozen such cores sharing a 16MB L3
cache. This means the chip contains 26,624 1KB SRAM macros.
>
You've a lot more area to work with, and generally a more
recent process node.
>
>
>
Was thinking about this last night::
a) device goes up and reads DRAM via L3::MC and DRC
b) DRAM data is delivered to device 15ns later
>
15ns? That's optimistic and presumes a cache hit, right?
See the paragraph below {to hand-waving accuracy} for a more
reasonable guesstimate of 122 ns from device back to device,
just reading the MSI-X message and address.
Don't forget to factor in PCIe latency (bus to RC and RC to endpoint).
>
c) device uses data to send MSI-X message to interrupt 'controller'
d) interrupt controller in L3 sees interrupt
>
{to hand-waving accuracy}
So, we have a dozen ns up the PCIe tree, a dozen ns over the interconnect,
50 ns in DRAM, a dozen ns back over the interconnect, a dozen ns down the
PCIe tree, 1 ns at the device, a dozen ns up the PCIe tree, and a dozen ns
across the interconnect, arriving at the interrupt service port after
~122 ns, or the equivalent of about 600± clocks, to log the interrupt
into the table.
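Taking "a dozen" as ~12 ns, that tallies to 12+12+50+12+12+1+12+12 ≈ 123 ns,
which is where the ~122 ns figure comes from.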
>
The Priority broadcast is going to take another dozen ns, and the core's
request for the interrupt will be another dozen to the service controller;
so even if the service port request is serviced instantaneously,
the MSI-X message does not arrive at the core until 72 ns after arriving
at the service port--for a best-case latency on the order of 200 ns
(or 1,000 CPU cycles, or ~2,000 instructions worth of execution).
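(That is ~122 ns to the service port plus ~72 ns from the service port to
the core, call it ~194 ns; the 1,000-cycle and ~2,000-instruction figures
follow if you assume roughly a 5 GHz clock and ~2 IPC.)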
>
And that is under the assumption that no traffic interference
is encountered up or down the PCIe trees.
>
whereas::
>
if the device's DRAM read request were known to contain an MSI-X
message,
>
You can't know that a priori,
yes, I know that:: but if you c o u l d . . . you could save roughly
1/2 of the interrupt-delivery-to-core latency.
it's just another memory write
(or read, if you need to fetch the address and data from DRAM)
TLP as part of the inbound DMA, which needs to hit the IOMMU
first to translate the PCI memory space address to an address
in the host physical address space.
>
If the MSI-X tables were kept in DRAM, you also need to include
the IOMMU translation latency in the inbound path that fetches
the vector address and vector data (96 bits, so that's two
round trips from the device to memory). For a virtual function,
the MSI-X table is owned and managed by the guest, and all
transaction addresses from the device must be translated from
guest physical addresses to host physical addresses.
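For reference, here is a C view of one MSI-X table entry (the 16-byte,
4-DWORD layout is per the PCIe spec; the field names are mine). The device
needs the first 96 bits -- address plus data -- to form the interrupt
write, hence the two reads discussed above:

    #include <stdint.h>

    /* One MSI-X table entry as it sits in the (host-DRAM-resident) table. */
    struct msix_table_entry {
        uint32_t msg_addr_lo;    /* Message Address [31:0]        */
        uint32_t msg_addr_hi;    /* Message Address [63:32]       */
        uint32_t msg_data;       /* Message Data                  */
        uint32_t vector_control; /* bit 0 = per-vector mask       */
    };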
>
A miss in the IOMMU adds a _lot_ of latency to the request.
>
So, that's three round trips from the device to the
Uncore/RoC just to send a single interrupt from the device.
3 dozen-ns traversals, not counting actual memory access time.
Then another dozen ns traversal and enQueueing in the interrupt
table. Then 3 dozen-ns round trips on the on-die interconnect.
It all adds up.
>
The latency overhead of fetching the vector from DRAM is
prohibitive for high-speed devices such as network controllers.
>
Here we have the situation where one can context switch in fewer
clock cycles than it takes to deliver an interrupt from
a device to a servicing core.
>
Device needs to send an interrupt when the vectors are stored in host DRAM
instead of internal SRAM or flops:
>
- send non-posted MRD TLP to RC to fetch MSI-X address
- receiver (pcie controller (RC), for example) passes
MRD address to IOMMU for translation (assuming
the device and host don't implement ATS),
IOMMU translates (table walk latency) the
address from the TLP to a host physical
address (which could involve two levels of
translation, so up to 22 DRAM accesses (intel/amd/aarch64)
on IOMMU TLB miss). The latency is dependent
upon the IOMMU table format - Intel has EPT
while ARM and AMD use the same format as the CPU
page tables for the IOMMU tables.
(this leaves out any further latency hit when
using the PCI Page Request Interface (PRI) to make
the target page resident).
- LLC/DRAM satisfies the MRD and returns data to
PCIe controller, which sends a completion TLP
to device. LLC (minimum), DRAM (maximum) latency added.
- RC/host sends response with address to device
- Device sends non-posted MRD TLP to RC to fetch MSI-X Data
(32-bit). Again through the IOMMU, but it will likely
hit in the TLB. Less latency than a miss, but nonzero.
- RC returns completion TLP to device.
- Device sends MWR TLP (data payload) with the translated
address to the root complex, which passes it to the
internal bus structure for routing to the final address
(interrupt controller).
- Then add the latency from the interrupt controller to
the target core (which may include making the target guest
resident).
>
That's a whole pile of latency to send an interrupt.
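Stringing those steps together with the hand-waving numbers from up-thread
(~a dozen ns per PCIe or interconnect traversal, ~50 ns for DRAM) gives a
toy model like the one below; the LLC and IOMMU constants are placeholders
of mine, not anything measured:

    /* Toy latency model of the MRd/MRd/MWr sequence above.
     * Purely illustrative; tweak the constants to taste. */
    #include <stdio.h>

    enum {
        PCIE_HOP_NS   = 12,  /* device <-> root complex, one direction */
        FABRIC_HOP_NS = 12,  /* RC <-> LLC/MC, one direction           */
        DRAM_NS       = 50,  /* first MRd misses to DRAM               */
        LLC_NS        = 25,  /* placeholder: second MRd hits the LLC   */
        IOMMU_WALK_NS = 0    /* set >0 to model an IOTLB miss          */
    };

    /* one non-posted MRd: up the tree, translate, access memory, back */
    static int mrd_ns(int mem_ns, int iommu_ns)
    {
        return PCIE_HOP_NS + iommu_ns + FABRIC_HOP_NS + mem_ns
             + FABRIC_HOP_NS + PCIE_HOP_NS;
    }

    int main(void)
    {
        int ns = mrd_ns(DRAM_NS, IOMMU_WALK_NS)  /* fetch MSI-X address     */
               + mrd_ns(LLC_NS, 0)               /* fetch MSI-X data        */
               + PCIE_HOP_NS + FABRIC_HOP_NS;    /* MWr to intr controller
                                                    (IOTLB hit assumed)     */
        printf("~%d ns from device to interrupt controller\n", ns);
        return 0;
    }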
I bet the MSI-X messages would cache on the device rather well ...
as they change roughly at the rate of VM creation.
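A sketch of what that device-side caching could look like -- a per-vector
copy of address + data with a valid bit the host clears on the (rare)
table rewrite. The names and structure here are hypothetical:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical device-local cache of MSI-X entries fetched from
     * host DRAM.  Entries change roughly at VM-creation rate, so a
     * simple per-vector valid bit would almost always hit. */
    struct msix_cache_entry {
        uint64_t addr;   /* cached Message Address              */
        uint32_t data;   /* cached Message Data                 */
        bool     valid;  /* cleared when host updates the entry */
    };

    /* Returns true and fills *addr/*data on a hit; on a miss the
     * device falls back to the two MRd round trips described above. */
    bool msix_cache_lookup(struct msix_cache_entry *tbl, unsigned vec,
                           uint64_t *addr, uint32_t *data)
    {
        if (!tbl[vec].valid)
            return false;
        *addr = tbl[vec].addr;
        *data = tbl[vec].data;
        return true;
    }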