Re: PCIe MSI-X interrupts

Subject : Re: PCIe MSI-X interrupts
From : mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups : comp.arch
Date : 22 Jun 2024, 03:12:50
Organization : Rocksolid Light
Message-ID : <0faeb2831a76d32cd6fb8cff7b546807@www.novabbs.org>
References : 1 2
User-Agent : Rocksolid Light
Scott Lurndal wrote:
First of all, allow me to express my gratitude for such a well
thought-out response, compared to the miscellaneous ramblings
going on in my head.

mitchalsup@aol.com (MitchAlsup1) writes:
PCIe has an MSI-X interrupt 'capability' which consists of
a number (n) of interrupt descriptors and an associated Pending
Bit Array where each bit in the PBA has a corresponding 128-bit
descriptor. A descriptor contains a 64-bit address, a 32-bit
message, and a 32-bit vector control word.
>
There are 2 levels of enablement: one at the MSI-X configuration
control register, and a per-vector mask bit (bit 0 of the Vector
Control word) in each interrupt descriptor.
>
As the device raises an interrupt, it sets a bit in PBA.
>
When MSI-X is enabled and a bit in the PBA is set (1) and the
per-vector mask bit is clear, the device sends a
write of the message to the address in the descriptor,
and clears the bit in the PBA.

Note that if the interrupt condition is asserted after the
global enable in the MSI-X capability and the vector enable
have both been set to allow delivery, the message will be sent to
the root complex and PBA will not be updated.   (P is for
pending, and once the message is sent, it's no longer
pending).  PBA is only updated when the interrupt is masked
(either function-wide in the capability or per-vector).
So, the interrupt only becomes pending in the PBA if it cannot be sent immediately. Thanks for the clarification.
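
For concreteness, here is a minimal C sketch of the per-vector layout the
PCIe spec defines (the PBA itself is just a separate array with one bit per
vector); the pending/mask interplay described above is summarized in the
comments:

#include <stdint.h>

/* One MSI-X table entry (one per vector), 128 bits in total. */
struct msix_table_entry {
    uint32_t msg_addr_lo;   /* low  32 bits of the target address         */
    uint32_t msg_addr_hi;   /* high 32 bits of the target address         */
    uint32_t msg_data;      /* 32-bit message written to that address     */
    uint32_t vector_ctrl;   /* bit 0 = per-vector mask bit, rest reserved */
};

#define MSIX_VECTOR_MASKED 0x1u

/*
 * Pending Bit Array: one bit per vector.  A bit is set only while the
 * interrupt condition is asserted but delivery is blocked (by the
 * function-wide enable or the per-vector mask); once the message write
 * goes upstream, the bit is cleared -- the interrupt is no longer
 * "pending".
 */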

>
I am assuming that the MSI-X enable bit is used to throttle

In my experience the MSI-X function enable and vector enables
are not modified during runtime; rather, the device has control
registers which allow masking of the interrupt (e.g.
for AHCI, the MSI message will only be sent if the port
PxIE (Port n Interrupt Enable) bit corresponding to a
PxIS (Port n Interrupt Status) bit is set).
So, these degenerated into more masking levels that are not
used very often because other masks can be applied elsewhere.

Granted, AHCI specifies MSI, not MSI-X, but every MSI-X
device I've worked with operates the same way, with
device specific interrupt enables for a particular vector.
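
As a rough sketch of that device-side gating, in the AHCI style (PxIS and
PxIE are real AHCI registers; the surrounding helper is invented for
illustration): the message is generated only when a status bit lines up
with its enable bit.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device-side check in the spirit of AHCI's PxIS/PxIE:
 * an MSI/MSI-X message is generated only when at least one interrupt
 * status bit has its corresponding enable bit set. */
static bool port_should_interrupt(uint32_t pxis, uint32_t pxie)
{
    return (pxis & pxie) != 0;
}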

a device so that it sends bursts of interrupts to optimize
the caching behavior of the cores handling the interrupts:
run applications -> handle k interrupts -> run applications.
A home machine would not use this feature as the interrupt
load is small, but a GB server might want more control over when.
But does anybody know ??

Yes, we use MSI-X extensively.  See above.

There are a number of mechanisms used for interrupt moderation,
but all generally are independent of the PCI message delivery
(e.g. RSS spreads interrupts across multiple target cores,
 or the Intel 10GbE network adapters' interrupt moderation feature).

>
a) device command to interrupt descriptor mapping {
There is no mention of the mapping of commands to the device
and to these interrupt descriptors. Can anyone supply input
or pointers to this mapping?

Once the message leaves the device, is received by the
root complex port and is forwarded across the host bridge
to the system fabric, it's completely under control of
the host.   On x86, the TLP for the upstream message is
received and forwarded to the specified address (which is
the IOAPIC on Intel and the GIC ITS on Arm64).

The interrupt controller may further mask the interrupt if
desired or if the interrupt priority is lower than the
current running priority.
{note to self:: that is why it's a local APIC--it has to be close
enough to see the core's priority.}
Question:: Down below you talk of the various interrupt controllers
routing an interrupt <finally> to a core. What happens if the core
has changed its priority by the time the interrupt signal arrives,
but before it can change the state of the tables in the interrupt
controller that routed said interrupt here ?
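
The running-priority check mentioned above is, at its core, just a compare
at delivery time; a minimal sketch using the GIC convention, where a
numerically lower value means a higher priority (names invented):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical delivery decision: signal the core only if the incoming
 * interrupt outranks what the core is currently running (GIC-style:
 * lower number = higher priority); otherwise it stays pending in the
 * interrupt controller until the running priority drops. */
static bool should_deliver(uint8_t intr_prio, uint8_t running_prio)
{
    return intr_prio < running_prio;
}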

>
A single device (such as a SATA drive) might have a queue of
outstanding commands that it services in whatever order it
thinks best. Many of these commands want to inform some core
when the command is complete (or cannot be completed). To do
this, the device sends a stored interrupt message to the stored
service port.

Each SATA port has a PxIS and a PxIE register.   The SATA (AHCI)
controller
MSI configuration can provide one vector per port - the main
difference between MSI and MSI-X is that the interrupt numbers
for MSI must be consecutive and there is only one address;
while for MSI-X each vector has a unique address and a programmable
data (interrupt number) field.   The interpretation of the data
of the MSI-X or MSI upstream write is up to the interrupt controller
and may be virtualized in the interrupt controller.
I see (below) that you (they) migrated all the stuff I thought might
be either in the address or data to the "other side" of HostBridge.
Fair enough.
For what reason are there multiple addresses, instead of a range
of addresses providing a more globally-scoped service port?
Perhaps it is an address at the interrupt descriptor, and an
address range at the global interrupt controller, where different
addresses then mean different things.
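
A small sketch of the MSI vs. MSI-X difference described above, as software
sees it (the helper name is invented):

#include <stdint.h>

/* MSI: the function has ONE message address and one aligned base data
 * value; vector i is signalled by substituting i into the low bits of
 * that data, so the per-vector values are necessarily consecutive. */
static uint32_t msi_data_for_vector(uint32_t base_data, unsigned i)
{
    return base_data | i;   /* low bits of base_data are zero by alignment */
}

/* MSI-X: vector i has its own 128-bit table entry, so both the target
 * address and the 32-bit data value can differ per vector -- which is
 * what allows each vector to aim at a different service-port register. */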

Note that support for MSI in AHCI is optional (in which case the
legacy level sensitive PCI INTA/B/C/D signals are used).

The AHCI standard specification (ahci_1_3.pdf) is available publicly.

}
I don't really NEED to know this mapping, but knowing would
significantly enhance my understanding of what is supposed to be
going on, and thus help me avoid making crippling errors.
>
b) address space of interrupt service port {
The address in the interrupt descriptor points at a service port
(APIC). Since a service port is "not like memory"*, I want to
mandate this address be in MMI/O space, and since My 66000 has a
full 64-bit address space for MMI/O there is no burden on the size
of MMI/O space--it is already as big as possible on a 64-bit
machine. Plus, MMI/O space has the property of being sequentially
consistent whereas DRAM is only cache consistent.

From the standpoint of the PCI Express root port, the upstream write
generated by the device to send the MSI message to the host
looks just like any other inbound DMA from the device to the
host.   It is the responsibility of the host bridge and interconnect
to route the message to the appropriate destination (which generally
is an interrupt controller, but just as legally could be a
DRAM address which software polls periodically).
So the message arriving at the top of the PCIe tree is raw; the
address then gets translated by the I/O MMU, and both the translated
address and the raw data are passed forward to their fate.

>
Most current architectures just partition a hunk of the physical address space as MMI/O address space.

The address field in the MSI-X vector (or MSI-X capability)
is opaque to hardware below the PCIe root port.

Our chips recognize the interrupt controller range of
addresses in the inbound message at the host bridge
and route the message to the interrupt translation service;
the destinations in the interrupt controller are simply
control and status registers in the MMIO space.   The
ARM64 interrupt controller supports multiple destinations
with different semantics (SPI and xSPI have one target
register and LPI has a different target register the address
of which is programmed into the MSI-X Vector address field).
What I am trying to do is to figure out a means to route the
message to a virtual core's interrupt table such that:: if that
virtual core happens to be running on any physical core, the
physical core sees the interrupt without delay, and if
the virtual core is not running, the event is properly logged
so that when the virtual core runs on a physical core those
ISRs are performed before any lower-priority work is performed.
{and make this work for any number of physical cores and any
number of virtual cores, where cores can share interrupt tables.
For example, Guest OS[k] thinks that it has 13 cores
and shares its interrupt table across 5 of them, but the HyperVisor
remains free to time-slice Guest OS[k]'s cores any way it likes.}

>
(*) memory has the property that a read will return the last
bit pattern written; a service port does not.
>
I assume that service port addresses map to different cores (or local APICs of a core).

The IOAPIC handles the message and has configuration registers
that determine which lAPIC should be signalled.

The GIC has configuration tables in memory that can remap
the interrupt to a different vector (e.g. for a guest VM).
GIC = Global Interrupt Controller ?

I want to directly support the
notion of a virtual core, so that while a 'chip' might have a large
number of physical cores, one would want a pool of thousands+ of
virtual cores. I want said service ports to support raising an
interrupt directly to a physical or virtual core.

Take a look at IHI0069
(https://developer.arm.com/documentation/ihi0069/latest/)

}
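
For readers who do not want to dig through IHI0069 right away, the ITS
translation it describes boils down to roughly this two-level lookup (a
sketch; the real structures, caching and error handling are considerably
more involved):

#include <stdint.h>

/* Sketch of the GIC ITS model: the DeviceID (derived from the PCI
 * requester/stream ID) selects a per-device Interrupt Translation
 * Table, and the EventID (the MSI-X data payload) indexes that table
 * to yield a system-wide unique LPI and a target collection/CPU. */
struct its_itt_entry {
    uint32_t lpi;          /* physical LPI number                       */
    uint16_t collection;   /* which collection (i.e. target CPU) to use */
};

struct its_device {
    struct its_itt_entry *itt;       /* per-device translation table    */
    uint32_t              nr_events; /* how many EventIDs are mapped    */
};

static int its_translate(const struct its_device *dev, uint32_t event_id,
                         struct its_itt_entry *out)
{
    if (event_id >= dev->nr_events)
        return -1;                   /* unmapped event: reported/dropped */
    *out = dev->itt[event_id];
    return 0;
}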
>
Apparently, the message part of the MSI-X interrupt can be interpreted in any way that both SW and HW agree on.

Yes.

This works
for already-defined architectures, and doing it like one
or more of the others makes an OS port significantly easier.
However, what these messages contain is difficult to find
via Google.

The message is a 32-bit field and it is fully interpreted by
the interrupt controller (the GIC can be configured to support
a 16- to 32-bit data payload in an upstream MSI-X write;
the interpretation of the data is host specific).

On Intel and ARM systems, the firmware knows the grungy details
and simply passes the desired payload value to the kernel
via the device tree (Linux) or ACPI tables (Windows/Linux).
>
So, it seems to me, that the combination of the 64-bit address
and the 32-bit message must provide::
a) which level of the system to interrupt
  {Secure Monitor, HyperVisor, SuperVisor, Application}

No.  That's completely a function of the interrupt controller
and how the hardware handles the data payload.

b) which core should handle the interrupt
  {physical[0..k], virtual[l..m]}

Again, a function of the interrupt controller.

c) what priority level is the interrupt.
  {There are 64 unique priority levels}

Yep, a function of the interrupt controller.

d) something about why the interrupt was raised

The interrupt itself causes the operating system
device driver interrupt function to be invoked. The
device-specific interrupt handler determines both
why the interrupt was raised (e.g. via the PxIS
register in the AHCI/SATA controller) and takes
the appropriate action.
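
A sketch of what "determines why and takes the appropriate action" looks
like for the AHCI example (PxIS bits are write-1-to-clear; the handler body
is schematic):

#include <stdint.h>

/* Hypothetical per-port AHCI interrupt handler: read the status, ack it
 * (PxIS is write-1-to-clear), then act on the causes that were both
 * asserted and enabled. */
static void ahci_port_isr(volatile uint32_t *pxis, volatile uint32_t *pxie)
{
    uint32_t cause = *pxis & *pxie;   /* why the interrupt was raised   */

    *pxis = cause;                    /* acknowledge what we observed   */

    /* ... complete finished commands, report errors, etc. ... */
    (void)cause;
}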

On ARM64, it is common for the data fields for
the MSI-X interrupts to be numbered starting at zero
on every device, and they're mapped to a system-wide
unique value by the interrupt controller (e.g.
the GICv4 ITS).
I was expecting that.

               If interrupt remapping hardware is
not available then unique data payloads for each
device need to be used.

Note that like any other inbound DMA, the address
in the MSI-X TLP that gets sent to the host bridge is subject
to translation by an IOMMU before getting to the
interrupt controller (or by the device itself if it
supports PCI-e Address Translation Services (ATS)).
Obviously.

  {what remains of the message}
>
I suspect that (a) and (b) are parts of the address while (c)
and (d) are part of the message. Although nothing prevents
(c) from being part of the address.
>
Once MSI-X is sorted out MSI becomes a subset.
>
HostBridge has a service port that provides INT[A,B,C,D] to
MSI-X translation, so only MSI-X messages are used system-
wide.

Note that INTA/B/C/D are level-sensitive.  This requires
TWO MSI-X vectors - one that targets an "interrupt set"
register and the other targets an "interrupt clear"
register.
Gotcha.
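
A sketch of that translation at the host bridge (the GIC, for example,
exposes separate set and clear registers for message-signalled SPIs; the
structure and function names here are invented):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical INTx-to-MSI-X shim: a level-sensitive INTA..INTD pin
 * needs two upstream messages -- one aimed at an "interrupt set"
 * register on assertion, one at an "interrupt clear" register on
 * deassertion -- so the level can be reconstructed by the interrupt
 * controller. */
struct intx_shim {
    uint64_t set_addr;   /* address of the set-pending register   */
    uint64_t clr_addr;   /* address of the clear-pending register */
    uint32_t data;       /* which vector/SPI this pin maps to     */
};

static void intx_level_change(const struct intx_shim *s, bool asserted,
                              void (*send_msg)(uint64_t addr, uint32_t data))
{
    send_msg(asserted ? s->set_addr : s->clr_addr, s->data);
}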

>
------------------------------------------------------------
>
It seems to me that the interrupt address needs translation
via the I/O MMU, but which of the 4 levels provides the
translation root pointers ??

On Intel the IOMMU translation tables are not shared with the
AP.
In the past 3 days I have seen "AP" used to refer both to a
random device out on the PCIe tree and to the unprivileged
application layer--both ends of the spectrum. Which is your
usage ?

The PCI address (aka Stream ID) is passed to the interrupt
controller and IOMMU and used as an index to determine the
page table root pointer.

The stream id format is

    <2:0>  PCI function number
    <7:3>  PCI device number
    <15:8> PCI bus number
    <xx:16> PCI segment (root complex) number.
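
As a one-line sketch, the stream ID is composed from those fields like so:

#include <stdint.h>

/* Compose a stream ID from the fields listed above:
 * function<2:0>, device<7:3>, bus<15:8>, segment<xx:16>. */
static uint32_t stream_id(uint32_t seg, uint8_t bus, uint8_t dev, uint8_t fn)
{
    return (seg << 16) | ((uint32_t)bus << 8) |
           (((uint32_t)dev & 0x1fu) << 3) | ((uint32_t)fn & 0x7u);
}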
I use ChipID for the last field in case each chip has its own
PCIe tree. {Except that the bits are placed elsewhere in the address.}
But (now with the new CXL) instead of allocating 200+ pins
to DRAM, those pins can be allocated to PCIe links, making any
chip much less dependent on which DRAM technology, which chip-
to-chip repeaters,... So, the thought is all I/O is PCIe + CXL,
and about the only other pins a chip gets are RESET and ClockIn.
Bunches of these pins can be 'configured' into standard-width
PCIe links (at least until one runs out of pins).
Given that one has a PCIe root complex with around 256 pins
available, does one need multiple roots of such a wide tree ?

This allows each device capable of inbound DMA to identify
itself uniquely to the interrupt controller and IOMMU.

Both intel and AMD use this convention.

>
Am I allowed to use bits in Vector Control to provide this ??
But if I put it there then there is cross privilege leakage !

No, you're not allowed to do anything not explicitly allowed
in the PCI express specification.    Remember, an MSI-X write
generated by the device is indistinguishable from any other
upstream DMA request initiated by the device.
Why did the PCI committee specify a 32-bit container and define the
use of only 1 bit ?? Or are more bits defined but I just haven't
run into any literature concerning them ?

>
c) interrupt latency {
When "what is running on a core" is timesliced by a HyperVisor,
a core that launched a command to a device may not be running
at the instant the interrupt arrives back.

See again the document referenced above.   The interrupt controller
is aware that the guest is not currently scheduled and maintains
a virtual pending state (and can optionally signal the hypervisor
that the guest should be scheduled ASAP).
Are you using the word 'signal' as in Linux signal delivery, or as
a proxy for an interrupt of some form, or perhaps as an SVC to the
HV of some form ?

Most of this is done completely by the hardware, without any
intervention by the hypervisor for the vast majority of
interrupts.
That is the goal.

>
It seems to me that the HyperVisor would want to perform ISR
processing of the interrupt (low latency) and then schedule
the softIRQs to the <sleeping> core so when it regains control
the pending I/O stack of "stuff" is properly cleaned up.
>
So, should all interrupts simply go to the HyperVisor and let HV
sort it all out? Or can the <sleeping> virtual core just deal
with it when it is given its next time slice ??

The original GIC did something like this (the HV took all
interrupts and there was a hardware mechanism to inject them
into a guest as if they were a hardware interrupt).  But
it was too much overhead going through the hypervisor, especially
when the endpoint device supports the SR-IOV capability.  So the
GIC supports handling virtual interrupt delivery completely
in hardware unless the guest is not currently resident on any
virtual CPU.
Leave HV out of the loop unless something drastic happens.
I/O completion and I/O aborts are not that drastic.
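
A sketch of the residency check described above (GICv4-style direct
injection with a doorbell fallback; all the names here are invented):

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical virtual-interrupt delivery: if the target virtual CPU is
 * resident on a physical core, inject directly in hardware with no
 * hypervisor involvement; otherwise record the interrupt as pending in
 * the vCPU's table and (optionally) raise a doorbell so the hypervisor
 * knows the guest should be scheduled. */
struct vcpu {
    bool     resident;       /* currently running on a physical core   */
    int      phys_core;      /* which core, when resident              */
    uint64_t pending[4];     /* per-vCPU pending bitmap (256 vectors)  */
    bool     doorbell_enabled;
};

static void deliver_virtual_irq(struct vcpu *v, unsigned vec,
                                void (*inject)(int core, unsigned vec),
                                void (*doorbell)(struct vcpu *v))
{
    /* vec assumed < 256 for this sketch */
    if (v->resident) {
        inject(v->phys_core, vec);             /* hardware fast path   */
    } else {
        v->pending[vec / 64] |= 1ull << (vec % 64);
        if (v->doorbell_enabled)
            doorbell(v);                       /* ask HV to schedule   */
    }
}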
Once again, I thank you greatly for your long and informative
post.
