On Fri, 14 Mar 2025 14:52:47 +0000, Scott Lurndal wrote:
> mitchalsup@aol.com (MitchAlsup1) writes:
>> On Thu, 13 Mar 2025 23:34:22 +0000, Scott Lurndal wrote:
>>> mitchalsup@aol.com (MitchAlsup1) writes:
>>>> On Thu, 13 Mar 2025 21:14:08 +0000, Scott Lurndal wrote:
>>>>> mitchalsup@aol.com (MitchAlsup1) writes:
>>>>>> On Thu, 13 Mar 2025 18:34:32 +0000, Scott Lurndal wrote:
>>>>>>> Most modern devices advertise the MSI-X capability instead.

>>>>>> And why not:: it's just a few more flip-flops and almost no more
>>>>>> sequencing logic.

>>>>> I've seen several devices with more than 200 MSI-X vectors;
>>>>> that's 96 bits per vector to store a full 64-bit address and
>>>>> 32-bit data payload.

>>>> At this point, with 200+ entries, flip-flops are not recommended;
>>>> instead these would be placed in a RAM of some sort. Since RAMs
>>>> come in 1KB and 2KB quanta, we have 1 of 2K and 1 of 1K and we
>>>> have 256 such message containers, with 1-cycle access (after
>>>> you get to that corner of some chip).
>>> That is 200 _per function_. Consider a physical function that
>>> supports the SRIOV capability and configures 2048 virtual
>>> functions. So that's 2049 * 200 MSI-X vectors just for one
>>> device. It's not unusual. These vectors are, of course,
>>> stored on the device itself. Mostly in RAMs.

>> We CPU guys deal with dozens of cores, each having 2×64KB L1 caches
>> and a 256KB-1024KB L2 cache, with that dozen cores sharing a 16MB L3
>> cache. This means the chip contains 26,624 1KB SRAM macros.
>> In the above you are complaining that the I/O device can only afford
>> a few of these, whereas we CPU guys count them in the thousands
>> (and are approaching hundreds of thousands).
>> So you can forgive me if, when the device guys shudder at the thought
>> of a dozen 1K SRAM macros as "expensive", I don't see their plight
>> immediately.

>> In My 66000, said device can place those value-holding containers
>> in actual DRAM should it want to punt the storage.
> Any device can do that today if it is designed to do so. But that
> requires an additional DMA operation to send an MSI-X interrupt;
> the device must first read the address and data fields from
> host DRAM (as configured by the device driver) before building
> the inbound memory-write TLP that gets sent from the device to
> the root port.
Was thinking about this last night::
a) device goes up and reads DRAM via L3::MC and DRC
b) DRAM data is delivered to the device 15ns later
c) device uses the data to send an MSI-X message to the interrupt 'controller'
d) interrupt controller in L3 sees the interrupt
{to hand-waving accuracy}
So, we have a dozen ns up the PCIe tree, a dozen ns over the interconnect,
50ns in DRAM, a dozen ns back over the interconnect, a dozen ns down the
PCIe tree, 1ns at the device, a dozen ns up the PCIe tree, and a dozen
across the interconnect, arriving at the interrupt service port after
about 122 ns, or the equivalent of 600± clocks, just to log the interrupt
into the table.
The Priority broadcast is going to take another dozen ns, and the core's
request for the interrupt will take another dozen to reach the service
controller; even if the service-port request is serviced instantaneously,
the MSI-X message does not arrive at the core until 72ns after arriving
at the service port--for a best-case latency on the order of 200 ns
(or 1,000 CPU cycles, or ~2,000 instructions' worth of execution).
And that is under the assumption that no traffic interference
is encountered up or down the PCIe trees.
whereas::
if the device's DRAM read request were known to contain an MSI-X
message, then a trip up and down the PCIe tree could be completely
eliminated (in terms of interrupt latency)--we still have the
4-round-trip latency of the interconnect between the service port
and the core.
> Just adds latency and requires the driver to allocate space and
> tell the device the base address of the array of vectors.

>> This adds latency
>> and decreases on-die storage. At that point, the device has an
>> unlimited number of "special things".

> Most PCIe devices are third-party IP, which you must live with.

Captain Obvious strikes again.

> Your on-chip PCIe-like devices can do as they wish, unless they're
> standard IP such as a Synopsys SATA controller, or third-party
> network-controller PCIe endpoint IP.
> I will just note that the high-end SoCs with on-chip PCIe-like
> devices that I'm familiar with all use SRAMs for the MSI-X vectors
> (unless there are only a couple, in which case flops work).

That was my originating assumption, which you spent the middle of this
thread dissuading me from ...
> The latency overhead of fetching the vector from DRAM is
> prohibitive for high-speed devices such as network controllers.

Here we have the situation where one can context-switch in a smaller
number of clock cycles than one can deliver an interrupt from
a device to a servicing core.