Subject : Re: DMA is obsolete
From : mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Groups : comp.arch
Date : 26. Apr 2025, 20:25:05
Organisation : Rocksolid Light
Message-ID : <da5b3dea460370fc1fe8ad2323da9bc4@www.novabbs.org>
References : 1 2
User-Agent : Rocksolid Light
On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
John Levine <johnl@taugh.com> writes:
Well, not entirely. This preprint argues that in environments with
lots of cores and where latency is an issue, programmed I/O can
outperform DMA.
>
Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent
Interconnects
>
Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe
>
<snip abstract>
>
>
https://arxiv.org/abs/2409.08141
>
Interesting article, thanks for posting.
>
Their conclusion is not at all surprising for the operations they target
in the paper. PCI express throughput has increased with each
generation, and PCI express latencies have decreased with each
generation. There are certain workloads enabled by CXL that
benefit from reduced PCIe latency, but those are primarily
aimed at increasing directly accessible memory.
>
However, I expect there are still benefits in using DMA for bulk data
transfer, particularly for network packet handling where
throughput is more interesting than PCI MMIO latency.
I would like to add a thought to the concept under discussion::
Does the paper's conclusion hold better or worse if/when the
core ISA contains both LDM/STM and MM instructions. LDM/STM
allow for several sequential registers to move to/from MMI/O
memory in a single interconnect transaction, while MM allows
for up-to page-sized transfers in a single instruction and
only 2 interconnect transactions.
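The transaction-count argument above can be sketched in C. This is a hypothetical model, not real device code: the "device BAR" is an ordinary buffer so the code runs anywhere, and the function names are invented for illustration. On real hardware each volatile access in the loop would be one MMIO interconnect transaction; LDM/STM would batch several registers per transaction, and an MM instruction could describe the whole range at once.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Per-word PIO: for n words, roughly 2*n interconnect transactions
 * (one MMIO read plus one MMIO write per word). */
static void pio_copy_words(volatile uint64_t *dst,
                           const volatile uint64_t *src, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        dst[i] = src[i];     /* each access = one MMIO transaction */
}

/* Block move standing in for a hypothetical MM instruction: the whole
 * range is described once, so a capable interconnect could complete an
 * up-to-page-sized transfer in 2 transactions instead of 2*n. */
static void mm_copy(void *dst, const void *src, size_t nbytes)
{
    memcpy(dst, src, nbytes);   /* one architectural operation */
}
```

The point is not the copy itself but the ratio of instructions (and bus transactions) to bytes moved, which is what the LDM/STM and MM forms improve.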
One concern that arises from the paper are the security
implications of device access to the cache coherency
protocol. Not an issue for a well-behaved device, but
potentially problematic in a secure environment with
third-party CXL-mem devices.
Citation please !?!
Also note:: device DMA goes through I/O MMU which adds a
modicum of security-fencing around device DMA accesses
but also adding latency.
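That security-fencing can be pictured as a translation plus permission check on every device access. A toy model in C (all names hypothetical; a real IOMMU walks multi-level page tables and caches translations in an IOTLB, which is where the added latency comes from):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical single-window IOMMU model: the device is granted one
 * range of I/O virtual addresses; every DMA access is bounds- and
 * permission-checked before being translated to a physical address. */
struct iommu_window {
    uint64_t iova_base;   /* start of the device's I/O virtual range */
    uint64_t length;      /* window size in bytes                    */
    uint64_t phys_base;   /* physical address the window maps to     */
    bool     writable;    /* does the device have write permission?  */
};

/* Returns the translated physical address, or 0 to model a fault. */
static uint64_t iommu_translate(const struct iommu_window *w,
                                uint64_t iova, bool is_write)
{
    if (iova < w->iova_base || iova - w->iova_base >= w->length)
        return 0;                    /* outside the window: blocked  */
    if (is_write && !w->writable)
        return 0;                    /* permission violation         */
    return w->phys_base + (iova - w->iova_base);
}
```

A misbehaving device simply faults instead of scribbling over arbitrary memory; the cost is that every transaction pays for the check.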
At 3Leaf Systems, we extended the coherency domain over
IB or 10 GbE to encompass multiple servers in a
single coherency domain, which both facilitated I/O
and provided a single shared physical address space across
multiple servers (up to 16). CXL-mem is basically the same
but using PCIe instead of IB.
IB == InfiniBand ?!?
Granted, that was close to 20 years ago, and switch latencies
were significant (100ns for IB, far more for Ethernet).
>
CXL-mem is a similar technology with a different transport (we
looked at InfiniBand, 10 GbE Ethernet, and "advanced switching"
(a flavor of PCIe)). InfiniBand was the most mature of the
three technologies, and switch latencies were significantly lower
for IB than for the competing transports.
>
Today, my CPOE sells a couple of CXL 2.0 enabled PCIe devices for
memory expansion; one has 16 high-end ARM V cores.
>
Quoting from the article (p.2)
" As a second example: for throughput-oriented workloads
DMA has evolved to efficiently transfer data to and from
main memory without polluting the CPU cache. However, for
small, fine-grained interactions, it is important that almost all
the data gets into the right CPU cache as quickly as possible."
>
Most modern CPU's support "allocate" hints on inbound DMA
that will automatically place the data in the right CPU cache as
quickly as possible.
>
Decomposing that packet transfer into CPU loads and stores
in a coherent fabric doesn't gain much, and burns more power
on the "device" than a DMA engine.
That was my initial thought--core performing lots of LD/ST to
MMI/O is bound to consume more power than device DMA.
Secondarily, using 1-few cores to perform PIO is not going to
have the data land in the cache of the core that will run when
the data has been transferred. The data lands in the cache doing
PIO and not in the one to receive control after I/O is done.
{{It may still be "closer than" memory--but several cache
coherence protocols take longer cache-cache than dram-cache.}}
It's interesting they use one of the processors (designed in 2012)
that we built over a decade ago (and they misspell our company name :-)
in their research computer. That processor does have
in their research computer. That processor does have
a mechanism to allocate data in cache on inbound DMA[*]; it's
worth noting that the 48 cores on that processor are in-order
cores. The text comparing it with a modern i7 at 3.6 GHz
doesn't note that.
>
[*] Although I don't recall if that mechanism was documented
in the public processor technical documentation.
>
Their description of the Thunder X-1 processor cache is not accurate;
it's not PIPT, it is PIVT (implemented in such a way as to
appear to software as if it were PIPT). The V drops a cycle
off the load-to-use latency.
Generally it's VIPT (virtual index, physical tag), with a few bits
of virtual aliasing to disambiguate.
It was also the only ARM64 processor chip we built with a cache-coherent
interconnect until the recent CXL based products.
>
Overall, a very interesting paper.
Reminds me of trying to sell a micro x86-64 to AMD as a project.
The µ86 is a small x86-64 core made available as IP in Verilog
where it has/runs the same ISA as main GBOoO x86, but is placed
"out in the PCIe" interconnect--performing I/O services
topologically adjacent to the device itself. This allows 1 ns
access latencies to DCRs and performing OS queueing of DPCs,...
without bothering the GBOoO cores.
AMD didn't buy the arguments.