Path: ...!eternal-september.org!feeder3.eternal-september.org!news.iecc.com!.POSTED.news.iecc.com!not-for-mail
From: John Levine <johnl@taugh.com>
Newsgroups: comp.arch
Subject: DMA is obsolete
Date: Sat, 26 Apr 2025 16:19:45 -0000 (UTC)
Organization: Taughannock Networks
Message-ID: <vuj131$fnu$1@gal.iecc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 26 Apr 2025 16:19:45 -0000 (UTC)
Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970";
	logging-data="16126"; mail-complaints-to="abuse@iecc.com"
Cleverness: some
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: johnl@iecc.com (John Levine)
Bytes: 2746
Lines: 35

Well, not entirely.  This preprint argues that in environments with
lots of cores and where latency is an issue, programmed I/O can outperform
DMA.

Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects

Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe

Conventional wisdom holds that an efficient interface between an OS
running on a CPU and a high-bandwidth I/O device should use Direct
Memory Access (DMA) to offload data transfer, descriptor rings for
buffering and queuing, and interrupts for asynchrony between cores and
device. In this paper we question this wisdom in the light of two
trends: modern and emerging cache-coherent interconnects like CXL3.0,
and workloads, particularly microservices and serverless computing.
Like some others before us, we argue that the assumptions of the
DMA-based model are obsolete, and in many use-cases programmed I/O,
where the CPU explicitly transfers data and control information to and
from a device via loads and stores, delivers a more efficient system.
However, we push this idea much further. We show, in a real hardware
implementation, the gains in latency for fine-grained communication
achievable using an open cache-coherence protocol which exposes cache
transitions to a smart device, and that throughput is competitive with
DMA over modern interconnects. We also demonstrate three use-cases:
fine-grained RPC-style invocation of functions on an accelerator,
offloading of operators in a streaming dataflow engine, and a network
interface targeting serverless functions, comparing our use of
coherence with both traditional DMA-style interaction and a
highly-optimized implementation using memory-mapped programmed I/O
over PCIe.

https://arxiv.org/abs/2409.08141
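
For anyone who wants the contrast in the abstract spelled out, here is a
minimal sketch in C. It is not from the paper. The "device" is a
hypothetical struct in ordinary memory so the code runs anywhere, and
every name in it (sim_device, pio_send, dma_post, device_poll) is made
up for illustration. With programmed I/O the CPU stores the payload into
the device window itself; with the DMA-style path the CPU only posts a
descriptor and the device fetches the data later.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE  64
#define RING_SIZE 4

/* Hypothetical device, modeled as plain memory (an assumption, not real MMIO). */
struct sim_device {
    volatile uint8_t  pio_buf[BUF_SIZE]; /* PIO window: CPU stores payload here */
    volatile uint32_t pio_len;           /* nonzero length doubles as doorbell  */

    struct { const uint8_t *addr; uint32_t len; } ring[RING_SIZE];
    uint32_t head, tail;                 /* device consumes head, CPU fills tail */

    uint8_t  rx[BUF_SIZE];               /* what the device ended up receiving  */
    uint32_t rx_len;
};

/* Programmed I/O: the CPU moves every byte with explicit stores. */
void pio_send(struct sim_device *dev, const uint8_t *data, uint32_t len)
{
    assert(len > 0 && len <= BUF_SIZE);
    for (uint32_t i = 0; i < len; i++)
        dev->pio_buf[i] = data[i];
    /* Real hardware would need a store barrier before the doorbell write. */
    dev->pio_len = len;
}

/* DMA style: the CPU writes only a descriptor; returns -1 if the ring is full. */
int dma_post(struct sim_device *dev, const uint8_t *data, uint32_t len)
{
    assert(len > 0 && len <= BUF_SIZE);
    if (dev->tail - dev->head == RING_SIZE)
        return -1;
    uint32_t slot = dev->tail % RING_SIZE;
    dev->ring[slot].addr = data;
    dev->ring[slot].len  = len;
    dev->tail++;
    return 0;
}

/* Simulated device side: handle one pending message, return 1 if there was one. */
int device_poll(struct sim_device *dev)
{
    if (dev->pio_len != 0) {
        uint32_t len = dev->pio_len;
        for (uint32_t i = 0; i < len; i++)
            dev->rx[i] = dev->pio_buf[i];
        dev->rx_len = len;
        dev->pio_len = 0;                /* clear the doorbell */
        return 1;
    }
    if (dev->head != dev->tail) {
        uint32_t slot = dev->head % RING_SIZE;
        /* This memcpy stands in for the device's DMA read of host memory. */
        memcpy(dev->rx, dev->ring[slot].addr, dev->ring[slot].len);
        dev->rx_len = dev->ring[slot].len;
        dev->head++;
        return 1;
    }
    return 0;
}
```

The sketch only shows who moves the bytes. It says nothing about
latency, and it leaves out interrupts and the cache-coherence protocol
that the paper relies on.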
-- 
Regards,
John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies",
Please consider the environment before reading this e-mail. https://jl.ly