Deutsch English Français Italiano |
<vuj131$fnu$1@gal.iecc.com> View for Bookmarking (what is this?) Look up another Usenet article |
Path: ...!eternal-september.org!feeder3.eternal-september.org!news.iecc.com!.POSTED.news.iecc.com!not-for-mail From: John Levine <johnl@taugh.com> Newsgroups: comp.arch Subject: DMA is obsolete Date: Sat, 26 Apr 2025 16:19:45 -0000 (UTC) Organization: Taughannock Networks Message-ID: <vuj131$fnu$1@gal.iecc.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit Injection-Date: Sat, 26 Apr 2025 16:19:45 -0000 (UTC) Injection-Info: gal.iecc.com; posting-host="news.iecc.com:2001:470:1f07:1126:0:676f:7373:6970"; logging-data="16126"; mail-complaints-to="abuse@iecc.com" Cleverness: some X-Newsreader: trn 4.0-test77 (Sep 1, 2010) Originator: johnl@iecc.com (John Levine) Bytes: 2746 Lines: 35 Well, not entirely. This preprint argues that in environments with lots of cores and where latency is an issue, programmed I/O can outperform DMA. Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe Conventional wisdom holds that an efficient interface between an OS running on a CPU and a high-bandwidth I/O device should use Direct Memory Access (DMA) to offload data transfer, descriptor rings for buffering and queuing, and interrupts for asynchrony between cores and device. In this paper we question this wisdom in the light of two trends: modern and emerging cache-coherent interconnects like CXL3.0, and workloads, particularly microservices and serverless computing. Like some others before us, we argue that the assumptions of the DMA-based model are obsolete, and in many use-cases programmed I/O, where the CPU explicitly transfers data and control information to and from a device via loads and stores, delivers a more efficient system. However, we push this idea much further. We show, in a real hardware implementation, the gains in latency for fine-grained communication achievable using an open cache-coherence protocol which exposes cache transitions to a smart device, and that throughput is competitive with DMA over modern interconnects. We also demonstrate three use-cases: fine-grained RPC-style invocation of functions on an accelerator, offloading of operators in a streaming dataflow engine, and a network interface targeting serverless functions, comparing our use of coherence with both traditional DMA-style interaction and a highly-optimized implementation using memory-mapped programmed I/O over PCIe. https://arxiv.org/abs/2409.08141 -- Regards, John Levine, johnl@taugh.com, Primary Perpetrator of "The Internet for Dummies", Please consider the environment before reading this e-mail. https://jl.ly