Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Terje Mathisen <terje.mathisen@tmsw.no>
Newsgroups: comp.arch
Subject: Re: DMA is obsolete
Date: Sat, 26 Apr 2025 19:28:21 +0200
Organization: A noiseless patient Spider
Lines: 42
Message-ID: <vuj53m$2s0jv$1@dont-email.me>
References: <vuj131$fnu$1@gal.iecc.com>
 <slrn100q2dv.eisl.lars@cleo.beagle-ears.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 26 Apr 2025 19:28:22 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="5e2967bf5c7bd177c5e627af12d074d3";
	logging-data="3015295"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+w13I/pfzBjGhSquyJ+v51dny5c0i9ZlBPtLYjrzMK5w=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101
 Firefox/128.0 SeaMonkey/2.53.20
Cancel-Lock: sha1:dGELwokkKlRQbeUo/8ZB33qNtr0=
In-Reply-To: <slrn100q2dv.eisl.lars@cleo.beagle-ears.com>
Bytes: 2670

Lars Poulsen wrote:
> On 2025-04-26, John Levine <johnl@taugh.com> wrote:
>> Well, not entirely.  This preprint argues that in environments with
>> lots of cores and where latency is an issue, programmed I/O can outperform
>> DMA.
>>
>> Rethinking Programmed I/O for Fast Devices, Cheap Cores, and Coherent Interconnects
>>
>> Anastasiia Ruzhanskaia, Pengcheng Xu, David Cock, Timothy Roscoe
[snip]
>>
>> https://arxiv.org/abs/2409.08141
> 
> What is the difference between DMA and message-passing to another core
> doing a CMOV loop at the ISA level?
> 
> DMA means doing that in the micro-engine instead of at the ISA level.
> Same difference.
> 
> What am I missing?
> 

I think, in the end, it all comes down to power:

If the DMA engine can move n GB of data using less total power than 
having a regular core do it with programmed I/O, then the DMA engine wins.

OTOH, I have argued here in c.arch that for most data input streams, a 
regular core is going to look at the data eventually, and in that case 
the same core can do the work and either process it directly (in 
register-file-sized or smaller blocks) or work as a prefetcher, first 
loading up $L1-sized blocks and then processing each chunk.
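
Something like this rough C sketch, assuming the device exposes a 
coherent, memory-mapped buffer (dev_buf, process_chunk and the 4 KB 
CHUNK size are just placeholder names, not anything from the paper):

/* Core acting as its own "DMA engine" via programmed I/O: pull one
 * roughly L1-sized block at a time from the device with ordinary
 * loads, then process it while it is still hot in $L1. */
#include <stddef.h>
#include <stdint.h>

#define CHUNK 4096

void pio_consume(volatile const uint8_t *dev_buf, size_t len,
                 void (*process_chunk)(const uint8_t *, size_t))
{
    uint8_t chunk[CHUNK];

    for (size_t off = 0; off < len; off += CHUNK) {
        size_t n = (len - off < CHUNK) ? len - off : CHUNK;

        /* "Prefetch" phase: copy one block into core-local storage. */
        for (size_t i = 0; i < n; i++)
            chunk[i] = dev_buf[off + i];

        /* Process the chunk while it is still resident in $L1. */
        process_chunk(chunk, n);
    }
}

The point being that the loads which would otherwise have been DMA 
writes double as the prefetch, so the data is already resident when 
process_chunk() touches it.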

On the gripping hand, if the data is going out instead, or you only 
need to look at a small percentage of the incoming cache lines' worth 
of data, then the more power-efficient DMA engine can still win.

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"