Path: ...!eternal-september.org!feeder3.eternal-september.org!panix!.POSTED.spitfire.i.gajendra.net!not-for-mail
From: cross@spitfire.i.gajendra.net (Dan Cross)
Newsgroups: comp.arch
Subject: Re: DMA is obsolete
Date: Fri, 2 May 2025 15:02:35 -0000 (UTC)
Organization: PANIX Public Access Internet and UNIX, NYC
Message-ID: <vv2mqb$hem$1@reader1.panix.com>
References: <vuj131$fnu$1@gal.iecc.com> <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org> <vv19rs$t2d$1@reader1.panix.com> <2025May2.073450@mips.complang.tuwien.ac.at>
Injection-Date: Fri, 2 May 2025 15:02:35 -0000 (UTC)
Injection-Info: reader1.panix.com; posting-host="spitfire.i.gajendra.net:166.84.136.80"; logging-data="17878"; mail-complaints-to="abuse@panix.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: cross@spitfire.i.gajendra.net (Dan Cross)
Bytes: 10629
Lines: 189

In article <2025May2.073450@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>cross@spitfire.i.gajendra.net (Dan Cross) writes:
>>In article <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org>,
>>>[snip]
>>>I suspect the 400 GHz NIC needs a rather BIG core to handle the
>>>traffic loads.
>
>Looking at
>https://chipsandcheese.com/p/arms-cortex-a53-tiny-but-important, a
>Cortex-A53 would not be up to it (at 1896MHz it can read <12GB/s and
>write <18GB/s even to the L1 cache). However, Chester Lam notes: "A53
>offers very low cache bandwidth compared to pretty much any other core
>we’ve analyzed." I think, though, that a small in-order core like the
>A53, but with enough load and store buffering and enough bandwidth to
>I/O and the memory controller should not have a problem shoveling data
>from or to a 400Gb/s NIC. With 128 bits/cycle in each direction one
>would need one transfer per cycle in each direction at 3125MHz to
>achieve 400Gb/s, or maybe 4GHz for a dual-issue core to allow for loop
>overhead. Given that the A53 typically only has 2GHz, supporting 256
>bits/cycle of transfer width (for load and store instructions, i.e.,
>along the lines of AVX-256) would be more appropriate.
>
>Going for an OoO core (something like AMD's Bobcat or Intel's
>Silvermont) would help achieve the bandwidth goals without excessive
>fine-tuning of the software.
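A quick check of the arithmetic above, as a throwaway C program; the
400Gb/s figure and the 128- and 256-bit per-cycle transfer widths are
Anton's numbers, nothing below is measured:

#include <stdio.h>

int
main(void)
{
    /* One direction of the 400Gb/s link. */
    double link_bits_per_sec = 400e9;
    /* Bits moved per cycle by a single load or store. */
    double width_bits[] = { 128.0, 256.0 };

    for (int i = 0; i < 2; i++) {
        /* Clock needed to keep up at one transfer per cycle. */
        double hz = link_bits_per_sec / width_bits[i];
        printf("%3.0f-bit transfers: %.1f MHz minimum\n",
            width_bits[i], hz / 1e6);
    }
    return 0;
}

That prints 3125.0 MHz for 128-bit transfers and 1562.5 MHz for
256-bit transfers: hence the suggestion that a ~2GHz core wants the
wider, AVX-256-like load/store path, with the leftover cycles
absorbing loop overhead.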
>>>Having the remote core run the same OS code as every other core
>>>means the OS developers have fewer hoops to jump through. Bug-for
>>>bug compatibility means that clearing of those CRs just leaves
>>>the core out in the periphery idling and bothering no one.
>>
>>Eh...Having to jump through hoops here matters less to me for
>>this kind of use case than if I'm trying to use those cores for
>>general-purpose compute.
>
>I think it's the same thing as Greenspun's tenth rule: First you find
>that a classical DMA engine is too limiting, then you find that an A53
>is too limiting, and eventually you find that it would be practical to
>run the ISA of the main cores. In particular, it allows you to use
>the toolchain of the main cores for developing them,

These are issues solvable with the software architecture and build
system for the host OS. The important characteristic is that the
software coupling makes architectural sense, and that simply does not
require using the same ISA across IPs. Indeed, consider AMD's Zen
CPUs: the PSP/ASP/whatever it's called these days is an ARM core,
while the big CPUs are x86. I'm pretty sure there's an Xtensa DSP in
there to do DRAM training and PCIe link training. Similarly with the
ME on Intel. A BMC might be running on whatever.

We increasingly see ARM-based SBCs that have small RISC-V
microcontroller-class cores embedded in the SoC for exactly this sort
of thing.

At work, our service processor (granted, outside of the SoC but
tightly coupled at the board level) is a Cortex-M7, but we wrote the
OS for that, and we control the host OS that runs on x86, so the SP
and big CPUs can be mutually aware. Our hardware RoT is a smaller
Cortex-M. We don't have a BMC on our boards; everything a BMC would do
is either done by the SP or built into the host OS, both of which are
measured by the RoT.

The problem is when such service cores are hidden (as they are in the
case of the PSP, SMU, MPIO, and similar components, to use AMD as the
example) and treated like black boxes by software. It's really cool
that I can configure the IO crossbar in useful ways tailored to
specific configurations, but it's much less cool that I have to do
what amounts to an RPC over the SMN to some totally undocumented
entity somewhere in the SoC to do it. Bluntly, as an OS person, I do
not want random bits of code running anywhere on my machine that I am
not at least aware of (yes, this includes firmware blobs on devices).

>and you can also
>use the facilities of the main cores (e.g., debugging features that
>may be absent of the I/O cores) during development.

This is interesting, but we've found it more useful going the other
way around. We do most of our debugging via the SP. Since the SP is
also responsible for system initialization and holding x86 in reset
until we're ready for it to start running, it's the obvious nexus for
debugging the system holistically. I must admit that, since we design
our own boards, we have options here that those buying from the
consumer space or traditional enterprise vendors don't, but that's one
of the considerable value-adds of hardware/software co-design.

>>Having a separate ISA means I cannot
>>accidentally run a program meant only for the big cores on the
>>IO service processors.
>
>Marking the binaries that should be able to run on the IO service
>processors with some flag, and letting the component of the OS that
>assigns processes to cores heed this flag is not rocket science.

I agree, that's easy. And yet, mistakes will be made, and there will
be tension between wanting to dedicate those CPUs to IO services and
wanting to use them for GP programs: I can easily imagine a paper
where someone modifies a scheduler to move IO-bound programs to those
cores. Using a different ISA obviates most of that, and provides an
(admittedly modest) security benefit. And if I already have to modify
or configure the OS to accommodate the existence of these things in
the first place, then accommodating an ISA difference really isn't
that much extra work. The critical observation is that a typical SMP
view of the world no longer makes sense for the system architecture,
and trying to shoehorn that model onto the hardware reality is just
going to cause frustration. Better to acknowledge that the

>You
>probably also don't want to run programs for the I/O processors on the
>main cores; whether you use a separate flag for indicating that, or
>whether one flag indicates both is an interesting question.
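For what it's worth, the flag itself is easy to sketch. Nothing below
comes from a real OS or ABI; the names and the placement-mask idea are
made up for illustration, but a two-bit mask also answers the
one-flag-or-two question by carrying both directions:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-binary placement mask, e.g. stashed in an ELF note. */
#define PLACE_APP_CORES (1u << 0)  /* the big, general-purpose cores */
#define PLACE_IO_CORES  (1u << 1)  /* the I/O service processors */

struct image {
    uint32_t placement;  /* mask of PLACE_* bits for this binary */
    /* ... */
};

struct core {
    bool io_core;        /* true for an I/O service processor */
    /* ... */
};

/*
 * Checked wherever the OS binds a runnable image to a core: exec,
 * scheduler placement, explicit affinity requests, and so on.
 */
static bool
placement_allowed(const struct image *img, const struct core *c)
{
    uint32_t need = c->io_core ? PLACE_IO_CORES : PLACE_APP_CORES;

    return (img->placement & need) != 0;
}

The interesting part, as noted above, is less the check than keeping
people from quietly widening the mask once those cores start looking
attractive for general-purpose work.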
>>>On the other hand, you buy a motherboard with said ASIC core,
>>>and you can boot the MB without putting a big chip in the
>>>socket--but you may have to deal with scant DRAM since the
>>>big centralized chip contains the memory controller.
>>
>>A neat hack for bragging rights, but not terribly practical?
>
>Very practical for updating the firmware of the board to support the
>big chip you want to put in the socket (called "BIOS FlashBack" in
>connection with AMD big chips).

"BIOS", as loaded from the EFS by the ABL on the PSP on EPYC-class
chips, is usually stored in a QSPI flash on the main board (though
starting with Turin you _can_ boot via eSPI). Strictly speaking, you
don't _need_ an x86 core to rewrite that. On our machines, we do that
from the SP, but we don't use AGESA or UEFI: all of the platform
enablement stuff done in PEI and DXE we do directly in the host OS.

Also, on AMD machines, again considering EPYC, it's up to system
software running on x86 to direct either the SMU or MPIO to configure
DXIO and the rest of the fabric before PCIe link training even begins
(releasing PCIe from PERST is done by either the SMU or MPIO,
depending on the specific microarchitecture). Where are these cores,
again? If they're close to the devices, are they in the root complex
or on the far side of a bridge? Can they even talk to the rest of the
board?

Also, since this is x86, there's the issue of starting them and
getting them to run useful software. Usually on x86 it's the
responsibility of the BSC to start APs (AGESA usually does CCX
initialization, starts all the threads, does APIC ID assignment and so
on, but then directs them to park and wait for the OS to do the usual
INIT/SIPI/SIPI dance); but if the BSC is absent because the socket is
unpopulated, what starts them? And what software are they running?
Again, it's not even clear that they have access to QSPI to boot into
e.g. AGESA; if they've got some little local ROM or flash or
something, then how does the OS get control of them? Perhaps there's
some kind of electrical interlock that brings them up if the socket is
empty, but one must answer the question of what's responsible before
assuming you can use them, and it seems like the _best_ course of
action would be to leave them in reset (or even powered off) until
explicitly enabled by software, probably via a write to some magic
capability in config space on a bridge (like how one interacts with
SMN now).

>In a case where we did not have that
>feature, and the board did not support the CPU, we had to buy another
>CPU to update the firmware
><https://www.complang.tuwien.ac.at/anton/asus-p10s-c4l.html>. That's
>especially relevant for AM4 boards, because the support chips make it
>hard to use more than 16MB Flash for firmware, but the firmware for
>all supported big chips does not fit into 16MB. However, as the case
>mentioned above shows, it's also relevant for Intel boards.

You shouldn't need to boot the host operating system to do that,
though I get that on most consumer-grade machines you'll do it via
something that interfaces with AGESA or UEFI. Most server-grade
machines will have a BMC that can do this independently of the main
CPU, and I should be clear that I'm discounting use cases for
consumer-grade boards, where I suspect something like this is less
interesting than on server hardware. As I mentioned, we build our own
boards, and this just isn't an

========== REMAINDER OF ARTICLE TRUNCATED ==========