Path: ...!eternal-september.org!feeder3.eternal-september.org!panix!.POSTED.spitfire.i.gajendra.net!not-for-mail
From: cross@spitfire.i.gajendra.net (Dan Cross)
Newsgroups: comp.arch
Subject: Re: DMA is obsolete
Date: Fri, 2 May 2025 15:02:35 -0000 (UTC)
Organization: PANIX Public Access Internet and UNIX, NYC
Message-ID: <vv2mqb$hem$1@reader1.panix.com>
References: <vuj131$fnu$1@gal.iecc.com> <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org> <vv19rs$t2d$1@reader1.panix.com> <2025May2.073450@mips.complang.tuwien.ac.at>
Injection-Date: Fri, 2 May 2025 15:02:35 -0000 (UTC)
Injection-Info: reader1.panix.com; posting-host="spitfire.i.gajendra.net:166.84.136.80"; logging-data="17878"; mail-complaints-to="abuse@panix.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: cross@spitfire.i.gajendra.net (Dan Cross)
Bytes: 10629
Lines: 189

In article <2025May2.073450@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>cross@spitfire.i.gajendra.net (Dan Cross) writes:
>>In article <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org>,
>>>[snip]
>>>I suspect the 400 GHz NIC needs a rather BIG core to handle the
>>>traffic loads.
>
>Looking at
>https://chipsandcheese.com/p/arms-cortex-a53-tiny-but-important, a
>Cortex-A53 would not be up to it (at 1896MHz it can read <12GB/s and
>write <18GB/s even to the L1 cache). However, Chester Lam notes: "A53
>offers very low cache bandwidth compared to pretty much any other core
>we’ve analyzed." I think, though, that a small in-order core like the
>A53, but with enough load and store buffering and enough bandwidth to
>I/O and the memory controller should not have a problem shoveling data
>from or to a 400Gb/s NIC. With 128 bits/cycle in each direction one
>would need one transfer per cycle in each direction at 3125MHz to
>achieve 400Gb/s, or maybe 4GHz for a dual-issue core to allow for loop
>overhead. Given that the A53 typically only has 2GHz, supporting 256
>bits/cycle of transfer width (for load and store instructions, i.e.,
>along the lines of AVX-256) would be more appropriate.
>
>Going for an OoO core (something like AMD's Bobcat or Intel's
>Silvermont) would help achieve the bandwidth goals without excessive
>fine-tuning of the software.
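A quick check of the arithmetic above, as a throwaway C program; the
400Gb/s figure and the 128- and 256-bit per-cycle transfer widths are
Anton's numbers, nothing below is measured:

#include <stdio.h>

int
main(void)
{
    /* One direction of the 400Gb/s link. */
    double link_bits_per_sec = 400e9;
    /* Bits moved per cycle by a single load or store. */
    double width_bits[] = { 128.0, 256.0 };

    for (int i = 0; i < 2; i++) {
        /* Clock needed to keep up at one transfer per cycle. */
        double hz = link_bits_per_sec / width_bits[i];
        printf("%3.0f-bit transfers: %.1f MHz minimum\n",
            width_bits[i], hz / 1e6);
    }
    return 0;
}

That prints 3125.0 MHz for 128-bit transfers and 1562.5 MHz for
256-bit transfers: hence the suggestion that a ~2GHz core wants the
wider, AVX-256-like load/store path, with the leftover cycles
absorbing loop overhead.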
>>>Having the remote core run the same OS code as every other core
>>>means the OS developers have fewer hoops to jump through. Bug-for
>>>bug compatibility means that clearing of those CRs just leaves
>>>the core out in the periphery idling and bothering no one.
>>
>>Eh...Having to jump through hoops here matters less to me for
>>this kind of use case than if I'm trying to use those cores for
>>general-purpose compute.
>
>I think it's the same thing as Greenspun's tenth rule: First you find
>that a classical DMA engine is too limiting, then you find that an A53
>is too limiting, and eventually you find that it would be practical to
>run the ISA of the main cores. In particular, it allows you to use
>the toolchain of the main cores for developing them,

These are issues solvable with the software architecture and build
system for the host OS. The important characteristic is that the
software coupling makes architectural sense, and that simply does not
require using the same ISA across IPs. Indeed, consider AMD's Zen
CPUs: the PSP/ASP/whatever it's called these days is an ARM core,
while the big CPUs are x86. I'm pretty sure there's an Xtensa DSP in
there to do DRAM training and PCIe link training. Similarly with the
ME on Intel. A BMC might be running on whatever.

We increasingly see ARM-based SBCs that have small RISC-V
microcontroller-class cores embedded in the SoC for exactly this sort
of thing.

At work, our service processor (granted, outside of the SoC but
tightly coupled at the board level) is a Cortex-M7, but we wrote the
OS for that, and we control the host OS that runs on x86, so the SP
and big CPUs can be mutually aware. Our hardware RoT is a smaller
Cortex-M. We don't have a BMC on our boards; everything a BMC would do
is either done by the SP or built into the host OS, both of which are
measured by the RoT.

The problem is when such service cores are hidden (as they are in the
case of the PSP, SMU, MPIO, and similar components, to use AMD as the
example) and treated like black boxes by software. It's really cool
that I can configure the IO crossbar in useful ways tailored to
specific configurations, but it's much less cool that I have to do
what amounts to an RPC over the SMN to some totally undocumented
entity somewhere in the SoC to do it. Bluntly, as an OS person, I do
not want random bits of code running anywhere on my machine that I am
not at least aware of (yes, this includes firmware blobs on devices).

>and you can also
>use the facilities of the main cores (e.g., debugging features that
>may be absent of the I/O cores) during development.

This is interesting, but we've found it more useful going the other
way around. We do most of our debugging via the SP. Since the SP is
also responsible for system initialization and holding x86 in reset
until we're ready for it to start running, it's the obvious nexus for
debugging the system holistically. I must admit that, since we design
our own boards, we have options here that those buying from the
consumer space or traditional enterprise vendors don't, but that's one
of the considerable value-adds of hardware/software co-design.

>>Having a separate ISA means I cannot
>>accidentally run a program meant only for the big cores on the
>>IO service processors.
>
>Marking the binaries that should be able to run on the IO service
>processors with some flag, and letting the component of the OS that
>assigns processes to cores heed this flag is not rocket science.

I agree, that's easy. And yet, mistakes will be made, and there will
be tension between wanting to dedicate those CPUs to IO services and
wanting to use them for GP programs: I can easily imagine a paper
where someone modifies a scheduler to move IO-bound programs to those
cores. Using a different ISA obviates most of that, and provides an
(admittedly modest) security benefit. And if I already have to modify
or configure the OS to accommodate the existence of these things in
the first place, then accommodating an ISA difference really isn't
that much extra work. The critical observation is that a typical SMP
view of the world no longer makes sense for the system architecture,
and trying to shoehorn that model onto the hardware reality is just
going to cause frustration. Better to acknowledge that the

>You
>probably also don't want to run programs for the I/O processors on the
>main cores; whether you use a separate flag for indicating that, or
>whether one flag indicates both is an interesting question.
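For what it's worth, the flag itself is easy to sketch. Nothing below
comes from a real OS or ABI; the names and the placement-mask idea are
made up for illustration, but a two-bit mask also answers the
one-flag-or-two question by carrying both directions:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-binary placement mask, e.g. stashed in an ELF note. */
#define PLACE_APP_CORES (1u << 0)  /* the big, general-purpose cores */
#define PLACE_IO_CORES  (1u << 1)  /* the I/O service processors */

struct image {
    uint32_t placement;  /* mask of PLACE_* bits for this binary */
    /* ... */
};

struct core {
    bool io_core;        /* true for an I/O service processor */
    /* ... */
};

/*
 * Checked wherever the OS binds a runnable image to a core: exec,
 * scheduler placement, explicit affinity requests, and so on.
 */
static bool
placement_allowed(const struct image *img, const struct core *c)
{
    uint32_t need = c->io_core ? PLACE_IO_CORES : PLACE_APP_CORES;

    return (img->placement & need) != 0;
}

The interesting part, as noted above, is less the check than keeping
people from quietly widening the mask once those cores start looking
attractive for general-purpose work.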
>>>On the other hand, you buy a motherboard with said ASIC core,
>>>and you can boot the MB without putting a big chip in the
>>>socket--but you may have to deal with scant DRAM since the
>>>big centralized chip contains the memory controller.
>>
>>A neat hack for bragging rights, but not terribly practical?
>
>Very practical for updating the firmware of the board to support the
>big chip you want to put in the socket (called "BIOS FlashBack" in
>connection with AMD big chips).

"BIOS", as loaded from the EFS by the ABL on the PSP on EPYC-class
chips, is usually stored in a QSPI flash on the main board (though
starting with Turin you _can_ boot via eSPI). Strictly speaking, you
don't _need_ an x86 core to rewrite that. On our machines, we do that
from the SP, but we don't use AGESA or UEFI: all of the platform
enablement stuff done in PEI and DXE we do directly in the host OS.

Also, on AMD machines, again considering EPYC, it's up to system
software running on x86 to direct either the SMU or MPIO to configure
DXIO and the rest of the fabric before PCIe link training even begins
(releasing PCIe from PERST is done by either the SMU or MPIO,
depending on the specific microarchitecture). Where are these cores,
again? If they're close to the devices, are they in the root complex
or on the far side of a bridge? Can they even talk to the rest of the
board?

Also, since this is x86, there's the issue of starting them and
getting them to run useful software. Usually on x86 it's the
responsibility of the BSC to start APs (AGESA usually does CCX
initialization, starts all the threads, does APIC ID assignment and so
on, but then directs them to park and wait for the OS to do the usual
INIT/SIPI/SIPI dance); but if the BSC is absent because the socket is
unpopulated, what starts them? And what software are they running?
Again, it's not even clear that they have access to QSPI to boot into
e.g. AGESA; if they've got some little local ROM or flash or
something, then how does the OS get control of them? Perhaps there's
some kind of electrical interlock that brings them up if the socket is
empty, but one must answer the question of what's responsible before
assuming you can use them, and it seems like the _best_ course of
action would be to leave them in reset (or even powered off) until
explicitly enabled by software, probably via a write to some magic
capability in config space on a bridge (like how one interacts with
SMN now).

>In a case where we did not have that
>feature, and the board did not support the CPU, we had to buy another
>CPU to update the firmware
><https://www.complang.tuwien.ac.at/anton/asus-p10s-c4l.html>. That's
>especially relevant for AM4 boards, because the support chips make it
>hard to use more than 16MB Flash for firmware, but the firmware for
>all supported big chips does not fit into 16MB. However, as the case
>mentioned above shows, it's also relevant for Intel boards.

You shouldn't need to boot the host operating system to do that,
though I get that on most consumer-grade machines you'll do it via
something that interfaces with AGESA or UEFI. Most server-grade
machines will have a BMC that can do this independently of the main
CPU, and I should be clear that I'm discounting use cases for
consumer-grade boards, where I suspect something like this is less
interesting than on server hardware. As I mentioned, we build our own
boards, and this just isn't an

========== REMAINDER OF ARTICLE TRUNCATED ==========