Path: ...!weretis.net!feeder9.news.weretis.net!panix!.POSTED.spitfire.i.gajendra.net!not-for-mail
From: cross@spitfire.i.gajendra.net (Dan Cross)
Newsgroups: comp.arch
Subject: Re: DMA is obsolete
Date: Fri, 2 May 2025 02:15:24 -0000 (UTC)
Organization: PANIX Public Access Internet and UNIX, NYC
Message-ID: 
References: <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org>
Injection-Date: Fri, 2 May 2025 02:15:24 -0000 (UTC)
Injection-Info: reader1.panix.com; posting-host="spitfire.i.gajendra.net:166.84.136.80";
	logging-data="29773"; mail-complaints-to="abuse@panix.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: cross@spitfire.i.gajendra.net (Dan Cross)
Bytes: 6546
Lines: 120

In article <5a77c46910dd2100886ce6fc44c4c460@www.novabbs.org>,
MitchAlsup1 wrote:
>On Thu, 1 May 2025 13:07:07 +0000, Dan Cross wrote:
>> In article ,
>> MitchAlsup1 wrote:
>>>On Sat, 26 Apr 2025 17:29:06 +0000, Scott Lurndal wrote:
>>>[snip]
>>>Reminds me of trying to sell a micro x86-64 to AMD as a project.
>>>The µ86 is a small x86-64 core made available as IP in Verilog,
>>>where it has/runs the same ISA as the main GBOoO x86, but is placed
>>>"out in the PCIe" interconnect--performing I/O services topo-
>>>logically adjacent to the device itself.  This allows 1ns access
>>>latencies to DCRs and performing OS queueing of DPCs, ... without
>>>bothering the GBOoO cores.
>>>
>>>AMD didn't buy the arguments.
>>
>> I can see it either way; I suppose the argument as to whether I
>> buy it or not comes down to "it depends".  How much control do
>> I, as the OS implementer, have over this core?
>
>Other than it being placed "away" from the centralized cores,
>it runs the same ISA as the main cores, has longer latency to
>coherent memory, and shorter latency to device control registers
>--which is why it is placed close to the device itself:: latency.
>The big fast centralized core is going to get microsecond latency
>from the MMI/O device whereas the ASIC version will have a handful
>of nanosecond latencies.  So the 5 GHz core sees ~1 microsecond
>while the little ASIC sees 10 nanoseconds.

Yes, I get the argument for WHY you'd do it; I just want to make
sure that it's an ordinary core (albeit one that is far away from
the sockets with the main SoC complexes) that I interact with in
the usual manner.  Compare to, say, MP1 or MP0 on AMD Zen, which
run their own (proprietary) firmware that I interact with via an
RPC protocol over an AXI bus, if I interact with them at all: most
OEMs just punt and run AGESA (we don't).

>> If it is yet another hidden core embedded somewhere deep in the
>> SoC complex and I can't easily interact with it from the OS,
>> then no thanks: we've got enough of those between MP0, MP1, MP5,
>> etc, etc.
>>
>> On the other hand, if it's got a "normal" APIC ID, the OS has
>> control over it like any other LP, and it's coherent with the big
>> cores, then yeah, sign me up: I've been wanting something like
>> that for a long time now.
>
>It is just a core that is cheap enough to put in ASICs, that
>can offload some I/O burden without you having to do anything
>other than setting some bits in some CRs so interrupts are
>routed to this core rather than some more centralized core.

Sounds good.

>> Consider a virtualization application.  A problem with, say,
>> SR-IOV is that very often the hypervisor wants to interpose some
>> sort of administrative policy between the virtual function and
>> whatever it actually corresponds to, but get out of the fast
>> path for most IO.  This implies a kind of offload architecture
>> where there's some (presumably software) agent dedicated to
>> handling IO that can be parameterized with such a policy.  A
>
>Interesting:: Could you cite any literature, here!?!

Sure.
This paper is a bit older, but gets at the main points:
https://www.usenix.org/system/files/conference/nsdi18/nsdi18-firestone.pdf
I don't know whether the details are public for similar
technologies from Amazon or Google.

>> core very close to the device could handle that swimmingly,
>> though I'm not sure it would be enough to do it at (say) line
>> rate for a 400Gbps NIC or Gen5 NVMe device.
>
>I suspect the 400 Gb/s NIC needs a rather BIG core to handle the
>traffic loads.

Indeed.  Part of the challenge for the hyperscalers is meeting
that demand while not burning too many host resources, which are
the thing they're actually selling their customers in the first
place.  A lot of folks are pushing this off to the NIC itself,
and I've seen at least one team that implemented NVMe in firmware
on a 100Gbps NIC, exposed via SR-IOV, as part of a disaggregated
storage architecture.  Another option is to push this to the
switch; things like Intel Tofino2 were well-positioned for this,
but of course Intel, in its infinite wisdom and vision, canceled
Tofino.

>> ....but why x86_64?  It strikes me that as long as the _data_
>> formats vis-a-vis the software-visible ABI are the same, it
>> doesn't need to use the same ISA.  In fact, I can see advantages
>> to not doing so.
>
>Having the remote core run the same OS code as every other core
>means the OS developers have fewer hoops to jump through.  Bug-for-
>bug compatibility means that clearing of those CRs just leaves
>the core out in the periphery, idling and bothering no one.

Eh... having to jump through hoops here matters less to me for
this kind of use case than if I'm trying to use those cores for
general-purpose compute.  Having a separate ISA means I cannot
accidentally run a program meant only for the big cores on the
I/O service processors.  As long as the OS has total control over
the execution of the core, and it participates in whatever cache
coherency scheme the rest of the system uses, the ISA just isn't
that important.
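To make that concrete, here's a minimal sketch (all the names are
made up, not from any real interface) of the kind of shared data
format I mean: a completion-queue entry whose memory layout is pinned
with fixed-width types and compile-time checks, so the big cores and
the I/O service core can each be built for a different ISA and still
agree on the bytes:

```c
/* Hypothetical sketch: an ISA-neutral completion-queue entry shared
 * between the big cores and an I/O service core.  What matters is
 * the in-memory layout, not the ISA interpreting it, so the layout
 * is pinned with fixed-width types and compile-time checks. */
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

struct io_cqe {
    uint64_t cookie;      /* opaque tag echoed back by the I/O core */
    uint32_t status;      /* completion status code */
    uint16_t queue_id;    /* submission queue this completes */
    uint16_t flags;       /* e.g. a phase bit for lockless polling */
};

/* Both sides compile these checks; if either ISA's toolchain lays
 * the struct out differently, the build fails instead of the
 * protocol silently skewing. */
static_assert(sizeof(struct io_cqe) == 16, "CQE must be 16 bytes");
static_assert(offsetof(struct io_cqe, status) == 8, "status at +8");
static_assert(offsetof(struct io_cqe, flags) == 14, "flags at +14");
```

You'd want the same discipline for endianness and for the memory
ordering around whatever doorbell or phase bit the two sides use,
but the point stands: the contract lives in the data layout, not the
instruction set.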
>On the other hand, you buy a motherboard with said ASIC core,
>and you can boot the MB without putting a big chip in the
>socket--but you may have to deal with scant DRAM, since the
>big centralized chip contains the memory controller.

A neat hack for bragging rights, but not terribly practical?

Anyway, it's a neat idea.  It's very reminiscent of IBM channel
controllers, in a way.

	- Dan C.