From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: MSI interrupts
Date: Fri, 14 Mar 2025 18:12:23 +0000
Organization: Rocksolid Light
References: <3d5200797dd507ae051195e0b2d8ff56@www.novabbs.org>
 <6731f278e3a9eb70d34250f43b7f15f2@www.novabbs.org>
 <748c0cc0ba18704b4678fd553193573e@www.novabbs.org>
 <2YJAP.403746$zz8b.238811@fx09.iad>
 <53b8227eba214e0340cad309241af7b5@www.novabbs.org>
 <3pXAP.584096$FVcd.26370@fx10.iad>
 <795b541375e3e0f53e2c76a55ffe3f20@www.novabbs.org>

On Fri, 14 Mar 2025 17:35:23 +0000, Scott Lurndal wrote:

> mitchalsup@aol.com (MitchAlsup1) writes:
>>On Fri, 14 Mar 2025 14:52:47 +0000, Scott Lurndal wrote:
>>
>>We CPU guys deal with dozens of cores, each having 2×64KB L1 caches,
>>a 256KB-1024KB L2 cache, and have that dozen cores share a 16MB L3
>>cache. This means the chip contains 26,624 1KB SRAM macros.
>
> You've a lot more area to work with, and generally a more
> recent process node.
>
>>Was thinking about this last night::
>>a) device goes up and reads DRAM via L3::MC and DRC
>>b) DRAM data is delivered to device 15ns later
>
> 15ns? That's optimistic and presumes a cache hit, right?

See the paragraph below {to hand-waving accuracy} for a more
reasonable guesstimate of 122 ns from device back to device just
reading the MSI-X message and address.

> Don't forget to factor in PCIe latency (bus to RC and RC to endpoint).
>
>>c) device uses data to send MSI-X message to interrupt 'controller'
>>d) interrupt controller in L3 sees interrupt
>>
>>{to hand-waving accuracy}
>>So, we have a dozen ns up the PCIe tree, a dozen ns over the
>>interconnect, 50 ns in DRAM, a dozen ns back over the interconnect,
>>a dozen ns down the PCIe tree, 1 ns at the device, a dozen ns up the
>>PCIe tree, and a dozen ns across the interconnect, arriving at the
>>interrupt service port after 122 ns, or about the equivalent of
>>600± clocks, to log the interrupt into the table.
>>
>>The Priority broadcast is going to take another dozen ns, and the
>>core's request for the interrupt another dozen to the service
>>controller; even if the service-port request is serviced
>>instantaneously, the MSI-X message does not arrive at the core
>>until 72 ns after arriving at the service port--for a best-case
>>latency on the order of 200 ns (or 1000 CPU cycles, or ~2,000
>>instructions' worth of execution).
>>
>>And that is under the assumption that no traffic interference
>>is encountered up or down the PCIe trees.
>>
>>whereas::
>>
>>if the device DRAM read request was known to contain an MSI-X
>>message,
>
> You can't know that a priori,

Yes, I know that:: but if you  c o u l d . . .
you could save roughly 1/2 of the interrupt-delivery-to-core latency.
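For anyone who wants to check that arithmetic, here is a toy C tally
of the budget quoted above - every constant in it is one of the
hand-waving guesses (a dozen ns per hop, 50 ns in DRAM, a 5 GHz core
clock), not a measurement:

/* Back-of-the-envelope tally of the hand-waving latency budget.
 * Every constant is a guess from the discussion above (dozen-ns
 * hops, 50 ns in DRAM, 5 GHz core clock), not a measurement. */
#include <stdio.h>

int main(void)
{
    int up_pcie = 12, interconnect = 12, dram = 50, down_pcie = 12,
        device = 1;

    int to_service_port =
          up_pcie + interconnect + dram     /* read goes up, hits DRAM    */
        + interconnect + down_pcie + device /* data returns to the device */
        + up_pcie + interconnect;           /* MSI-X message goes back up */

    int port_to_core = 72;  /* priority broadcast + core request, as above */
    double ghz = 5.0;       /* assumed core clock for the cycle counts     */

    printf("device -> interrupt service port: %d ns (~%d clocks)\n",
           to_service_port, (int)(to_service_port * ghz));
    printf("service port -> core:             %d ns\n", port_to_core);
    printf("total:                            %d ns (~%d clocks)\n",
           to_service_port + port_to_core,
           (int)((to_service_port + port_to_core) * ghz));
    return 0;
}

The legs that would vanish if the read were known to carry an MSI-X
message are the return trip down the PCIe tree to the device and back
up again, which is where the saving would come from.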
> it's just another memory write
> (or read if you need to fetch the address and data from DRAM)
> TLP as part of the inbound DMA. Which needs to hit the IOMMU
> first to translate the PCI memory space address to the host
> physical address space address.
>
> If the MSI-X tables were kept in DRAM, you also need to include
> the IOMMU translation latency in the inbound path that fetches
> the vector address and vector data (96 bits, so that's two
> round trips from the device to memory). For a virtual function,
> the MSI-X table is owned and managed by the guest, and all
> transaction addresses from the device must be translated from
> guest physical addresses to host physical addresses.
>
> A miss in the IOMMU adds a _lot_ of latency to the request.
>
> So, that's three round trips from the device to the
> Uncore/RoC just to send a single interrupt from the device.

3 dozen-ns traversals, not counting actual memory access time.
Then another dozen ns of traversal and enQueueing in the interrupt
table. Then 3 more dozen-ns round trips on the on-die interconnect.
It all adds up.

>>> The latency overhead of fetching the vector from DRAM is
>>> prohibitive for high-speed devices such as network controllers.
>>
>>Here we have the situation where one can context switch in a lower
>>number of clock cycles than one can deliver an interrupt from
>>a device to a servicing core.
>
> What a device needs to do to send an interrupt when the vectors are
> stored in host DRAM instead of internal SRAM or flops:
>
> - Send a non-posted MRD TLP to the RC to fetch the MSI-X address.
> - The receiver (the PCIe controller (RC), for example) passes the
>   MRD address to the IOMMU for translation (assuming the device and
>   host don't implement ATS). The IOMMU translates (table-walk
>   latency) the address from the TLP to a host physical address,
>   which could involve two levels of translation, so up to 22 DRAM
>   accesses (Intel/AMD/aarch64) on an IOMMU TLB miss. The latency
>   depends on the IOMMU table format - Intel has EPT, while ARM and
>   AMD use the same format as the CPU page tables for the IOMMU
>   tables. (This leaves out any further latency hit when using the
>   PCI Page Request Interface (PRI) to make the target page
>   resident.)
> - LLC/DRAM satisfies the MRD and returns data to the PCIe
>   controller, which sends a completion TLP to the device.
>   LLC (minimum) to DRAM (maximum) latency added.
> - RC/host sends the response with the address to the device.
> - Device sends a non-posted MRD TLP to the RC to fetch the MSI-X
>   data (32-bit). Again through the IOMMU, but this will likely hit
>   the TLB. Lesser latency than a miss, but nonzero.
> - RC returns a completion TLP to the device.
> - Device sends an MWR TLP (data payload) with the translated
>   address to the root complex, which passes it to the internal bus
>   structure for routing to the final address (interrupt
>   controller).
> - Then add the latency from the interrupt controller to the target
>   core (which may include making the target guest resident).
>
> That's a whole pile of latency to send an interrupt.

I bet the MSI-X messages would cache on the device rather well
... as they change roughly at the rate of VM creation.
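To make that concrete, here is a hypothetical firmware-style sketch
of such a device-side cache - the names, the table size, and the
stubbed TLP helpers are invented for illustration, not taken from any
real NIC. The idea is to pay the two DRAM round trips once per
vector, replay the cached MWR on every later interrupt, and drop the
entry when the host rewrites it:

/* Hypothetical sketch of a per-function MSI-X message cache on the
 * device side.  Names, sizes, and the stubbed TLP helpers below are
 * invented for illustration - not any real NIC's design. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define MSIX_VECTORS 64                 /* assumed table size */

struct msix_entry {
    uint64_t addr;                      /* message address, as fetched */
    uint32_t data;                      /* message data */
    bool     valid;
};

static struct msix_entry cache[MSIX_VECTORS];

/* Stand-ins for the two non-posted MRDs and the posted MWR described
 * above; a real device would emit TLPs through the RC/IOMMU here. */
static uint64_t fetch_msix_addr(unsigned vec) { return 0xFEE00000ull | (vec << 2); }
static uint32_t fetch_msix_data(unsigned vec) { return 0x4000u | vec; }
static void send_mwr(uint64_t addr, uint32_t data)
{
    printf("MWR addr=%#llx data=%#x\n", (unsigned long long)addr, data);
}

void raise_interrupt(unsigned vec)
{
    struct msix_entry *e = &cache[vec];
    if (!e->valid) {                    /* miss: two round trips to host DRAM */
        e->addr  = fetch_msix_addr(vec);
        e->data  = fetch_msix_data(vec);
        e->valid = true;
    }
    send_mwr(e->addr, e->data);         /* hit: one posted write, no DRAM read */
}

/* Host rewrote entry 'vec' (e.g. the VF was handed to a new guest):
 * drop the cached copy so the next interrupt refetches it. */
void msix_entry_written(unsigned vec)
{
    cache[vec].valid = false;
}

int main(void)
{
    raise_interrupt(3);                 /* first interrupt pays the fetches   */
    raise_interrupt(3);                 /* later ones replay the cached MWR   */
    msix_entry_written(3);              /* table changed -> refetch next time */
    raise_interrupt(3);
    return 0;
}

The only interesting part is invalidation: as long as the device can
see (or be told about) writes to its MSI-X table entries, the cache
stays coherent and the steady-state cost of raising an interrupt is a
single posted write.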