From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: MSI interrupts
Date: Fri, 14 Mar 2025 18:12:23 +0000
Organization: Rocksolid Light
Message-ID: <aceeec2839b8824d52f0cbe709af51e1@www.novabbs.org>
References: <vqto79$335c6$1@dont-email.me> <3d5200797dd507ae051195e0b2d8ff56@www.novabbs.org> <YyFAP.729659$eNx6.235106@fx14.iad> <6731f278e3a9eb70d34250f43b7f15f2@www.novabbs.org> <AUHAP.61125$Xq5f.14972@fx38.iad> <748c0cc0ba18704b4678fd553193573e@www.novabbs.org> <2YJAP.403746$zz8b.238811@fx09.iad> <53b8227eba214e0340cad309241af7b5@www.novabbs.org> <3pXAP.584096$FVcd.26370@fx10.iad> <795b541375e3e0f53e2c76a55ffe3f20@www.novabbs.org> <vNZAP.37553$D_V4.18229@fx39.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light

On Fri, 14 Mar 2025 17:35:23 +0000, Scott Lurndal wrote:

> mitchalsup@aol.com (MitchAlsup1) writes:
>>On Fri, 14 Mar 2025 14:52:47 +0000, Scott Lurndal wrote:
>>
>
>>
>>We CPU guys deal with dozens of cores, each having 2×64KB L1 caches,
>>a 256KB-1024KB L2 cache, and have that dozen cores share a 16MB L3
>>cache. This means the chip contains 26,624 1KB SRAM macros.
>
> You've a lot more area to work with, and generally a more
> recent process node.
>
>
>>
>>Was thinking about this last night::
>>a) device goes up and reads DRAM via L3::MC and DRC
>>b) DRAM data is delivered to device 15ns later
>
> 15ns?   That's optimistic and presumes a cache hit, right?

See the paragraph below {to hand-waving accuracy} for a more
reasonable guesstimate of 122 ns, from the device and back again,
just to read the MSI-X message and address.

> Don't forget to factor in PCIe latency (bus to RC and RC to endpoint).
>
>>c) device uses data to send MSI-X message to interrupt 'controller'
>>d) interrupt controller in L3 sees interrupt
>>
>>{to hand waving accuracy}
>>So, we have dozen ns up the PCIe tree, dozen ns over the interconnect,
>>50ns in DRAM, dozens ns over the interconnect, dozens of ns down the
>>PCIe tree, 1ns at device, dozen ns up the PCIe tree, dozens across
>>interconnect, arriving at interrupt service port after 122 ns or
>>about the equivalent of 600± clocks to log the interrupt into the
>>table.
>>
>>The Priority broadcast is going to take another dozen ns, core
>>request for interrupt will be another dozen to service controller,
>>even if the service port request is serviced instantaneously,
>>the MSI-X message does not arrive at core until 72ns after arriving
>>at service port--for a best case latency on the order of 200 ns
>>(or 1000 CPU cycles or ~ 2,000 instructions worth of execution.)
>>
>>And that is under the assumption that no traffic interference
>>is encountered up or down the PCIe trees.
>>
>>whereas::
>>
>>if the device DRAM read request was known to contain an MSI-X
>>message,
>
> You can't know that a priori,

Yes, I know that:: but if you c o u l d . . . you could save roughly
half of the latency of delivering the interrupt to the core.

>                               it's just another memory write
> (or read if you need to fetch the address and data from DRAM)
> TLP as part of the inbound DMA.   Which needs to hit the IOMMU
> first to translate the PCI memory space address to the host
> physical address space address.
>
> If the MSI-X tables were kept in DRAM, you also need to include
> the IOMMU translation latency in the inbound path that fetches
> the vector address and vector data (96-bits, so that's two
> round trips from the device to memory).  For a virtual function,
> the MSI-X table is owned and managed by the guest, and all
> transaction addresses from the device must be translated from
> guest physical addresses to host physical addresses.
>
> A miss in the IOMMU adds a _lot_ of latency to the request.
>
> So, that's three round trips from the device to the
> Uncore/RoC just to send a single interrupt from the device.

Three dozen-ns traversals, not counting the actual memory access
time. Then another dozen-ns traversal plus enqueueing into the
interrupt table. Then three more dozen-ns round trips on the on-die
interconnect.

It all adds up.
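
To make the hand-waving concrete, here is a toy tally in C using the
rough figures from this thread: a dozen ns per PCIe-tree traversal, a
dozen ns per on-die interconnect crossing, 50 ns in DRAM, ~1 ns of
device turnaround, and ~72 ns from the service port to the core.
These are guesses, not measurements, and the "fused" case is only the
hypothetical above, where the read is somehow known to carry an
MSI-X message.

/* Toy latency tally for MSI-X delivery when the message lives in
 * DRAM.  All figures are the hand-waved numbers from this thread,
 * not measurements. */
#include <stdio.h>

enum {
    PCIE_HOP     = 12, /* one traversal up or down the PCIe tree     */
    INTERCONNECT = 12, /* one crossing of the on-die interconnect    */
    DRAM_ACCESS  = 50, /* DRAM access behind the L3/memory controller*/
    DEVICE_TURN  = 1,  /* device turns the data into an MSI-X MWR    */
    PORT_TO_CORE = 72  /* priority broadcast + core handshake        */
};

int main(void)
{
    /* Baseline: device reads the message from DRAM, then sends it
     * back up to the interrupt 'controller' sitting in the L3.     */
    int fetch    = PCIE_HOP + INTERCONNECT + DRAM_ACCESS
                 + INTERCONNECT + PCIE_HOP;
    int deliver  = DEVICE_TURN + PCIE_HOP + INTERCONNECT;
    int baseline = fetch + deliver;

    /* Hypothetical: the read is known to carry an MSI-X message, so
     * the L3-resident controller logs it as the DRAM data goes by,
     * skipping the trip back down to the device and up again.      */
    int fused = PCIE_HOP + INTERCONNECT + DRAM_ACCESS;

    printf("baseline: %3d ns to table, %3d ns to core\n",
           baseline, baseline + PORT_TO_CORE);
    printf("fused   : %3d ns to table, %3d ns to core\n",
           fused, fused + PORT_TO_CORE);

    /* Still ignored: IOMMU walks, the second read needed for the
     * 96-bit address+data pair, and any PCIe traffic interference. */
    return 0;
}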

>
>>> The latency overhead of fetching the vector from DRAM is
>>> prohibitive for high-speed devices such as network controllers.
>>
>>Here we have the situation where one can context switch in a lower
>>number of clock cycles than one can deliver an interrupt from
>>a device to a servicing core.
>
> Device needs to send an interrupt when vectors stored in host DRAM
> instead of internal SRAM or flops:
>
>   - send non-posted MRD TLP to RC to fetch MSI-X address
>   - receiver (pcie controller (RC), for example) passes
>     MRD address to IOMMU for translation (assuming
>     the device and host don't implement ATS),
>     IOMMU translates (table walk latency) the
>     address from the TLP to a host physical
>     address (which could involve two levels of
>     translation, so up to 22 DRAM accesses  (intel/amd/aarch64)
>     on IOMMU TLB miss).  The latency is dependent
>     up on the IOMMU table format - Intel has EPT
>     while ARM and AMD use the same format as the CPU
>     page tables for the IOMMU tables.
>     (this leaves out any further latency hit when
>      using the PCI Page Request Interface (PRI) to make
>      the target page resident).
>   - LLC/DRAM satisfies the MRD and returns data to
>     PCIe controller, which sends a completion TLP
>     to device.  LLC (minimum), DRAM (maximum) latency added.
>   - RC/host sends response with address to device
>   - Device sends non-posted MRD TLP to RC to fetch MSI-X Data
>     (32-bit). Again with the IOMMU, but will likely
>     hit TLB.  Lesser latency than a miss, but nonzero.
>   - RC returns completion TLP to device.
>   - Device sends MWR TLP (data payload) with the translated
>     address to the root complex, which passes it to the
>     internal bus structure for routing to the final address
>     (interrupt controller).
>   - Then add the latency from the interrupt controller to
>     the target core (which may include making the target guest
>     resident).
>
> That's a whole pile of latency to send an interrupt.

I bet the MSI-X messages would cache rather well on the device ...
as they change roughly at the rate of VM creation.
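
As a sketch of what such a device-side cache of MSI-X address/data
pairs might look like -- the names, sizes, and invalidate scheme
below are made up for illustration and don't describe any real
device.  A hit turns an interrupt into a single posted MWR; a miss
falls back to the non-posted reads described above and then refills
the entry:

/* Illustrative device-side cache of MSI-X address/data pairs,
 * indexed by vector number.  Everything here is a sketch.          */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define MSIX_CACHE_ENTRIES 64  /* one per vector this function uses */

struct msix_cached_vector {
    uint64_t addr;   /* translated MSI-X message address            */
    uint32_t data;   /* 32-bit MSI-X message data                   */
    bool     valid;  /* cleared when the host rewrites the entry    */
};

static struct msix_cached_vector msix_cache[MSIX_CACHE_ENTRIES];

/* Invalidate when the OS/guest updates the MSI-X table -- roughly
 * at the rate of VM creation, so rare next to interrupt delivery.  */
void msix_cache_invalidate(unsigned vec)
{
    if (vec < MSIX_CACHE_ENTRIES)
        msix_cache[vec].valid = false;
}

/* Hit: one posted MWR, no DRAM round trips.  Miss: caller does the
 * two non-posted reads to host memory, then calls fill().          */
bool msix_cache_lookup(unsigned vec, uint64_t *addr, uint32_t *data)
{
    if (vec >= MSIX_CACHE_ENTRIES || !msix_cache[vec].valid)
        return false;
    *addr = msix_cache[vec].addr;
    *data = msix_cache[vec].data;
    return true;
}

void msix_cache_fill(unsigned vec, uint64_t addr, uint32_t data)
{
    if (vec < MSIX_CACHE_ENTRIES) {
        msix_cache[vec].addr  = addr;
        msix_cache[vec].data  = data;
        msix_cache[vec].valid = true;
    }
}

int main(void)
{
    uint64_t addr;
    uint32_t data;

    /* Cold: first interrupt on vector 3 misses, does the slow path,
     * then fills the entry (values here are made up).              */
    if (!msix_cache_lookup(3, &addr, &data))
        msix_cache_fill(3, 0xfee00000ull, 0x4041u);

    /* Warm: every later interrupt on vector 3 is a single MWR.     */
    if (msix_cache_lookup(3, &addr, &data))
        printf("hit: addr=%#llx data=%#x\n",
               (unsigned long long)addr, (unsigned)data);
    return 0;
}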