From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: MSI interrupts
Date: Mon, 17 Mar 2025 21:49:18 +0000
Organization: Rocksolid Light
Message-ID: <ef12021b16a514c71a5cab2f0efa60c7@www.novabbs.org>

On Mon, 17 Mar 2025 18:33:09 +0000, EricP wrote:

> Michael S wrote:
>> On Mon, 17 Mar 2025 13:38:12 GMT
>> scott@slp53.sl.home (Scott Lurndal) wrote:
------------------
>>
>> The problem Robert is talking about arises when there are many
>> interrupt sources and many target CPUs.
>> The required routing/prioritization/acknowledgment logic (at least
>> the naive logic I have in mind) would be either non-scalable or
>> relatively complicated. In the latter case the selection process
>> will take multiple cycles (I am thinking of a ring).
>
> Another problem is what the core does with the in-flight
> instructions.
>
> Method 1 is simplest: it injects the interrupt request at Retire,
> as that's where the state of everything is synchronized.
> The consequence is that, like exceptions, the in-flight instructions
> all get purged, and we save the committed RIP, RSP and interrupt
> control word.
> While that might be acceptable for a 5-stage in-order pipeline,
> it could be pretty expensive for an OoO 200+ entry instruction
> queue, potentially tossing hundreds of cycles of nearly finished
> work.

Lowest interrupt latency.
Highest waste of power (i.e., work).

> Method 2 pipelines the switch by injecting the interrupt request at
> Fetch.
> Decode converts the request to a special uOp that travels down the
> IQ to Retire and allows all the older work to complete.
> This is more complex, as it requires a two-phase hand-off from the
> Interrupt Control Unit (ICU) to the core: a branch mispredict in
> the in-flight instructions might cause a tentative interrupt
> acceptance to later be withdrawn.

Interrupt latency is dependent on the executing instructions.
Lowest waste of power.

But note: in most cases, it already took the interrupt ~150
nanoseconds to arrive at the interrupt service port:
1 trip from device to DRAM (possibly serviced by L3),
1 trip from DRAM back to device,
1 trip from device to interrupt service port;
and 4 DRAM (or L3) accesses to log the interrupt into the table.
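To put rough numbers on that path, a back-of-the-envelope in C.
The per-hop costs are assumptions picked to sum to the ~150ns figure
above; they are not measured My 66000 numbers:

#include <stdio.h>

int main(void)
{
    const int trip_ns   = 30;  /* assumed cost of one chip-crossing trip */
    const int access_ns = 15;  /* assumed cost of one DRAM/L3 access     */
    int trips    = 3;          /* device->DRAM, DRAM->device, device->ISP */
    int accesses = 4;          /* logging the interrupt into the table    */

    /* 3*30 + 4*15 = 150 ns before any core even hears about it */
    printf("arrival latency ~= %d ns\n",
           trips * trip_ns + accesses * access_ns);
    return 0;
}

Under those assumptions, the arrival path, not the core's pipeline,
dominates the common-case latency.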
Also, in most cases, the 200-odd instructions in the window will
finish in ~100 cycles (as little as 20ns)--but if the FDIV unit is
saturated, interrupt latency could be as high as 640 cycles and as
long as 640ns.

> The ICU believes the core is in a state to accept a higher priority
> interrupt. It sends a request to the core, which checks its current
> state and sends back an immediate INT_ACK if it _might_ accept (and
> stalls Fetch), or a NAK.

In My 66000, the ICU knows nothing about the priority level (or
state) of any core in the system. Instead, when a new higher-priority
interrupt is raised, the ISP broadcasts a 64-bit mask indicating
which priority levels in the interrupt table have pending interrupts,
as an MMI/O message to the address of the interrupt table. All cores
monitoring that interrupt table capture the broadcast, and each core
decides whether to negotiate for an interrupt (not that one in
particular) by requesting the highest-priority interrupt from the
table. When the request returns, and it is still at a higher priority
than the core is running at, the core performs the interrupt control
transfer. If the interrupt is below the core's priority, it is
returned to the ISP as if NAKed.

Prior to the interrupt control transfer, the core remains running
whatever it was running--all the interrupt work is done by state
machines at the edge of the core and at the L3/DRAM controller.

> When the special uOp reaches Retire, it sends a signal to Fetch,
> which then sends an INT_ACCEPT signal to the ICU to complete the
> handoff.
> If a branch mispredict occurs that causes interrupts to be
> disabled, then Fetch sends an INT_REJECT to the ICU, and unstalls
> its fetching.
> (Yes, that is not optimal - make it work first, make it work well
> second.)
>
> This also raises a question about what the ICU is doing during this
> long-latency handoff. One wouldn't want the ICU to sit idle, so it
> might have to manage the handoff of multiple interrupts to multiple
> cores at the same time, each as its own little state machine.

One must assume that the ISP is capable of taking a new interrupt
from a device every 5-ish cycles, that an interrupt handoff is in the
range of 50 cycles, and that each interrupt could be to a different
interrupt table. The My 66000 ISP treats successive requests to any
one table as strongly ordered, and requests to different tables as
completely unordered.

> One should see that this decision on how the core handles the
> handoff has a large impact on the design complexity of the ICU.

I did not "see" that in My 66000's interrupt architecture. The ISP
complexity is fixed, and the core's interrupt negotiator is a small
state machine (~10 states). The ISP essentially performs 4-5 64-bit
memory accesses, and possibly one 64-bit MMI/O broadcast, on arrival
of an MSI-X interrupt. Then, if a core negotiates, it performs 3 more
memory accesses per negotiator.
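For concreteness, a minimal C sketch of that core-edge negotiation.
Every name here is illustrative, and the "function calls" stand in
for bus transactions; the real negotiator is a ~10-state hardware
machine at the edge of the core, not software:

#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t payload;   /* whatever the table entry carries          */
    int      priority;  /* priority level of the claimed interrupt   */
    bool     valid;     /* false if another core claimed it first    */
} int_claim;

/* Placeholder stubs for the actual bus transactions. */
static int_claim isp_request_highest(uint64_t table)
{
    (void)table;
    return (int_claim){ .payload = 0, .priority = 0, .valid = false };
}
static void isp_return_naked(uint64_t table, int_claim c)
{
    (void)table; (void)c;
}
static void core_interrupt_transfer(int_claim c)
{
    (void)c;
}

/* Invoked when the MMI/O broadcast for 'table' is captured.
   'mask' has one bit per priority level with pending interrupts. */
void on_broadcast(uint64_t table, uint64_t mask, int core_priority)
{
    if (mask == 0)
        return;
    /* Highest set bit = highest pending priority level (0..63). */
    int highest = 63 - __builtin_clzll(mask);
    if (highest <= core_priority)
        return;                      /* nothing above us: ignore    */

    int_claim c = isp_request_highest(table);
    if (!c.valid)
        return;                      /* another core won the race   */

    if (c.priority > core_priority)
        core_interrupt_transfer(c);  /* accept: control transfer    */
    else
        isp_return_naked(table, c);  /* no longer above us: as if NAKed */
}

The property the sketch tries to preserve is the one described
above: the core claims *an* interrupt rather than a specific one,
re-checks priority when the claim returns, and hands the entry back
to the ISP if it no longer qualifies.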