From: "Paul A. Clayton" <paaronclayton@gmail.com>
Newsgroups: comp.arch
Subject: Re: MM instruction and the pipeline
Date: Wed, 16 Oct 2024 16:48:39 -0400
Message-ID: <vep8rb$2d8ru$1@dont-email.me>
References: <venkii$23b6b$1@dont-email.me>

On 10/16/24 1:56 AM, Stephen Fuld wrote:
> Even though this is about the MM instruction, and the MM
> instruction is mentioned in other threads, they have lots of other
> stuff (thread drift), and this isn't related to C, standard or
> otherwise, so I thought it best to start a new thread.
>
> My questions are about what happens to subsequent instructions
> that immediately follow the MM in the stream when an MM
> instruction is executing. Since an MM instruction may take quite
> a long time (in computer time) to complete, I think it is useful
> to know what else can happen while the MM is executing.

This would seem to be very implementation-dependent.
Architecturally, no following instructions can execute until after
the MM completes. With respect to microarchitecture, an arbitrary
amount of parallelism could be provided.

> I will phrase this as a series of questions.

While Mitch Alsup can answer these more authoritatively, I will
take a stab at them.

> 1. I assume that subsequent non-memory reference instructions
> can proceed simultaneously with the MM. Is that correct?

This would probably be true even for the in-order scalar
implementation.

> 2. Can a load or store where the memory address is in neither
> the source nor the destination of the MM proceed simultaneously
> with the MM?

This is a little more complicated than just marking a register as
not-ready (for a load destination), so it might not be supported in
a simple implementation. Memory accesses would have to check both
ranges rather than just one of 32 register names or eight store
buffer entries. Mitch Alsup's description of the small quasi-scalar
core implies to me that the MM instruction would occupy the memory
access interface until it is finished.

I would guess that any out-of-order implementation would allow
loads and stores outside of the MM regions to proceed speculatively
until the various OoO buffering structures are filled.

> 3. Can a load where the memory address is within the source of
> the MM proceed?

My guess would be that any OoO implementation would support this.
If the implementation checks for a hit in both ranges, it would
seem to be little extra effort to allow a load to a 'clean' address
to proceed. Supporting this while preventing reads of the
destination and all stores would require only one address range
check: loads can proceed as long as they are not within the
destination.
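To make those range checks concrete, here is a rough C sketch of
the kind of disambiguation filter I have in mind. The struct and
function names are purely illustrative (mine, not anything from
Mitch's documents), and real hardware would do this with range
comparators rather than code, of course.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative state for an in-flight MM: source base, destination
   base, and the number of bytes being moved. */
typedef struct {
    uint64_t src;
    uint64_t dst;
    uint64_t len;
} mm_state;

/* True if [addr, addr+size) overlaps [base, base+len). */
static bool overlaps(uint64_t base, uint64_t len,
                     uint64_t addr, uint64_t size)
{
    return addr < base + len && base < addr + size;
}

/* A younger load only conflicts with the MM's destination; reading
   the (unmodified) source is harmless. */
static bool load_may_proceed(const mm_state *mm,
                             uint64_t addr, uint64_t size)
{
    return !overlaps(mm->dst, mm->len, addr, size);
}

/* A conservative check for a younger store: stay clear of both the
   source (so the data being copied does not change underfoot) and
   the destination (so the store does not race the copy). */
static bool store_may_proceed(const mm_state *mm,
                              uint64_t addr, uint64_t size)
{
    return !overlaps(mm->src, mm->len, addr, size) &&
           !overlaps(mm->dst, mm->len, addr, size);
}

The two-range test in store_may_proceed() is what I meant by
checking both ranges; the single test in load_may_proceed() is the
one-address-range-check case where loads proceed as long as they
avoid the destination.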
> For the next questions, assume for exposition that the MM has
> proceeded to complete 1/3 of the move when the following
> instructions come up.
>
> 4. Can a load in the first third of the destination range proceed?

I would guess that an out-of-order implementation would forward
data from all stores performed speculatively by the MM (limited by
the store queue). MM stores that are no longer speculative (those
before the point where an interrupt would place the count) would
seem to be naturally handled as if they were ordinary committed
stores; i.e., following instructions could speculatively execute
using those values.

> 5. Can a store in the first third of the source range proceed?

In the non-speculative region of the MM, speculative stores could
"execute", storing to the store queue. These stores would be
squashed, along with all other instructions after the MM, if the MM
does not fully complete. The MM is synchronous.

A large MM that is no longer speculative might be implemented to
bypass the store queue, allowing more stores after the MM to be
speculated. For very large MMs, a copy engine farther from the core
might be used.

> 6. Can a store in the first third of the destination range
> proceed?

Since the MM has architecturally completed to roughly that point
(some stores might only have "completed" to the store queue), it
would not be difficult for an out-of-order implementation to
support speculative stores in the completed range. These stores
would be rolled back if the MM does not fully complete and commit.

Here is a question that I will leave to Mitch: Can an MM that has
confirmed permissions commit before it has been performed, such
that uncorrectable errors would be recognized not on read of the
source but on a later read of the destination? I could see some
wanting to depend on the copy checking data validity synchronously,
but some might be okay with a quasi-synchronous copy that allows
the processor to continue doing work outside of the MM.

If a translation map is provided for coherence, any MM could commit
once it is not speculative but before the actual copy has been
performed. Tracking which parts have been completed in the presence
of other stores would have significant overhead.

For page-aligned copies, a copy-on-write mechanism might be used.
There are also cache designs which support deduplication;
cache-block-aligned copies might be faster than physical copying.
With lossy/truncated cache compression, unaligned fragments might
be deduplicated (and read-for-ownership might be avoided, similar
to having fine-grained valid bits).

I rather suspect that what is physically possible is far broader
than what is possible with a finite engineering budget.
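As a postscript, here is one more rough sketch for the 1/3-done
cases in questions 4 through 6. Again, the names and structure are
mine alone: the idea is just that an access landing entirely within
the already-copied prefix of the destination can be ordered after
the MM's committed stores, while anything beyond the progress point
still has to wait for (or conflict with) the MM.

#include <stdbool.h>
#include <stdint.h>

/* Illustrative progress state for an in-flight MM: destination
   base, total length, and how many bytes have already been copied
   (the value an interrupt would leave in the count). */
typedef struct {
    uint64_t dst;
    uint64_t len;
    uint64_t done;   /* e.g. len / 3 in the scenario above */
} mm_progress;

/* True if the access lies entirely within the already-copied
   prefix of the destination, so it can be treated like an access
   that follows ordinary committed stores rather than waiting for
   the whole MM to finish. */
static bool in_completed_prefix(const mm_progress *mm,
                                uint64_t addr, uint64_t size)
{
    return addr >= mm->dst &&
           addr + size <= mm->dst + mm->done;
}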