From: "Paul A. Clayton"
Newsgroups: comp.arch
Subject: Re: MM instruction and the pipeline
Date: Wed, 16 Oct 2024 16:48:39 -0400

On 10/16/24 1:56 AM, Stephen Fuld wrote:
> Even though this is about the MM instruction, and the MM
> instruction is mentioned in other threads, they have lots of other
> stuff (thread drift), and this isn't related to C, standard or
> otherwise, so I thought it best to start a new thread,
>
> My questions are about what happens to subsequent instructions
> that immediately follow the MM in the stream when an MM
> instruction is executing. Since an MM instruction may take quite
> a long time (in computer time) to complete I think it is useful to
> know what else can happen while the MM is executing.

This would seem to be very implementation dependent.
Architecturally, no following instructions can execute until after
the MM completes. With respect to microarchitecture, an arbitrary
amount of parallelism could be provided.

> I will phrase this as a series of questions.

While Mitch Alsup can answer these more authoritatively, I will
take a stab at them.

> 1. I assume that subsequent non-memory reference instructions
> can proceed simultaneously with the MM. Is that correct?

This would probably be true even for the in-order scalar
implementation.

> 2. Can a load or store where the memory address is in neither
> the source nor the destination of the MM proceed simultaneously
> with the MM

This is a little more complicated than just marking a register as
not-ready (for a load destination), so it might not be supported in
a simple implementation. Memory accesses would have to check both
ranges rather than just one of 32 register names or eight store
buffer entries.

Mitch Alsup's description of the small quasi-scalar core implies
to me that the MM instruction would occupy the memory access
interface until it is finished.

I would guess that any out-of-order implementation would allow
loads and stores outside of the MM regions to proceed
speculatively until the various OoO buffering structures are
filled.

> 3. Can a load where the memory address is within the source of
> the MM proceed?

My guess would be that any OoO implementation would support this.
If the implementation checks for a hit in both ranges, it would
seem to be little extra effort to allow a load to a 'clean'
address to proceed.

Supporting this and preventing reads of the destination and all
stores would only require one address range check; loads can
proceed as long as they are not within the destination.

> For the next questions, assume for exposition that the MM has
> proceeded to complete 1/3 of the move when the following
> instructions come up.
>
> 4. Can a load in the first third of the destination range proceed?

I would guess that an out-of-order implementation would forward
data from all stores performed speculatively by the MM (limited by
the store queue). MM stores that are no longer speculative (those
before the point where an interrupt would place the count) would
seem to be naturally handled as if they were singular committed
stores, i.e., following instructions could speculatively execute
using those values.

> 5. Can a store in the first third of the source range proceed?

In the non-speculative region of the MM, speculative stores could
"execute", storing to the store queue. These stores would be
squashed, along with all other instructions after the MM, if the
MM does not fully complete. The MM is synchronous.

A large MM that is no longer speculative might be implemented as
avoiding the store queue to allow more stores after the MM to be
speculated. For very large MMs, a copy engine farther from the
core might be used.

> 6. Can a store in the first third of the destination range
> proceed?

Since the MM has architecturally completed to roughly that point
(some stores might only have "completed" to the store queue), it
would not be difficult to support speculative stores in the
completed range for an out-of-order implementation. These stores
would be rolled back if the MM does not fully complete and commit.

Here is a question that I will leave to Mitch:
Can an MM that has confirmed permissions commit before it has
been performed, such that uncorrectable errors would be recognized
not on read of the source but on a later read of the destination?

I could see some wanting to depend on the copy checking data
validity synchronously, but some might be okay with a
quasi-synchronous copy that allows the processor to continue doing
work outside of the MM.

If a translation map is provided for coherence, any MM could
commit once it is not speculative but before the actual copy has
been performed. Tracking what parts have been completed in the
presence of other stores would have significant overhead.

For page-aligned copies, a copy-on-write mechanism might be used.

There are also cache designs which support deduplication; cache
block aligned copies might be faster than physical copying. With
lossy/truncated cache compression, unaligned fragments might be
deduplicated (and read-for-ownership might be avoided similar to
having fine-grained valid bits).

I rather suspect that what is physically possible is far broader than what is possible with a finite engineering budget.
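
To make the range checks from questions 2 through 6 concrete, here
is a minimal sketch in Python of the two policies I guessed at
above. Everything in it is my own illustration, not a description
of My 66000 hardware: the names Range, may_proceed_simple and
may_proceed_two_range are hypothetical, the copy is assumed to run
in ascending address order with non-overlapping source and
destination, and "done" stands for the byte count an interrupt
would report. Holding stores to the not-yet-copied part of the
source is a conservative assumption of mine. "May proceed" here
means "may execute speculatively"; such instructions would still
be squashed if the MM does not complete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    base: int    # first byte address
    length: int  # size in bytes

    def overlaps(self, addr: int, size: int) -> bool:
        # Half-open intervals [base, base+length) and [addr, addr+size).
        if self.length <= 0 or size <= 0:
            return False
        return addr < self.base + self.length and self.base < addr + size


def may_proceed_simple(dst: Range, is_store: bool,
                       addr: int, size: int) -> bool:
    """One range check: hold all stores, hold loads that touch dst."""
    if is_store:
        return False
    return not dst.overlaps(addr, size)


def may_proceed_two_range(src: Range, dst: Range, done: int,
                          is_store: bool, addr: int, size: int) -> bool:
    """Two range checks plus a progress count (hypothetical policy).

    done = bytes already copied, assuming an ascending copy.
    """
    pending_src = Range(src.base + done, src.length - done)
    pending_dst = Range(dst.base + done, dst.length - done)
    if is_store:
        # A store must not change source bytes the MM has yet to read,
        # nor destination bytes the MM will later overwrite.
        return (not pending_src.overlaps(addr, size)
                and not pending_dst.overlaps(addr, size))
    # A load must not read destination bytes the MM has yet to write.
    return not pending_dst.overlaps(addr, size)
```

With a 0x300-byte move that is one third done (done = 0x100), the
two-range version lets loads and stores into the first third of
either range go ahead and holds accesses that touch the remaining
two thirds of the destination (and stores that touch the remaining
source), while the simple version only ever releases loads that
miss the destination entirely.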