From: "Chris M. Thomasson"
Newsgroups: comp.arch
Subject: Re: Memory ordering
Date: Tue, 30 Jul 2024 12:56:44 -0700

On 7/30/2024 2:51 AM, Anton Ertl wrote:
> mitchalsup@aol.com (MitchAlsup1) writes:
>> On Mon, 29 Jul 2024 13:21:10 +0000, Anton Ertl wrote:
>>
>>> mitchalsup@aol.com (MitchAlsup1) writes:
>>>> On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
>>>>> Similarly, I expect that hardware that is designed for good TSO or
>>>>> sequential consistency performance will run faster on code written
>>>>> for this model than code written for weakly consistent hardware
>>>>> will run on that hardware.
>>>>
>>>> According to Lamport, only the ATOMIC stuff needs sequential
>>>> consistency. So it is completely possible to have a causally
>>>> consistent processor that switches to sequential consistency when
>>>> doing ATOMIC stuff, and gain performance when not doing ATOMIC
>>>> stuff, and gain programmability when doing atomic stuff.
>>>
>>> That's not what I have in mind. What I have in mind is hardware
>>> that, e.g., speculatively performs loads, predicting that no other
>>> core will store there with an earlier time stamp. But if another
>>> core actually performs such a store, the usual misprediction
>>> handling happens and the code starting from that mispredicted load
>>> is reexecuted. So as long as two cores do not access the same
>>> memory, they can run at full speed, and there is only slowdown if
>>> there is actual (not potential) communication between the cores.
>>
>> OK...
>>
>>> A problem with that approach is that this requires enough reorder
>>> buffering (or something equivalent; there may be something cheaper
>>> for this particular problem) to cover at least the shared-cache
>>> latency (usually L3, more with multiple sockets).
>>
>> The depth of the execution window may be smaller than the time it
>> takes to send the required information around and have this core
>> recognize that it is out-of-order wrt memory.
>
> So if we don't want to stall for memory accesses all the time, we
> need a bigger execution window, either by making the reorder buffer
> larger, or by using a different, cheaper mechanism.
>
> Concerning the cheaper mechanism, what I am thinking of is hardware
> checkpointing every, say, 200 cycles or so (subject to fine-tuning).
> The idea here is that communication between cores is very rare, so
[...]

Communication between cores _should_ be rare, as rare as can be. The
software should be designed to strive to reduce it, ideally all the
way down to the embarrassingly parallel level.
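A minimal sketch of what I mean, assuming C++17 (everything here,
padded_sum and all, is made up for illustration): each worker owns a
disjoint slice of the data and writes only to its own
cache-line-aligned accumulator, so the cores never have to talk to
each other until the final join:

// compile with: g++ -std=c++17 -O2 -pthread
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// One accumulator per worker, padded to an assumed 64-byte cache
// line so neighbors never false-share.  C++17 is assumed for the
// over-aligned allocation inside std::vector.
struct alignas(64) padded_sum {
    unsigned long value = 0;
};

int main() {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;                        // fallback if unknown
    std::vector<unsigned long> data(1u << 20, 1);
    std::vector<padded_sum> sums(n);
    std::vector<std::thread> workers;
    std::size_t chunk = data.size() / n;
    for (unsigned i = 0; i < n; ++i) {
        std::size_t lo = i * chunk;
        std::size_t hi = (i + 1 == n) ? data.size() : lo + chunk;
        // Each thread reads and writes only its own slice: zero
        // cross-core communication while the workers run.
        workers.emplace_back([&, i, lo, hi] {
            for (std::size_t j = lo; j < hi; ++j)
                sums[i].value += data[j];
        });
    }
    for (auto& t : workers) t.join();         // the only synchronization
    unsigned long total = 0;
    for (auto const& s : sums) total += s.value;
    std::printf("total = %lu\n", total);
    return 0;
}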
However, misuse, say a bug in the core mapping wrt affinity masks and
such, can increase core-to-core communication/traffic... Not good.
The terrible case is when the traffic targets a remote core, one not
all that close (physically) to the current core, the core running the
code that wants the data; NUMA, so to speak... If a core/node _must_
communicate with another core/node, locality in the topology should
allow it to "converse" with a "local" core/node, as in "gain the
closest core/node to the source"!
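On the affinity side, a hedged sketch, assuming Linux with GNU
pthreads underneath std::thread; pthread_setaffinity_np is
nonportable, and which CPU numbers are actually "close" to each other
is platform-specific (libnuma or hwloc can query the real topology):

// compile with: g++ -std=c++17 -O2 -pthread
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               // for CPU_ZERO/CPU_SET on glibc
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

// Pin a std::thread to one CPU so the scheduler cannot migrate it
// to a distant core behind our back.  (The thread may run briefly
// before the mask takes effect; fine for a sketch.)
static bool pin_to_cpu(std::thread& t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(t.native_handle(),
                                  sizeof set, &set) == 0;
}

int main() {
    // Assumed for the sketch: CPUs 0 and 1 sit on the same node.
    // Check the real topology before relying on that.
    std::thread a([] { /* work that stays local to its core */ });
    std::thread b([] { /* ditto */ });
    if (!pin_to_cpu(a, 0) || !pin_to_cpu(b, 1))
        std::fprintf(stderr, "affinity request failed\n");
    a.join();
    b.join();
    return 0;
}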