From: "Chris M. Thomasson"
Newsgroups: comp.arch
Subject: Re: Memory ordering
Date: Tue, 30 Jul 2024 12:56:44 -0700

On 7/30/2024 2:51 AM, Anton Ertl wrote:
> mitchalsup@aol.com (MitchAlsup1) writes:
>> On Mon, 29 Jul 2024 13:21:10 +0000, Anton Ertl wrote:
>>
>>> mitchalsup@aol.com (MitchAlsup1) writes:
>>>> On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
>>>>> Similarly, I expect that hardware that is designed for good TSO or
>>>>> sequential consistency performance will run faster on code written
>>>>> for this model than code written for weakly consistent hardware
>>>>> will run on that hardware.
>>>>
>>>> According to Lamport, only the ATOMIC stuff needs sequential
>>>> consistency. So it is completely possible to have a causally
>>>> consistent processor that switches to sequential consistency when
>>>> doing ATOMIC stuff, and gain performance when not doing ATOMIC
>>>> stuff, and gain programmability when doing atomic stuff.
>>>
>>> That's not what I have in mind. What I have in mind is hardware
>>> that, e.g., speculatively performs loads, predicting that no other
>>> core will store there with an earlier time stamp. But if another
>>> core actually performs such a store, the usual misprediction
>>> handling happens and the code starting from that mispredicted load
>>> is reexecuted. So as long as two cores do not access the same
>>> memory, they can run at full speed, and there is only slowdown if
>>> there is actual (not potential) communication between the cores.
>>
>> OK...
>>
>>> A problem with that approach is that this requires enough reorder
>>> buffering (or something equivalent; there may be something cheaper
>>> for this particular problem) to cover at least the shared-cache
>>> latency (usually L3, more with multiple sockets).
>>
>> The depth of the execution window may be smaller than the time it
>> takes to send the required information around and have this core
>> recognize that it is out-of-order wrt memory.
>
> So if we don't want to stall for memory accesses all the time, we
> need a bigger execution window, either by making the reorder buffer
> larger, or by using a different, cheaper mechanism.
>
> Concerning the cheaper mechanism, what I am thinking of is hardware
> checkpointing every, say, 200 cycles or so (subject to fine-tuning).
> The idea here is that communication between cores is very rare, so
[...]

Communication between cores _should_ be rare, as rare as can be. The
software should be designed to strive to reduce it, ideally all the
way down to the embarrassingly parallel level.
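A minimal sketch of what I mean, assuming C++17 (everything here,
padded_sum and all, is made up for illustration): each worker owns a
disjoint slice of the data and writes only to its own
cache-line-aligned accumulator, so the cores never have to talk to
each other until the final join:

// compile with: g++ -std=c++17 -O2 -pthread
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

// One accumulator per worker, padded to an assumed 64-byte cache
// line so neighbors never false-share.  C++17 is assumed for the
// over-aligned allocation inside std::vector.
struct alignas(64) padded_sum {
    unsigned long value = 0;
};

int main() {
    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;                        // fallback if unknown
    std::vector<unsigned long> data(1u << 20, 1);
    std::vector<padded_sum> sums(n);
    std::vector<std::thread> workers;
    std::size_t chunk = data.size() / n;
    for (unsigned i = 0; i < n; ++i) {
        std::size_t lo = i * chunk;
        std::size_t hi = (i + 1 == n) ? data.size() : lo + chunk;
        // Each thread reads and writes only its own slice: zero
        // cross-core communication while the workers run.
        workers.emplace_back([&, i, lo, hi] {
            for (std::size_t j = lo; j < hi; ++j)
                sums[i].value += data[j];
        });
    }
    for (auto& t : workers) t.join();         // the only synchronization
    unsigned long total = 0;
    for (auto const& s : sums) total += s.value;
    std::printf("total = %lu\n", total);
    return 0;
}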
However, misuse, say a bug in the core mapping wrt affinity masks and
such, can increase core-to-core communication/traffic... Not good.
The terrible case is when the traffic targets a remote core, one not
all that close (physically) to the current core, the core running the
code that wants the data; NUMA, so to speak... If a core/node _must_
communicate with another core/node, locality in the topology should
allow it to "converse" with a "local" core/node, as in "gain the
closest core/node to the source"!
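On the affinity side, a hedged sketch, assuming Linux with GNU
pthreads underneath std::thread; pthread_setaffinity_np is
nonportable, and which CPU numbers are actually "close" to each other
is platform-specific (libnuma or hwloc can query the real topology):

// compile with: g++ -std=c++17 -O2 -pthread
#ifndef _GNU_SOURCE
#define _GNU_SOURCE               // for CPU_ZERO/CPU_SET on glibc
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>
#include <thread>

// Pin a std::thread to one CPU so the scheduler cannot migrate it
// to a distant core behind our back.  (The thread may run briefly
// before the mask takes effect; fine for a sketch.)
static bool pin_to_cpu(std::thread& t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(t.native_handle(),
                                  sizeof set, &set) == 0;
}

int main() {
    // Assumed for the sketch: CPUs 0 and 1 sit on the same node.
    // Check the real topology before relying on that.
    std::thread a([] { /* work that stays local to its core */ });
    std::thread b([] { /* ditto */ });
    if (!pin_to_cpu(a, 0) || !pin_to_cpu(b, 1))
        std::fprintf(stderr, "affinity request failed\n");
    a.join();
    b.join();
    return 0;
}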