Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jseigh <jseigh_es00@xemaps.com>
Newsgroups: comp.arch
Subject: Re: arm ldxr/stxr vs cas
Date: Sat, 7 Sep 2024 11:02:56 -0400
Organization: A noiseless patient Spider
Lines: 107
Message-ID: <vbhpv0$1de2c$1@dont-email.me>
References: <vb4sit$2u7e2$1@dont-email.me>
 <07d60bd0a63b903820013ae60792fb7a@www.novabbs.org>
 <vbc4u3$aj5s$1@dont-email.me>
 <898cf44224e9790b74a0269eddff095a@www.novabbs.org>
 <vbd4k1$fpn6$1@dont-email.me> <vbd91c$g5j0$1@dont-email.me>
 <vbflk4$uc98$1@dont-email.me>
 <352e80684e75a2c0a298b84e4bf840c4@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 07 Sep 2024 17:02:57 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="1b23ec8a1ab3fdf33e8d4d34f58f5edc";
	logging-data="1488972"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18cFI5cX+kNOcUAwOVQXwy4"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Y/43ZSFBSQv1QJS5pgBTqGIo9pU=
Content-Language: en-US
In-Reply-To: <352e80684e75a2c0a298b84e4bf840c4@www.novabbs.org>
Bytes: 6299

On 9/6/24 15:57, MitchAlsup1 wrote:
> On Fri, 6 Sep 2024 19:36:36 +0000, Chris M. Thomasson wrote:
> 
>> On 9/5/2024 2:49 PM, jseigh wrote:
>>> On 9/5/24 16:34, Chris M. Thomasson wrote:
>>>> On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
>>>>> On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:
>>>>>
>>>>>> On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
>>>>>>> On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:
>>>>>>>
>>>>>>>> I read that arm added the cas instruction because they didn't think
>>>>>>>> ldxr/stxr would scale well.  It wasn't clear to me as to why that
>>>>>>>> would be the case.  I would think the memory lock mechanism would
>>>>>>>> have really low overhead vs cas having to do an interlocked load
>>>>>>>> and store.  Unless maybe the memory lock size might be large
>>>>>>>> enough to cause false sharing issues.  Any ideas?
>>>>>>>
>>>>>>> A pipeline lock between the LD part of a CAS and the ST part of a
>>>>>>> CAS is essentially FREE. But the same is true for LL followed by
>>>>>>> a later SC.
>>>>>>>
>>>>>>> Older machines with looser than sequential consistency memory models
>>>>>>> and running OoO have a myriad of problems with LL - SC. This is
>>>>>>> why My 66000 architecture switches from causal consistency to
>>>>>>> sequential consistency when it encounters <effectively> LL and
>>>>>>> switches back after seeing SC.
>>>>>>>
>>>>>>> No Fences necessary with causal consistency.
>>>>>>>
>>>>>>
>>>>>> I'm not sure I entirely follow.  I was thinking of the effects on
>>>>>> cache.  In theory the SC could fail without having to get the current
>>>>>> cache line exclusive or at all.  CAS has to get it exclusive before
>>>>>> it can definitively fail.
>>>>>
>>>>> A LL that takes a miss in L1 will perform a fetch with intent to
>>>>> modify, so will a CAS. However, LL is allowed to silently fail if
>>>>> exclusive is not returned from its fetch, deferring atomic failure to
>>>>> SC, while CAS will fail when exclusive fails to return.
>>>>
>>>> CAS should only fail when the comparands are not equal to each other.
>>>> Well, then there is the damn weak and strong CAS in C++11... ;^o
>>>>
>>>>
>>>>> LL-SC is designed so that when a failure happens, failure is
>>>>> visible at SC, not necessarily at LL.
>>>>>
>>>>> There are coherence protocols that allow the 2nd party to determine
>>>>> if it returns exclusive or not. The example I know is when the 2nd
>>>>> party is already performing an atomic event and it is better to fail
>>>>> the starting atomic event than to fail an ongoing atomic event.
>>>>> In My 66000 the determination is made under the notion of priority:
>>>>> the higher priority thread is allowed to continue while the lower
>>>>> priority thread takes the failure. The higher priority thread can
>>>>> be the requestor (1st party) or the holder of data (2nd party)
>>>>> while all interested observers (3rd parties) are in a position
>>>>> to see what transpired and act accordingly (causal).
>>>>>
>>>
>>> I'm not so sure about making the memory lock granularity the same as
>>> cache line size but that's an implementation decision I guess.
>>>
>>> I do like the idea of detecting potential contention at the
>>> start of LL/SC so you can do back off.  Right now the only way I
>>> can detect contention is after the fact when the CAS fails and
>>> I probably have the cache line exclusive at that point.  It's
>>> pretty problematic.
>>
>> I wonder if the ability to determine why a "weak" CAS failed might help.
>> They (weak) can fail for other reasons besides comparing comparands...
>> Well, would be a little too low level for a general atomic op in
>> C/C++11?
> 
> One can detect that the CAS-line is no longer exclusive as a form
> of weak failure, rather than waiting for the data to show up and
> fail strongly on the compare.

CAS places no requirement on how you come up with the expected
value, though typically the expected value is loaded from the CAS
target.  In fact you can use random values and it will still work,
it will just take a lot longer.  A typical optimization for pushing
onto a stack that you expect to be empty more often than not is to
initially load NULL as the expected value instead of loading from
the stack anchor, a load immediate vs a load from storage.
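
A minimal C11 sketch of that push optimization, assuming a plain
push-only linked stack.  The names (struct node, struct stack,
push_expect_empty) are made up for illustration and are not from
any particular library.  The first CAS guesses NULL; if the guess
is wrong, the failed CAS hands back the current anchor value and
the loop retries with that.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Illustrative node and anchor types (hypothetical names). */
struct node {
    struct node *next;
    int value;
};

struct stack {
    _Atomic(struct node *) head;   /* the stack anchor */
};

/* Push n, guessing that the stack is empty instead of loading the anchor. */
static void push_expect_empty(struct stack *s, struct node *n)
{
    assert(s != NULL && n != NULL);

    struct node *expected = NULL;  /* load immediate, no load from storage */
    do {
        /* Link to whatever we currently believe the top of stack is. */
        n->next = expected;
        /* On failure, expected is overwritten with the current anchor
         * value, so the next iteration uses the real top of stack. */
    } while (!atomic_compare_exchange_weak_explicit(
                 &s->head, &expected, n,
                 memory_order_release, memory_order_relaxed));
}
```

If the stack is usually empty, the common case is a single CAS with
no preceding load.  If it is not empty, the cost is one extra failed
CAS.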

x64 doesn't have an atomic 128 bit load but cmpxchg16b works
ok nonetheless.  The 2 64 bit loads just have to be effectively
atomic most of the time or you can use the updated result from
cmpxchg16b.

aarch64 didn't have an atomic 128 bit load, LDP, early on. You
had to do an LDXP/STXP to determine if the load was atomic.  In
practice if you're doing an LDXP/STXP loop anyway it doesn't
matter too much as long as you can handle the occasional
random 128 bit value.

I have had some success with after the fact contention back off.
I get 30% to 50% improvement in most cases.  The main challenge
is getting a 100+ nanosecond pause.  nanosleep() doesn't hack it.
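
A rough sketch of after the fact back off in C11, to make the idea
concrete.  Everything here is an assumption for illustration, not
the code behind the numbers above: the function names, the 100 ns
starting pause, the doubling and the 1600 ns cap are all made up,
and the pause is a busy-wait on POSIX clock_gettime(), which is
only one possible way to get a short delay.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

/* Busy-wait for at least ns nanoseconds (hypothetical helper).
 * The actual delay also includes the cost of the clock reads. */
static void spin_pause_ns(int64_t ns)
{
    assert(ns >= 0);
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((int64_t)(now.tv_sec - start.tv_sec) * 1000000000LL
             + (now.tv_nsec - start.tv_nsec) < ns);
}

/* Add delta to *target with a CAS loop; back off after each failure. */
static long add_with_backoff(_Atomic long *target, long delta)
{
    int64_t pause_ns = 100;        /* assumed starting pause */
    long expected = atomic_load_explicit(target, memory_order_relaxed);

    /* Strong CAS, so a failure means another thread changed the value. */
    while (!atomic_compare_exchange_strong_explicit(
               target, &expected, expected + delta,
               memory_order_acq_rel, memory_order_relaxed)) {
        spin_pause_ns(pause_ns);   /* contention seen after the fact */
        if (pause_ns < 1600)
            pause_ns *= 2;         /* assumed cap, purely illustrative */
        /* The value from the failed CAS is stale after the pause. */
        expected = atomic_load_explicit(target, memory_order_relaxed);
    }
    return expected + delta;
}
```

Whether a clock-polling spin like this is accurate or cheap enough
at the 100 ns scale depends on the platform, so treat the pause
helper as a placeholder.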

Joe Seigh