Article <vbimo3$1jbai$2@dont-email.me>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>
Newsgroups: comp.arch
Subject: Re: arm ldxr/stxr vs cas
Date: Sat, 7 Sep 2024 16:14:12 -0700
Organization: A noiseless patient Spider
Lines: 116
Message-ID: <vbimo3$1jbai$2@dont-email.me>
References: <vb4sit$2u7e2$1@dont-email.me>
 <07d60bd0a63b903820013ae60792fb7a@www.novabbs.org>
 <vbc4u3$aj5s$1@dont-email.me>
 <898cf44224e9790b74a0269eddff095a@www.novabbs.org>
 <vbd4k1$fpn6$1@dont-email.me> <vbd91c$g5j0$1@dont-email.me>
 <vbflk4$uc98$1@dont-email.me>
 <352e80684e75a2c0a298b84e4bf840c4@www.novabbs.org>
 <vbhpv0$1de2c$1@dont-email.me> <vbimfd$1jbai$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 08 Sep 2024 01:14:12 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="d44d7012c3ec9d16fb2fdce73058c980";
	logging-data="1682770"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+2xiqIu0GwHUAL3APdVXIjVRnGoEyWdvk="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:e2jrb+SyQDGP9S7A7f2d9NItPT0=
Content-Language: en-US
In-Reply-To: <vbimfd$1jbai$1@dont-email.me>
Bytes: 6976

On 9/7/2024 4:09 PM, Chris M. Thomasson wrote:
> On 9/7/2024 8:02 AM, jseigh wrote:
>> On 9/6/24 15:57, MitchAlsup1 wrote:
>>> On Fri, 6 Sep 2024 19:36:36 +0000, Chris M. Thomasson wrote:
>>>
>>>> On 9/5/2024 2:49 PM, jseigh wrote:
>>>>> On 9/5/24 16:34, Chris M. Thomasson wrote:
>>>>>> On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
>>>>>>> On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:
>>>>>>>
>>>>>>>> On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
>>>>>>>>> On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:
>>>>>>>>>
>>>>>>>>>> I read that arm added the cas instruction because they didn't think
>>>>>>>>>> ldxr/stxr would scale well.  It wasn't clear to me as to why that
>>>>>>>>>> would be the case.  I would think the memory lock mechanism would
>>>>>>>>>> have really low overhead vs cas having to do an interlocked load
>>>>>>>>>> and store.  Unless maybe the memory lock size might be large
>>>>>>>>>> enough to cause false sharing issues.  Any ideas?
>>>>>>>>>
>>>>>>>>> A pipeline lock between the LD part of a CAS and the ST part of a
>>>>>>>>> CAS is essentially FREE. But the same is true for LL followed by
>>>>>>>>> a later SC.
>>>>>>>>>
>>>>>>>>> Older machines with looser than sequential consistency memory models
>>>>>>>>> and running OoO have a myriad of problems with LL - SC. This is
>>>>>>>>> why My 66000 architecture switches from causal consistency to
>>>>>>>>> sequential consistency when it encounters <effectively> LL and
>>>>>>>>> switches back after seeing SC.
>>>>>>>>>
>>>>>>>>> No Fences necessary with causal consistency.
>>>>>>>>>
>>>>>>>>
>>>>>>>> I'm not sure I entirely follow.  I was thinking of the effects on
>>>>>>>> cache.  In theory the SC could fail without having to get the current
>>>>>>>> cache line exclusive or at all.  CAS has to get it exclusive before
>>>>>>>> it can definitively fail.
>>>>>>>
>>>>>>> A LL that takes a miss in L1 will perform a fetch with intent to modify,
>>>>>>> so will a CAS. However, LL is allowed to silently fail if exclusive is
>>>>>>> not returned from its fetch, deferring atomic failure to SC, while CAS
>>>>>>> will fail when exclusive fails to return.
>>>>>>
>>>>>> CAS should only fail when the comparands are not equal to each other.
>>>>>> Well, then there is the damn weak and strong CAS in C++11... ;^o
>>>>>>
>>>>>>
>>>>>>> LL-SC is designed so that
>>>>>>> when a failure happens, failure is visible at SC not necessarily at LL.
>>>>>>>
>>>>>>> There are coherence protocols that allow the 2nd party to determine
>>>>>>> if it returns exclusive or not. The example I know is when the 2nd
>>>>>>> party is already performing an atomic event and it is better to fail
>>>>>>> the starting atomic event than to fail an ongoing atomic event.
>>>>>>> In My 66000 the determination is made under the notion of priority:
>>>>>>> the higher priority thread is allowed to continue while the lower
>>>>>>> priority thread takes the failure. The higher priority thread can
>>>>>>> be the requestor (1st party) or the holder of data (2nd party)
>>>>>>> while all interested observers (3rd parties) are in a position
>>>>>>> to see what transpired and act accordingly (causal).
>>>>>>>
>>>>>
>>>>> I'm not so sure about making the memory lock granularity same as
>>>>> cache line size but that's an implementation decision I guess.
>>>>>
>>>>> I do like the idea of detecting potential contention at the
>>>>> start of LL/SC so you can do back off.  Right now the only way I
>>>>> can detect contention is after the fact when the CAS fails and
>>>>> I probably have the cache line exclusive at that point.  It's
>>>>> pretty problematic.
>>>>
>>>> I wonder if the ability to determine why a "weak" CAS failed might help.
>>>> They (weak) can fail for other reasons besides comparing comparands...
>>>> Well, would be a little too low level for a general atomic op in
>>>> C/C++11?
>>>
>>> One can detect that the CAS-line is no longer exclusive as a form
>>> of weak failure, rather than waiting for the data to show up and
>>> fail strongly on the compare.
>>
>> There is no requirement for CAS to calculate the expected value in
>> any way, though typically the expected value is loaded from the CAS
>> target.  In fact you can use random values and it will still work,
>> just take a lot longer.  A typical optimization for pushing onto
>> a stack that you expect to be empty more often than not is to
>> initially load NULL as expected value instead of loading from the
>> stack anchor, a load immediate vs load from storage.
>>
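A minimal C++ sketch of the push optimization described above, where the first CAS guesses that the stack is empty instead of loading the anchor. The names Node, Stack, and push_expect_empty are hypothetical and used only for illustration; the memory orderings are one reasonable choice, not the only one.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical node and stack types, for illustration only.
struct Node {
    int value;
    Node* next;
};

struct Stack {
    std::atomic<Node*> head{nullptr};
};

// Push that starts with expected = nullptr (a load immediate) rather than
// loading the stack anchor. If the guess is wrong, the failed CAS writes the
// current head into `expected`, and the loop retries with that value.
// Returns the number of CAS attempts made.
inline int push_expect_empty(Stack& s, Node* n) {
    assert(n != nullptr);
    Node* expected = nullptr;  // optimistic guess: stack is empty
    int attempts = 0;
    do {
        n->next = expected;    // link to whatever we currently believe is on top
        ++attempts;
    } while (!s.head.compare_exchange_weak(expected, n,
                                           std::memory_order_release,
                                           std::memory_order_relaxed));
    return attempts;
}
```

When the stack really is empty this saves the initial load; when it is not, the cost is one extra failed CAS, since the failure itself supplies the updated expected value.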
>> x64 doesn't have an atomic 128 bit load but cmpxchg16b works
>> ok nonetheless.  The 2 64 bit loads just have to be effectively
>> atomic most of the time or you can use the updated result from
>> cmpxchg16b.
>>
>> aarch64 didn't have atomic 128 bit load, LDP, early on. You
>> have to do a LDXP/STXP to determine if load was atomic.  In
>> practice if you're doing a LDXP/STXP loop anyway it doesn't
>> matter too much as long as you can handle the occasional
>> random 128 bit value.
>>
>> I have some success with after the fact contention back off.
>> I get 30% to 50% improvement in most cases.  The main challenge
>> is getting a 100+ nanosecond pause.  nanosleep() doesn't hack it.
> 
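A rough sketch of after-the-fact back-off, assuming a busy-wait on std::chrono::steady_clock stands in for the short pause that nanosleep() cannot deliver. The helper names spin_pause and add_with_backoff, the 100 ns default, and the reload after the pause are all assumptions for illustration, not the scheme jseigh actually measured.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Busy-wait for at least `ns` nanoseconds. Actual resolution depends on the
// platform's steady_clock; this is an assumption, not a tuned pause.
inline void spin_pause(std::chrono::nanoseconds ns) {
    const auto deadline = std::chrono::steady_clock::now() + ns;
    while (std::chrono::steady_clock::now() < deadline) {
        // spin; a CPU-specific pause/yield hint could go here
    }
}

// CAS loop that backs off only after a CAS has failed.
// Returns the number of failed attempts.
inline int add_with_backoff(
        std::atomic<std::uint64_t>& counter,
        std::uint64_t delta,
        std::chrono::nanoseconds pause = std::chrono::nanoseconds(100)) {
    std::uint64_t expected = counter.load(std::memory_order_relaxed);
    int failures = 0;
    while (!counter.compare_exchange_weak(expected, expected + delta,
                                          std::memory_order_acq_rel,
                                          std::memory_order_relaxed)) {
        ++failures;
        spin_pause(pause);  // contention was detected after the fact
        expected = counter.load(std::memory_order_relaxed);  // refresh after pausing
    }
    return failures;
}
```

A plain counter would normally use fetch_add; the CAS loop is here only to show where the back-off sits relative to the failure.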
> Good point. Humm... I guess it boils down to optimistic vs pessimistic 
> schemes... ?
> 

When I am using CAS I don't really expect it to fail willy nilly even if 
the comparands are still the same. Weak vs Strong. Still irks me a bit. ;^)
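
For reference, a small sketch of the difference in C++11 terms. compare_exchange_strong fails only when the comparands differ; compare_exchange_weak may also fail spuriously, so a caller who wants strong semantics has to loop and check whether `expected` actually changed. The function names try_claim and try_claim_weak are hypothetical.

```cpp
#include <atomic>

// Strong CAS: fails only if flag != 0.
inline bool try_claim(std::atomic<int>& flag) {
    int expected = 0;
    return flag.compare_exchange_strong(expected, 1);
}

// Weak CAS wrapped in a loop to get the same observable behavior.
// Returns true if flag went 0 -> 1, false if flag was genuinely non-zero.
inline bool try_claim_weak(std::atomic<int>& flag) {
    int expected = 0;
    while (!flag.compare_exchange_weak(expected, 1)) {
        if (expected != 0) {
            return false;  // real failure: the comparands differ
        }
        // expected is still 0, so the failure was spurious: retry
    }
    return true;
}
```

On an LL/SC machine the weak form maps to a single exclusive load/store attempt, which is why the standard lets it fail without the comparands having changed.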