From: jseigh
Newsgroups: comp.arch
Subject: Re: arm ldxr/stxr vs cas
Date: Sat, 7 Sep 2024 11:02:56 -0400

On 9/6/24 15:57, MitchAlsup1 wrote:
> On Fri, 6 Sep 2024 19:36:36 +0000, Chris M. Thomasson wrote:
>
>> On 9/5/2024 2:49 PM, jseigh wrote:
>>> On 9/5/24 16:34, Chris M. Thomasson wrote:
>>>> On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
>>>>> On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:
>>>>>
>>>>>> On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
>>>>>>> On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:
>>>>>>>
>>>>>>>> I read that arm added the cas instruction because they didn't think
>>>>>>>> ldxr/stxr would scale well.  It wasn't clear to me as to why that
>>>>>>>> would be the case.  I would think the memory lock mechanism would
>>>>>>>> have really low overhead vs cas having to do an interlocked load
>>>>>>>> and store.  Unless maybe the memory lock size might be large
>>>>>>>> enough to cause false sharing issues.  Any ideas?
>>>>>>>
>>>>>>> A pipeline lock between the LD part of a CAS and the ST part of a
>>>>>>> CAS is essentially FREE. But the same is true for LL followed by
>>>>>>> a later SC.
>>>>>>>
>>>>>>> Older machines with looser than sequential consistency memory models
>>>>>>> and running OoO have a myriad of problems with LL - SC. This is
>>>>>>> why My 66000 architecture switches from causal consistency to
>>>>>>> sequential consistency when it encounters LL and
>>>>>>> switches back after seeing SC.
>>>>>>>
>>>>>>> No Fences necessary with causal consistency.
>>>>>>>
>>>>>>
>>>>>> I'm not sure I entirely follow.  I was thinking of the effects on
>>>>>> cache.  In theory the SC could fail without having to get the current
>>>>>> cache line exclusive or at all.  CAS has to get it exclusive before
>>>>>> it can definitively fail.
>>>>>
>>>>> A LL that takes a miss in L1 will perform a fetch with intent to
>>>>> modify, so will a CAS. However, LL is allowed to silently fail if
>>>>> exclusive is not returned from its fetch, deferring atomic failure
>>>>> to SC, while CAS will fail when exclusive fails to return.
>>>>
>>>> CAS should only fail when the comparands are not equal to each other.
>>>> Well, then there is the damn weak and strong CAS in C++11... ;^o
>>>>
>>>>
>>>>> LL-SC is designed so that
>>>>> when a failure happens, failure is visible at SC not necessarily at
>>>>> LL.
>>>>>
>>>>> There are coherence protocols that allow the 2nd party to determine
>>>>> if it returns exclusive or not. The example I know is when the 2nd
>>>>> party is already performing an atomic event and it is better to fail
>>>>> the starting atomic event than to fail an ongoing atomic event.
>>>>> In My 66000 the determination is made under the notion of priority:
>>>>> the higher priority thread is allowed to continue while the lower
>>>>> priority thread takes the failure. The higher priority thread can
>>>>> be the requestor (1st party) or the holder of data (2nd party)
>>>>> while all interested observers (3rd parties) are in a position
>>>>> to see what transpired and act accordingly (causal).
>>>>>
>>>
>>> I'm not so sure about making the memory lock granularity same as
>>> cache line size but that's an implementation decision I guess.
>>>
>>> I do like the idea of detecting potential contention at the
>>> start of LL/SC so you can do back off.  Right now the only way I
>>> can detect contention is after the fact when the CAS fails and
>>> I probably have the cache line exclusive at that point.  It's
>>> pretty problematic.
>>
>> I wonder if the ability to determine why a "weak" CAS failed might help.
>> They (weak) can fail for other reasons besides comparing comparands...
>> Well, would be a little too low level for a general atomic op in
>> C/C++11?
>
> One can detect that the CAS-line is no longer exclusive as a form
> of weak failure, rather than waiting for the data to show up and
> fail strongly on the compare.

There is no requirement for CAS to calculate the expected value in
any way, though typically the expected value is loaded from the CAS
target.  In fact you can use random values and it will still work,
just take a lot longer.  A typical optimization for pushing onto a
stack that you expect to be empty more often than not is to initially
load NULL as the expected value instead of loading from the stack
anchor, a load immediate vs. a load from storage.

x64 doesn't have an atomic 128-bit load but cmpxchg16b works ok
nonetheless.  The two 64-bit loads just have to be effectively atomic
most of the time, or you can use the updated result from cmpxchg16b.

aarch64 didn't have an atomic 128-bit load, LDP, early on.  You had
to do a LDXP/STXP to determine if the load was atomic.  In practice,
if you're doing a LDXP/STXP loop anyway, it doesn't matter too much
as long as you can handle the occasional random 128-bit value.

I have had some success with after-the-fact contention back off.  I
get 30% to 50% improvement in most cases.  The main challenge is
getting a 100+ nanosecond pause.  nanosleep() doesn't hack it.

Joe Seigh
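
For anyone who wants to see the guess-NULL push spelled out, here is a
rough sketch in C++11 atomics.  It is only an illustration under my own
assumptions: Node, stack_top and push_guess_empty are made-up names, not
code from any of the posts above.  It relies on the fact that a failed
compare_exchange_weak writes the current value of the target back into
the expected argument, so a wrong guess costs one extra trip around the
loop rather than a separate load.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical node type and stack anchor, for illustration only.
struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> stack_top{nullptr};

// Push that guesses the stack is empty instead of loading the anchor
// first.  Returns the number of CAS attempts it took.
int push_guess_empty(Node* n) {
    assert(n != nullptr);
    Node* expected = nullptr;   // the guess: an immediate, not a load
    int attempts = 0;
    for (;;) {
        n->next = expected;     // link to whatever we currently believe is on top
        ++attempts;
        // On failure (wrong guess, contention, or a spurious weak
        // failure) expected is overwritten with the value CAS observed,
        // so the next attempt retries with a real value.
        if (stack_top.compare_exchange_weak(expected, n,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
            return attempts;
        }
    }
}
```

If the stack really is empty most of the time, the first attempt
usually succeeds without ever reading the anchor.  The attempt count is
returned only so a caller could use it as an after-the-fact contention
signal for back off; how long to pause is left open here.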