Path: ...!npeer.as286.net!dummy01.as286.net!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>
Newsgroups: comp.arch
Subject: Re: arm ldxr/stxr vs cas
Date: Sat, 7 Sep 2024 16:09:34 -0700
Organization: A noiseless patient Spider
Lines: 111
Message-ID: <vbimfd$1jbai$1@dont-email.me>
References: <vb4sit$2u7e2$1@dont-email.me> <07d60bd0a63b903820013ae60792fb7a@www.novabbs.org> <vbc4u3$aj5s$1@dont-email.me> <898cf44224e9790b74a0269eddff095a@www.novabbs.org> <vbd4k1$fpn6$1@dont-email.me> <vbd91c$g5j0$1@dont-email.me> <vbflk4$uc98$1@dont-email.me> <352e80684e75a2c0a298b84e4bf840c4@www.novabbs.org> <vbhpv0$1de2c$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 08 Sep 2024 01:09:34 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="d44d7012c3ec9d16fb2fdce73058c980"; logging-data="1682770"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+pzWUh+aOAdZ7CciZIae/fIh6m1Dxu0mo="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:XGnp2espSEhR/VoDv+Fwtghtcok=
In-Reply-To: <vbhpv0$1de2c$1@dont-email.me>
Content-Language: en-US
Bytes: 6695

On 9/7/2024 8:02 AM, jseigh wrote:
> On 9/6/24 15:57, MitchAlsup1 wrote:
>> On Fri, 6 Sep 2024 19:36:36 +0000, Chris M. Thomasson wrote:
>>
>>> On 9/5/2024 2:49 PM, jseigh wrote:
>>>> On 9/5/24 16:34, Chris M. Thomasson wrote:
>>>>> On 9/5/2024 12:46 PM, MitchAlsup1 wrote:
>>>>>> On Thu, 5 Sep 2024 11:33:23 +0000, jseigh wrote:
>>>>>>
>>>>>>> On 9/4/2024 5:27 PM, MitchAlsup1 wrote:
>>>>>>>> On Mon, 2 Sep 2024 17:27:57 +0000, jseigh wrote:
>>>>>>>>
>>>>>>>>> I read that arm added the cas instruction because they didn't
>>>>>>>>> think ldxr/stxr would scale well. It wasn't clear to me as to
>>>>>>>>> why that would be the case. I would think the memory lock
>>>>>>>>> mechanism would have really low overhead vs cas having to do an
>>>>>>>>> interlocked load and store. Unless maybe the memory lock size
>>>>>>>>> might be large enough to cause false sharing issues. Any ideas?
>>>>>>>>
>>>>>>>> A pipeline lock between the LD part of a CAS and the ST part of a
>>>>>>>> CAS is essentially FREE. But the same is true for LL followed by
>>>>>>>> a later SC.
>>>>>>>>
>>>>>>>> Older machines with looser than sequential consistency memory
>>>>>>>> models and running OoO have a myriad of problems with LL - SC.
>>>>>>>> This is why My 66000 architecture switches from causal
>>>>>>>> consistency to sequential consistency when it encounters
>>>>>>>> <effectively> LL and switches back after seeing SC.
>>>>>>>>
>>>>>>>> No Fences necessary with causal consistency.
>>>>>>>>
>>>>>>>
>>>>>>> I'm not sure I entirely follow. I was thinking of the effects on
>>>>>>> cache. In theory the SC could fail without having to get the
>>>>>>> current cache line exclusive or at all. CAS has to get it
>>>>>>> exclusive before it can definitively fail.
>>>>>>
>>>>>> A LL that takes a miss in L1 will perform a fetch with intent to
>>>>>> modify, so will a CAS. However, LL is allowed to silently fail if
>>>>>> exclusive is not returned from its fetch, deferring atomic failure
>>>>>> to SC, while CAS will fail when exclusive fails to return.
>>>>>
>>>>> CAS should only fail when the comparands are not equal to each other.
>>>>> Well, then there is the damn weak and strong CAS in C++11... ;^o
>>>>>
>>>>>
>>>>>> LL-SC is designed so that when a failure happens, failure is
>>>>>> visible at SC not necessarily at LL.
>>>>>>
>>>>>> There are coherence protocols that allow the 2nd party to determine
>>>>>> if it returns exclusive or not. The example I know is when the 2nd
>>>>>> party is already performing an atomic event and it is better to fail
>>>>>> the starting atomic event than to fail an ongoing atomic event.
>>>>>> In My 66000 the determination is made under the notion of priority::
>>>>>> the higher priority thread is allowed to continue while the lower
>>>>>> priority thread takes the failure. The higher priority thread can
>>>>>> be the requestor (1st party) or the holder of data (2nd party)
>>>>>> while all interested observers (3rd parties) are in a position
>>>>>> to see what transpired and act accordingly (causal).
>>>>>>
>>>>
>>>> I'm not so sure about making the memory lock granularity same as
>>>> cache line size but that's an implementation decision I guess.
>>>>
>>>> I do like the idea of detecting potential contention at the
>>>> start of LL/SC so you can do back off. Right now the only way I
>>>> can detect contention is after the fact when the CAS fails and
>>>> I probably have the cache line exclusive at that point. It's
>>>> pretty problematic.
>>>
>>> I wonder if the ability to determine why a "weak" CAS failed might help.
>>> They (weak) can fail for other reasons besides comparing comparands...
>>> Well, would be a little too low level for a general atomic op in
>>> C/C++11?
>>
>> One can detect that the CAS-line is no longer exclusive as a form
>> of weak failure, rather than waiting for the data to show up and
>> fail strongly on the compare.
>
> There is no requirement for CAS to calculate the expected value in
> any way, though typically the expected value is loaded from the CAS
> target. In fact you can use random values and it will still work,
> just take a lot longer. A typical optimization for pushing onto
> a stack that you expect to be empty more often than not is to
> initially load NULL as expected value instead of loading from the
> stack anchor, a load immediate vs load from storage.
>
> x64 doesn't have an atomic 128 bit load but cmpxchg16b works
> ok nonetheless. The 2 64 bit loads just have to be effectively
> atomic most of the time or you can use the updated result from
> cmpxchg16b.
>
> aarch64 didn't have atomic 128 bit load, LDP, early on. You
> have to do a LDXP/STXP to determine if load was atomic. In
> practice if you're doing a LDXP/STXP loop anyway it doesn't
> matter too much as long as you can handle the occasional
> random 128 bit value.
>
> I have some success with after the fact contention back off.
> I get 30% to 50% improvement in most cases. The main challenge
> is getting a 100+ nanosecond pause. nanosleep() doesn't hack it.

Good point. Humm... I guess it boils down to optimistic vs pessimistic
schemes... ?
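
Fwiw, the NULL-as-expected push and the after-the-fact back off quoted above might be sketched roughly like this in C++11. This is only an illustration under assumptions, not anybody's actual code from this thread: node, stack_anchor, backoff and push_optimistic are made-up names, the delay values are arbitrary, and the busy-wait merely stands in for the 100+ nanosecond pause, since nanosleep() is reported as too coarse.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>

// Hypothetical node type and anchor for a Treiber-style lock-free stack.
struct node {
    node* next;
    int value;
};

std::atomic<node*> stack_anchor{nullptr};

// Hypothetical back off: busy-wait for roughly `ns` nanoseconds.
// Assumption: spinning on a clock is good enough for illustration.
inline void backoff(unsigned ns) {
    auto until = std::chrono::steady_clock::now()
               + std::chrono::nanoseconds(ns);
    while (std::chrono::steady_clock::now() < until) {
        // spin
    }
}

// Push that guesses the stack is empty instead of loading the anchor
// first (load immediate vs load from storage).
void push_optimistic(node* n) {
    assert(n != nullptr);
    node* expected = nullptr;   // optimistic guess: stack is empty
    n->next = expected;
    unsigned delay_ns = 100;    // arbitrary starting pause
    while (!stack_anchor.compare_exchange_weak(
               expected, n,
               std::memory_order_release,
               std::memory_order_relaxed)) {
        // On failure, `expected` holds the value the CAS observed,
        // so no separate reload of the anchor is needed.
        n->next = expected;
        backoff(delay_ns);      // contention detected after the fact
        if (delay_ns < 1600) {
            delay_ns *= 2;      // arbitrary cap on the back off
        }
    }
}
```

Note that compare_exchange_weak can fail spuriously, in which case this loop backs off even though nobody actually contended. Whether that is acceptable, or whether the strong variant is the better fit, depends on the target.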