Article <v88rcp$kch7$1@dont-email.me>

Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Arguments for a sane ISA 6-years later
Date: Mon, 29 Jul 2024 14:43:19 -0500
Organization: A noiseless patient Spider
Lines: 299
Message-ID: <v88rcp$kch7$1@dont-email.me>
References: <b5d4a172469485e9799de44f5f120c73@www.novabbs.org>
 <v7ubd4$2e8dr$1@dont-email.me> <v7uc71$2ec3f$1@dont-email.me>
 <2024Jul26.190007@mips.complang.tuwien.ac.at> <v872h5$alfu$2@dont-email.me>
 <v87g4i$cvih$1@dont-email.me> <v88oid$jkah$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 29 Jul 2024 21:43:22 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="c8aa09e2d45f8fb38190453b39c47ea3";
	logging-data="668199"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+q49bAB89qbwhKKM9hW5bm3Qw1RymktSI="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:FUrAlDeBZiRIe5hLkU5OdSdY3mw=
In-Reply-To: <v88oid$jkah$1@dont-email.me>
Content-Language: en-US
Bytes: 13284

On 7/29/2024 1:55 PM, Chris M. Thomasson wrote:
> On 7/29/2024 12:25 AM, BGB wrote:
>> On 7/28/2024 10:32 PM, Chris M. Thomasson wrote:
>>> On 7/26/2024 10:00 AM, Anton Ertl wrote:
>>>> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>>>>> On 7/25/2024 1:09 PM, BGB wrote:
>>>>>> At least with a weak model, software knows that if it doesn't go 
>>>>>> through
>>>>>> the rituals, the memory will be stale.
>>>>
>>>> There is no guarantee of staleness, only a lack of stronger ordering
>>>> guarantees.
>>>>
>>>>> The weak model is ideal for me. I know how to program for it
>>>>
>>>> And the fact that this model is so hard to use that few others know
>>>> how to program for it makes it ideal for you.
>>>>
>>>>> and it's more efficient
>>>>
>>>> That depends on the hardware.
>>>>
>>>> Yes, the Alpha 21164 with its imprecise exceptions was "more
>>>> efficient" than other hardware for a while, then the Pentium Pro came
>>>> along and gave us precise exceptions and more efficiency.  And
>>>> eventually the Alpha people learned the trick, too, and 21264 provided
>>>> precise exceptions (although they did not admit this) and more
>>>> efficiency.
>>>>
>>>> Similarly, I expect that hardware that is designed for good TSO or
>>>> sequential consistency performance will run faster on code written for
>>>> this model than code written for weakly consistent hardware will run
>>>> on that hardware.  That's because software written for weakly
>>>> consistent hardware often has to insert barriers or atomic operations
>>>> just in case, and these operations are slow on hardware optimized for
>>>> weak consistency.
>>>>
>>>> By contrast, one can design hardware for strong ordering such that the
>>>> slowness occurs only in those cases when actual (not potential)
>>>> communication between the cores happens, i.e., much less frequently.
>>>>
>>>>> and sometimes use cases do not care if they encounter "stale" data.
>>>>
>>>> Great.  Unless these "sometimes" cases are more often than the cases
>>>> where you perform some atomic operation or barrier because of
>>>> potential, but not actual communication between cores, the weak model
>>>> is still slower than a well-implemented strong model.
>>>
>>> A strong model? You mean I don't have to use any memory barriers at 
>>> all? Tell that to SPARC in RMO mode... How strong? Even the x86 
>>> requires a membar when a store followed by a load to another location 
>>> shall be respected wrt order. Store-Load. #StoreLoad over on SPARC. ;^)
>>>
>>> If you can force everything to be #StoreLoad (*) and make it faster 
>>> than a handcrafted algo on a very weak memory system, well, hats off! 
>>> I thought it was easier for a HW guy to implement weak consistency? 
>>> At the cost of the increased complexity wrt programming the sucker! ;^)
>>>
>>
>> Programming for a weak model isn't that hard...
>>
>> Well, unless the program is built around a "naive lock free" strategy 
>> (where the threads manipulate members in a data-structure or similar 
>> and assume that the other threads will see the updates in a 
>> more-or-less consistent way).
> 
> lock/wait-free algorithms are very nice. Yes they can be fairly hard, 
> but can be done for sure; stable and working in 100% correct order. The 
> good ones are hard to beat using all locking logic. Try to beat RCU 
> using a read write lock? I have some interesting algorithms that work 
> like a charm.
> 

The issue is that if one takes some kinds of naive lock-free algorithms 
(say, written for x86 or similar), and throws them unchanged onto something 
running a weak model, they will not work correctly.


Previously, this could be made to work using "knocking" by adding extra 
memory loads to the mix.

At present (with the associative "VCA cache"), one would also need to 
use "INVDC" instructions to flush cache lines.
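
As a rough sketch of what a non-naive version looks like on a conventional 
weakly ordered (but cache-coherent) target, here is the usual "write the 
data, then set a flag" hand-off using C11 atomics. The names (payload, 
ready, publish, try_consume) are made up for illustration. It assumes the 
hardware keeps caches coherent; on a core with non-coherent caches, explicit 
line flushes would still be needed on top of this.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical shared state; names are illustrative only. */
static int payload;        /* ordinary data being handed over        */
static atomic_int ready;   /* static storage, so it starts out as 0  */

/* Producer: write the data, then publish the flag with release
   ordering so the payload store cannot be reordered after it. */
void publish(int value)
{
    payload = value;
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Consumer: acquire-load the flag, and only read the payload once
   the flag has been observed as set. */
bool try_consume(int *out)
{
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_acquire))
        return false;
    *out = payload;
    return true;
}
```

The naive version would use a plain int for the flag and no ordering at 
all, which tends to appear to work on TSO hardware and then breaks once 
stores can become visible out of order.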


> 
>> Though, one does have the issue that one can't just use cheap spinlocks.
> 
> One note... Spinlocks work in a very weak memory model for sure. You 
> just need the right memory barrier logic... For instance, SPARC in RMO 
> mode wrt locking a spinlock and/or mutex requires a #LoadStore | 
> #LoadLoad membar _after_ the atomic logic that actually locks it occurs. 
> It also requires a release membar #LoadStore | #StoreStore _before_ the 
> atomic logic that unlocks it takes place. Take note that #StoreLoad is 
> _not_ required for a spinlock or a mutex in this context...
> 
> However... There is "special" mutex logic that actually requires a 
> #StoreLoad! Peterson's algorithm for example. Iirc, it needs a 
> #StoreLoad because it depends on a store followed by a load to another 
> location to hold true. This is a bit different than other locking 
> algorithms...
> 
> Then there are more "exotic" methods such as so-called asymmetric 
> mutexes. They can have fast paths and slow paths, so to speak. It's 
> almost getting into the realm of RCU here... A fast path can be memory 
> barrier free. The slow path can make things consistent with the use of 
> so called "remote" memory barriers. It's funny that Windows seems to 
> have one:
> 
> https://learn.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-flushprocesswritebuffers
> 
> ;^)
> 
> The slow path is meant to not be frequently used, hence the term 
> asymmetric. On par with read/write logic... :^)
> 
> Should have some more time to respond to the rest of your post tonight 
> or tomorrow. I am a bit busy right now.
> 


I could have been more specific:
The issue is not that one can't make spinlocks work;
But, rather, that (unlike on x86), a spinlock is no longer sufficient by 
itself to give consistent access to a memory object.

Like, if you expect to just:
   Lock a spinlock;
   Update a sensitive memory object;
   Unlock the spinlock.

This isn't going to work correctly.

To deal with the general case (where memory is just updated as normal), 
one needs, say:
   Flush the cache;
   Lock the spinlock;
   Do whatever;
   Flush cache again;
   Release the spinlock.

But, if one is going to do this, one may as well just do it via a system 
call, which has more resources to both flush the cache effectively and 
do the synchronous memory accesses needed for the lock.
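
The flush / lock / update / flush / release sequence above could be 
sketched roughly as follows. Note that flush_dcache() is a HYPOTHETICAL 
stand-in (here it only counts calls so the example runs anywhere); a real 
version would be whatever the platform provides for flushing cache lines. 
The lock itself is a plain C11 atomic_flag, and locked_update is likewise 
just an illustrative name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* HYPOTHETICAL placeholder for a platform cache flush.
   It only counts calls; it does not touch any real cache. */
static int flush_count;
static void flush_dcache(void) { flush_count++; }

static atomic_flag obj_lock = ATOMIC_FLAG_INIT;

/* Update a shared object following the sequence:
   flush, lock, do whatever, flush again, release. */
void locked_update(int *obj, int delta)
{
    assert(obj != NULL);

    flush_dcache();                    /* drop possibly stale lines   */
    while (atomic_flag_test_and_set_explicit(&obj_lock,
                                             memory_order_acquire))
        ;                              /* spin until the flag is ours */

    *obj += delta;                     /* "do whatever"               */

    flush_dcache();                    /* push the update out         */
    atomic_flag_clear_explicit(&obj_lock, memory_order_release);
}
```

Every update pays for two flushes, which is part of why pushing the whole 
thing into a system call starts to look attractive.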



But, yeah, if a person knows what they are doing, one can use spinlocks 
and lock free algorithms with a weak model.

Just not using the "naive" strategies that depend on TSO or similar 
(AKA: "just do it and assume that it works").


Actually, a "naive spinlock" might not even use something like 
InterlockedExchange or similar, but say:
   volatile int *lock;
   ...
   v=*lock;
   while(v)        //spin while someone else holds it
      v=*lock;
   *lock=MAGIC;    //plain store, no atomic exchange

Then, it doesn't work, but people might be like "but it worked on x86...".

But, alas...
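
For contrast, a minimal sketch of the same lock done with an atomic 
exchange, using C11 atomics rather than InterlockedExchange. The function 
names and the MAGIC value are assumptions for illustration. The acquire 
ordering sits on the operation that takes the lock and the release 
ordering on the store that drops it. This assumes a cache-coherent 
machine; it does not replace explicit cache flushes on a core that needs 
them.

```c
#include <assert.h>
#include <stdatomic.h>

#define MAGIC 0x1234   /* arbitrary nonzero "held" marker (assumed value) */

/* Try once: atomically swap in MAGIC. If the old value was 0,
   the lock was free and is now ours. Returns 1 on success. */
int spin_trylock(atomic_int *lock)
{
    return atomic_exchange_explicit(lock, MAGIC,
                                    memory_order_acquire) == 0;
}

/* Spin until the exchange succeeds. */
void spin_lock(atomic_int *lock)
{
    while (!spin_trylock(lock))
        ;
}

/* Release: the lock must currently be held; store 0 with release
   ordering so earlier writes are ordered before the unlock. */
void spin_unlock(atomic_int *lock)
{
    assert(atomic_load_explicit(lock, memory_order_relaxed) == MAGIC);
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

Unlike the load-then-store loop, the test and the set here are a single 
atomic step, so two threads cannot both see the lock as free and both 
claim it.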



Actually, a vaguely similar sort of issue came up when porting ROTT to 
Windows and BJX2, as it would occasionally deadlock. There were cases 
where it was going into loops over global variables and just sort of 
expecting the variables to be asynchronously updated by interrupt 
handlers (which wasn't a thing in these ports, as everything here is 
========== REMAINDER OF ARTICLE TRUNCATED ==========