Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Arguments for a sane ISA 6-years later
Date: Fri, 26 Jul 2024 18:01:43 -0500
Organization: A noiseless patient Spider
Lines: 152
Message-ID: <v819ss$31ob5$1@dont-email.me>
References: <b5d4a172469485e9799de44f5f120c73@www.novabbs.org>
 <v7ubd4$2e8dr$1@dont-email.me> <v7uc71$2ec3f$1@dont-email.me>
 <2024Jul26.190007@mips.complang.tuwien.ac.at>
 <2032da2f7a4c7c8c50d28cacfa26c9c7@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 27 Jul 2024 01:01:49 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="9eaca427c10feecee898430925c61879";
	logging-data="3203429"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/vD10Vlo+q+eGnlbmBdRjq6Et/MZ6orOY="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:mMQ5sdPFJtajx5KnlwTIyUAKRW0=
In-Reply-To: <2032da2f7a4c7c8c50d28cacfa26c9c7@www.novabbs.org>
Content-Language: en-US
Bytes: 7423

On 7/26/2024 3:59 PM, MitchAlsup1 wrote:
> On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
> 
>> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>>> On 7/25/2024 1:09 PM, BGB wrote:
>>>> At least with a weak model, software knows that if it doesn't go 
>>>> through
>>>> the rituals, the memory will be stale.
>>
>> There is no guarantee of staleness, only a lack of stronger ordering
>> guarantees.
>>
>>> The weak model is ideal for me. I know how to program for it
>>
>> And the fact that this model is so hard to use that few others know
>> how to program for it makes it ideal for you.
>>
>>> and it's more efficient
>>
>> That depends on the hardware.
>>
>> Yes, the Alpha 21164 with its imprecise exceptions was "more
>> efficient" than other hardware for a while, then the Pentium Pro came
>> along and gave us precise exceptions and more efficiency.  And
>> eventually the Alpha people learned the trick, too, and 21264 provided
>> precise exceptions (although they did not admit this) and more
>> efficiency.
>>
>> Similarly, I expect that hardware that is designed for good TSO or
>> sequential consistency performance will run faster on code written for
>> this model than code written for weakly consistent hardware will run
>> on that hardware.
> 
> According to Lamport, only the ATOMIC stuff needs sequential
> consistency.
> So, it is completely possible to have a causally consistent processor
> that switches to sequential consistency when doing ATOMIC stuff and gain
> performance when not doing ATOMIC stuff, and gain programmability when
> doing atomic stuff.
> 

Probably true.

The main thing that matters for consistency is constructs like mutex 
locks and shared buffers.

For most everything else, consistency can often be glossed over.


In a traditional weak model, one flushes the caches whenever locking a 
mutex or similar (to make sure that everything is written out before 
locking, and that one's view of memory is up to date after locking).

Here, one assumes that the only time memory is necessarily up to date 
is after acquiring a mutex lock (though, for good measure, one can also 
flush when releasing the lock, so that any other thread that later 
acquires it has an up-to-date view of anything that happened between 
acquire and release).


If a person is clever, they might try to sidestep the need for the 
cache flushing here, but this may leave any shared memory out of date.

This works a little better if one assumes that any actively shared 
buffers are essentially read-only while each thread is doing its work, 
potentially followed by a consolidation phase in which all the threads 
flush their caches and their views of memory are brought back in sync.



>>                    That's because software written for weakly
>> consistent hardware often has to insert barriers or atomic operations
>> just in case, and these operations are slow on hardware optimized for
>> weak consistency.
> 
> The operations themselves are not slow. What is slow is delaying the
> pipeline until it catches up to the stronger memory model before
> proceeding.

How I attempted to do no-cache/volatile/atomic operations was roughly:
   If the cache line seen is not marked as volatile:
     Flush it;
   Fetch the line from memory, set it marked as volatile;
   Do the operation;
   Set up a mechanism to auto-flush the line.

If a volatile line is seen and we are not doing a volatile operation, 
flush it.

Auto-flush: Once we are not doing the volatile memory operation, look 
again at cache line, see that it is volatile, and flush it.

It is vaguely similar for TLB misses: a TLB miss will load "whatever" 
from memory, but the line is flagged so that the cache will auto-flush 
it at the nearest opportunity once the offending operation completes 
(with the slight difference that TLB-missed lines cannot be marked as 
Dirty by a Store operation).


>>
>> By contrast, one can design hardware for strong ordering such that the
>> slowness occurs only in those cases when actual (not potential)
>> communication between the cores happens, i.e., much less frequently.
> 
> How would you do this for a 256-way banked memory system of the
> NEC SX ?? I.E., the processor is not in charge of memory order--
> the memory system is.
> 

Personally, I have little idea how something like TSO could scale to 
manycore systems, or to systems with non-trivial communication latency 
(such as threads running across a LAN, or maybe the internet).


Meanwhile, weak consistency models are easier to scale up to high latency.

Say, instead of a local cache flush, acquiring the shared mutex 
effectively involves sending all of the dirty pages back to a server 
over a TCP socket or something, followed by re-downloading any of the 
shared pages afterwards (though perhaps with the server able to signal 
which pages are still up to date from the client's POV, avoiding the 
need to re-download them).


Granted, in high-latency contexts, message passing generally becomes 
preferable to shared memory; but the definitions here can get fuzzy.

Like, depending on how one looks at it, multiple players connected to a 
Minecraft server could be considered as a usage of a high-latency shared 
memory system (with the Minecraft terrain being essentially a sort of 
shared memory; albeit expressed via message passing over TCP/IP).


Though, for general use, it would make sense to limit the scope of 
"shared memory" to something more resembling a traditional "mmap()" 
style operation (with a mechanism to detect when pages are dirty and to 
then re-synchronize them).


>>
>>> and sometimes use cases do not care if they encounter "stale" data.
>>
>> Great.  Unless these "sometimes" cases are more often than the cases
>> where you perform some atomic operation or barrier because of
>> potential, but not actual communication between cores, the weak model
>> is still slower than a well-implemented strong model.
>>
>> - anton