From: BGB
Newsgroups: comp.arch
Subject: Re: Arguments for a sane ISA 6-years later
Date: Fri, 26 Jul 2024 18:01:43 -0500
References: <2024Jul26.190007@mips.complang.tuwien.ac.at> <2032da2f7a4c7c8c50d28cacfa26c9c7@www.novabbs.org>

On 7/26/2024 3:59 PM, MitchAlsup1 wrote:
> On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
>
>> "Chris M. Thomasson" writes:
>>> On 7/25/2024 1:09 PM, BGB wrote:
>>>> At least with a weak model, software knows that if it doesn't go
>>>> through the rituals, the memory will be stale.
>>
>> There is no guarantee of staleness, only a lack of stronger ordering
>> guarantees.
>>
>>> The weak model is ideal for me. I know how to program for it
>>
>> And the fact that this model is so hard to use that few others know
>> how to program for it make it ideal for you.
>>
>>> and it's more efficient
>>
>> That depends on the hardware.
>>
>> Yes, the Alpha 21164 with its imprecise exceptions was "more
>> efficient" than other hardware for a while, then the Pentium Pro came
>> along and gave us precise exceptions and more efficiency.
>> And eventually the Alpha people learned the trick, too, and 21264
>> provided precise exceptions (although they did not admit this) and
>> more efficiency.
>>
>> Similarly, I expect that hardware that is designed for good TSO or
>> sequential consistency performance will run faster on code written for
>> this model than code written for weakly consistent hardware will run
>> on that hardware.
>
> According to Lamport; only the ATOMIC stuff needs sequential
> consistency.
> So, it is completely possible to have a causally consistent processor
> that switches to sequential consistency when doing ATOMIC stuff and gain
> performance when not doing ATOMIC stuff, and gain programmability when
> doing atomic stuff.
>

Probably true.

The main thing that matters for consistency is things like mutex locks
and shared buffers. For most everything else, consistency can usually
be glossed over.

In a traditional weak model, one would flush the caches whenever
locking a mutex or similar: a flush before taking the lock makes sure
that everything written so far is visible to others, and a flush after
acquiring it makes sure the local view is up to date.

Here, one assumes that the only time memory is necessarily up to date
is just after acquiring a mutex lock. For good measure, one can also
flush when releasing the lock, so that any other thread which later
gains the lock will have an up-to-date view of anything that happened
between acquire and release.

If a person is clever, they might try to sidestep the need for cache
flushing here, but this may result in shared memory not being up to
date.

This works a little better if one assumes that any actively shared
buffers are essentially read-only during the time each thread is doing
its work, potentially followed by a consolidation phase in which all
the threads flush their caches so that their views of memory are
brought back into sync.
>>                    That's because software written for weakly
>> consistent hardware often has to insert barriers or atomic operations
>> just in case, and these operations are slow on hardware optimized for
>> weak consistency.
>
> The operations themselves are not slow. What is slow is delaying the
> pipeline until it catches up to the stronger memory model before
> proceeding.

How I attempted to do no-cache/volatile/atomic operations was roughly:

If the cache line seen is not marked as volatile:
  Flush it;
  Fetch the line from memory, marking it as volatile;
  Do the operation;
  Set up a mechanism to auto-flush the line.

If a volatile line is seen and we are not doing a volatile operation,
flush it.

Auto-flush: once we are done with the volatile memory operation, look
again at the cache line, see that it is volatile, and flush it.

It is vaguely similar for TLB misses, where a TLB miss will load
"whatever" from memory, but the line is flagged so that the cache will
auto-flush it at the nearest opportunity once the offending operation
completes (with the slight difference that TLB-missed lines cannot be
marked as Dirty by a Store operation).

>>
>> By contrast, one can design hardware for strong ordering such that the
>> slowness occurs only in those cases when actual (not potential)
>> communication between the cores happens, i.e., much less frequently.
>
> How would you do this for a 256-way banked memory system of the
> NEC SX ?? I.E., the processor is not in charge of memory order--
> the memory system is.
>

I have little idea, personally, how something like TSO could scale to
many-core systems, or to systems with non-trivial communication
latency (such as threads running across a LAN, or maybe the internet).

Meanwhile, weak consistency models are easier to scale up to high
latency.
Say that, as opposed to local cache flushing, taking the shared mutex
lock effectively involves sending all of the dirty pages back to a
server over a TCP socket or similar, followed by re-downloading any of
the shared pages afterwards (though perhaps with the server able to
signal which pages are still up to date from the client's POV,
avoiding the need to re-download them).

Granted, for high-latency contexts, message passing generally becomes
preferable to shared memory; but the definitions here can get fuzzy.

Like, depending on how one looks at it, multiple players connected to
a Minecraft server could be considered a use of a high-latency shared
memory system (with the Minecraft terrain being essentially a sort of
shared memory, albeit expressed via message passing over TCP/IP).

Though, for general use, it would make sense to limit the scope of
"shared memory" to something more closely resembling a traditional
"mmap()" style operation (with a mechanism to detect when pages are
dirty and to then re-synchronize them).

>>
>>> and sometimes use cases do not care if they encounter "stale" data.
>>
>> Great.  Unless these "sometimes" cases are more often than the cases
>> where you perform some atomic operation or barrier because of
>> potential, but not actual communication between cores, the weak model
>> is still slower than a well-implemented strong model.
>>
>> - anton