From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Memory ordering (was: Arguments for a sane ISA 6-years later)
Date: Mon, 29 Jul 2024 13:21:10 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Jul29.152110@mips.complang.tuwien.ac.at>
References: <2024Jul26.190007@mips.complang.tuwien.ac.at> <2032da2f7a4c7c8c50d28cacfa26c9c7@www.novabbs.org>

mitchalsup@aol.com (MitchAlsup1) writes:
>On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
>> Similarly, I expect that hardware that is designed for good TSO or
>> sequential consistency performance will run faster on code written for
>> this model than code written for weakly consistent hardware will run
>> on that hardware.
>
>According to Lamport; only the ATOMIC stuff needs sequential
>consistency.
>So, it is completely possible to have a causally consistent processor
>that switches to sequential consistency when doing ATOMIC stuff and gain
>performance when not doing ATOMIC stuff, and gain programmability when
>doing atomic stuff.

That's not what I have in mind.  What I have in mind is hardware that,
e.g., speculatively performs loads, predicting that no other core will
store there with an earlier timestamp.  But if another core actually
performs such a store, the usual misprediction handling happens, and
the code starting from that mispredicted load is reexecuted.
So as long as two cores do not access the same memory, they can run at
full speed; there is a slowdown only when actual (not merely
potential) communication between the cores happens.

A problem with that approach is that it requires enough reorder
buffering (or something equivalent; there may be something cheaper for
this particular problem) to cover at least the shared-cache latency
(usually that of the L3, more with multiple sockets).

>> That's because software written for weakly
>> consistent hardware often has to insert barriers or atomic operations
>> just in case, and these operations are slow on hardware optimized for
>> weak consistency.
>
>The operations themselves are not slow.

Citation needed.

>> By contrast, one can design hardware for strong ordering such that the
>> slowness occurs only in those cases when actual (not potential)
>> communication between the cores happens, i.e., much less frequently.
>
>How would you do this for a 256-way banked memory system of the
>NEC SX ?? I.E., the processor is not in charge of memory order--
>the memory system is.

Memory consistency is defined wrt what several processors do.  One
processor performs some reads and writes, another processor performs
some reads and writes, and memory consistency defines what one
processor sees of what the other does, and what ends up in main
memory.  But as long as the processors, their caches, and their
interconnect get the memory ordering right, the main memory is just
the backing store that eventually receives a consistent result of what
the other components did.  So it does not matter whether the main
memory has one bank or 256.

One interesting aspect is that supercomputers, I think, have generally
not yet been struck by the software crisis: supercomputer hardware is
still more expensive than supercomputer software.
So I expect that supercomputer hardware designers tend to throw
complexity over the wall to the software people, and in many cases
they do (the Cell Broadband Engine offers many examples of that).
However, "some ... Fujitsu [ARM] CPUs run with TSO at all times"; that
sounds like the A64FX, a processor designed for supercomputing.  So
apparently in this case the hardware designers accepted the hardware
and design complexity cost of TSO and gave software a better model,
even in hardware designed for a supercomputer.

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined
behavior.'
  Mitch Alsup,