From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Memory ordering
Date: Tue, 30 Jul 2024 09:51:46 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Jul30.115146@mips.complang.tuwien.ac.at>
References: <b5d4a172469485e9799de44f5f120c73@www.novabbs.org> <v7ubd4$2e8dr$1@dont-email.me> <v7uc71$2ec3f$1@dont-email.me> <2024Jul26.190007@mips.complang.tuwien.ac.at> <2032da2f7a4c7c8c50d28cacfa26c9c7@www.novabbs.org> <2024Jul29.152110@mips.complang.tuwien.ac.at> <f8869fa1aadb85896d237179d46b20f8@www.novabbs.org>

mitchalsup@aol.com (MitchAlsup1) writes:
>On Mon, 29 Jul 2024 13:21:10 +0000, Anton Ertl wrote:
>
>> mitchalsup@aol.com (MitchAlsup1) writes:
>>>On Fri, 26 Jul 2024 17:00:07 +0000, Anton Ertl wrote:
>>>> Similarly, I expect that hardware that is designed for good TSO or
>>>> sequential consistency performance will run faster on code written for
>>>> this model than code written for weakly consistent hardware will run
>>>> on that hardware.
>>>
>>>According to Lamport; only the ATOMIC stuff needs sequential
>>>consistency.
>>>So, it is completely possible to have a causally consistent processor
>>>that switches to sequential consistency when doing ATOMIC stuff and gain
>>>performance when not doing ATOMIC stuff, and gain programmability when
>>>doing atomic stuff.
>>
>> That's not what I have in mind.
>> What I have in mind is hardware that,
>> e.g., speculatively performs loads, predicting that no other core will
>> store there with an earlier time stamp. But if another core actually
>> performs such a store, the usual misprediction handling happens and
>> the code starting from that mispredicted load is reexecuted. So as
>> long as two cores do not access the same memory, they can run at full
>> speed, and there is only a slowdown if there is actual (not potential)
>> communication between the cores.
>
>OK...
>>
>> A problem with that approach is that this requires enough reorder
>> buffering (or something equivalent; there may be something cheaper for
>> this particular problem) to cover at least the shared-cache latency
>> (usually L3, more with multiple sockets).
>
>The depth of the execution window may be smaller than the time it takes
>to send the required information around and have this core recognize
>that it is out-of-order wrt memory.

So if we don't want to stall for memory accesses all the time, we need
a bigger execution window, either by making the reorder buffer larger,
or by using a different, cheaper mechanism.

Concerning the cheaper mechanism, what I am thinking of is hardware
checkpointing every, say, 200 cycles or so (subject to fine-tuning).
The idea here is that communication between cores is very rare, so
rolling back more cycles than the minimally necessary amount costs
little on average (except that it looks bad on cache ping-pong
microbenchmarks).

The cost of such a checkpoint is (at most) the number of architectural
registers, plus the aggregated stores between the checkpoint and the
next one. Once the global time reaches the timestamp of checkpoint N+1
of the core, checkpoint N of the core can be released (i.e., all its
instructions committed) and all its stores can be committed (and
checked against speculative loads in other cores).
If it turns out that an uncommitted load's result has been changed by a
store committed by another core, a rollback to the latest checkpoint
before the load happens, and the program is re-executed starting from
that checkpoint.

Daya et al. [daya+14] have already implemented sequential consistency
in their 36-core research chip, with similar ideas (that inspired my
statement above) and much more detail (which makes it hard to see the
grand scheme of things, IIRC).

@InProceedings{daya+14,
  author =    {Bhavya K. Daya and Chia-Hsin Owen Chen and Suvinay
               Subramanian and Woo-Cheol Kwon and Sunghyun Park and
               Tushar Krishna and Jim Holt and Anantha P. Chandrakasan
               and Li-Shiuan Peh},
  title =     {{SCORPIO}: A 36-Core Research-Chip Demonstrating Snoopy
               Coherence on a Scalable Mesh {NoC} with In-Network Ordering},
  crossref =  {isca14},
  OPTpages =  {},
  url =       {http://projects.csail.mit.edu/wiki/pub/LSPgroup/PublicationList/scorpio_isca2014.pdf},
  annote =    {The cores on the chip described in this paper access
               their shared memory in a sequentially consistent manner;
               what's more, the chip provides a significant speedup in
               comparison to the distributed directory and HyperTransport
               coherence protocols. The main idea is to deal with the
               ordering separately from the data, in a distributed way.
               The ordering messages are relatively small (one bit per
               core). For details see the paper.}
}

@Proceedings{isca14,
  title =     {$41^{\textit{st}}$ Annual International Symposium on
               Computer Architecture},
  booktitle = {$41^{\textit{st}}$ Annual International Symposium on
               Computer Architecture},
  year =      {2014},
  key =       {ISCA 2014},
}

>>>The operations themselves are not slow.
>>
>> Citation needed.
>
>A MEMBAR dropped into the pipeline, when nothing is speculative, takes
>no more time than an integer ADD. Only when there is speculation does
>it have to take time to relax the speculation.

Not sure what kind of speculation you mean here.
On in-order cores like the non-Fujitsu SPARCs from before about 2010,
memory barriers are expensive AFAIK, even though there is essentially
no branch speculation on in-order cores. Of course, if you mean
speculation about the order of loads and stores: yes, if you don't have
such speculation, the memory barriers are fast, but then loads are
extremely slow.

>> Memory consistency is defined wrt what several processors do. Some
>> processor performs some reads and writes and another performs some
>> reads and writes, and memory consistency defines what a processor sees
>> of what the other does, and what ends up in main memory. But as
>> long as the processors, their caches, and their interconnect get the
>> memory ordering right, the main memory is just the backing store that
>> eventually gets a consistent result of what the other components did.
>> So it does not matter whether the main memory has one bank or 256.
>
>NEC SX is a multi-processor vector machine with the property that
>addresses are spewed out as fast as AGEN can perform. These addresses
>are routed to banks based on bus-segment and can arrive OoO wrt
>how they were spewed out.
>
>So two processors accessing the same memory using vector LDs will
>see a single vector having multiple memory orderings. P[0]V[0] ordered
>before P[1]V[0] but P[1]V[1] ordered before P[0]V[1], ...

As long as no stores happen, who cares about the order of the loads?
When stores happen, the loads are ordered wrt these stores (with
stronger memory orderings giving more guarantees). So the number of
memory banks does not matter for implementing a strong ordering
efficiently.

The thinking about memory banks etc. comes when you approach the
problem from the other direction: you have some memory subsystem that
by itself gives you no consistency guarantees whatsoever, and then you
think about what's the minimum you can do to make it actually useful
for inter-core communication.
And then you write up a paper like

@TechReport{adve&gharachorloo95,
  author =      {Sarita V. Adve and Kourosh Gharachorloo},
  title =       {Shared Memory Consistency Models: A Tutorial},
  institution = {Digital Western Research Lab},
  year =        {1995},
  type =        {WRL Research Report},
  number =      {95/7},
  annote =      {Gives an overview of architectural features of
                 shared-memory computers, such as independent memory
                 banks and per-CPU caches, and how they make the (for
                 programmers) most natural consistency model hard to
                 implement, giving examples of programs that can fail
                 with weaker consistency models. It then discusses
                 several categories of weaker consistency models and
                 actual consistency models in these categories, and
                 which ``safety nets'' (e.g., memory barrier
                 instructions) programmers need to use to work around
                 the deficiencies of these models. While the authors
                 recognize that programmers find it difficult to use
                 these safety nets correctly and efficiently, the paper
                 still advocates weaker consistency models, claiming
                 that sequential consistency is too inefficient, by
                 outlining an inefficient implementation (which is of
                 course no proof that no efficient implementation
                 exists). Still, the paper is a good introduction to
                 the issues involved.}
}

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>