From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Memory ordering
Date: Sat, 16 Nov 2024 07:46:17 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Nov16.084617@mips.complang.tuwien.ac.at>
References: <vfono1$14l9r$1@dont-email.me> <vgm4vj$3d2as$1@dont-email.me> <vgm5cb$3d2as$3@dont-email.me> <YfxXO.384093$EEm7.56154@fx16.iad> <vh4530$2mar5$1@dont-email.me> <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com> <vh5t5b$312cl$2@dont-email.me> <5yqdnU9eL_Y_GKv6nZ2dnZfqn_GdnZ2d@supernews.com> <2024Nov15.082512@mips.complang.tuwien.ac.at> <vh7rlr$3fu9i$1@dont-email.me> <2024Nov15.182737@mips.complang.tuwien.ac.at> <vh8cbo$3j8c5$1@dont-email.me>

BGB <cr88192@gmail.com> writes:
>The tradeoff is more about implementation cost, performance, etc.

Yes. And the "etc." includes "ease of programming".

>Weak model:
>  Cheaper (and simpler) to implement;

Yes.

>  Performs better when there is no need to synchronize memory;

Not in general. For a cheap multiprocessor implementation, yes. A sophisticated implementation of sequential consistency can just storm ahead in that case and achieve the same performance; it just has to keep checkpoints around in case there is a need to synchronize memory.

>  Performs worse when there is need to synchronize memory;

With a cheap multiprocessor implementation, yes. In general, no: any sequentially consistent implementation is also an implementation of every weaker memory model, and the memory barriers become nops in that kind of implementation. OK, nops still have a cost, but it's very close to 0 on a modern CPU.

Another potential performance disadvantage of sequential consistency, even with a sophisticated implementation: if you have an algorithm that actually works correctly even when it gets stale data from a load (with some limits on the staleness), the sophisticated SC implementation will incur the latency of making the load non-stale, while that latency will not occur, or will be smaller, in a similarly sophisticated implementation of an appropriate weak consistency model. However, given that access to actually-shared memory is slow even on weakly consistent hardware, software usually takes measures to avoid having many such accesses, so that cost will usually be minuscule.

What you missed: the big cost of weak memory models and cheap hardware implementations of them is in the software:

* For correctness, the safe way is to insert a memory barrier between
  any two memory operations.

* For performance (on cheap implementations of weak memory models) you
  want to execute as few memory barriers as possible.

* You cannot use testing to find out whether you have enough (and the
  right) memory barriers.

That's not only because the involved threads may not be in the right state during testing to uncover the incorrectness, but also because the hardware used for testing may actually have stronger consistency than the memory model, so some kinds of bugs will never show up in testing on that hardware, even when the threads reach the right state. And testing is still the go-to solution for software people for finding errors (nowadays even glorified by continuous integration and modern fuzz-testing approaches).
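As a concrete illustration of the hardware effect, consider the classic message-passing litmus test, sketched here in C11 atomics (a minimal example, not from any particular program; the names and values are arbitrary, and the deliberately relaxed orderings are the bug):

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int data, flag;

void *producer(void *arg)
{
  atomic_store_explicit(&data, 42, memory_order_relaxed);
  /* BUG: should be memory_order_release, to order the data store
     before the flag store */
  atomic_store_explicit(&flag, 1, memory_order_relaxed);
  return NULL;
}

void *consumer(void *arg)
{
  /* BUG: should be memory_order_acquire, to order the data load
     after the flag load */
  while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
    ;                                /* spin until the flag is set */
  if (atomic_load_explicit(&data, memory_order_relaxed) != 42)
    printf("saw stale data\n");
  return NULL;
}

int main(void)
{
  for (long i = 0; i < 1000000; i++) {   /* repeated trials */
    pthread_t p, c;
    atomic_store(&data, 0);
    atomic_store(&flag, 0);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
  }
  return 0;
}

Under the C memory model the consumer may see flag==1 and still read 0 from data, and ARM or POWER hardware can actually deliver that outcome. On AMD64, whose TSO hardware model reorders neither the two stores nor the two loads, "saw stale data" will (compiler reordering aside) never be printed, no matter how long you test. Change both relaxed orderings on flag to release/acquire and the outcome becomes impossible everywhere.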
The result is that a lot of software dealing with shared memory is incorrect because it does not have a memory barrier that it should have, or inefficient on cheap hardware with expensive memory barriers because it uses more memory barriers than the memory model requires. A program may even be incorrect in one place and have superfluous memory barriers in another.

Or programmers just don't do this stuff at all (as advocated by jseigh), and instead just write sequential programs, or use bottled solutions that are often a lot more expensive than superfluous memory barriers. E.g., in Gforth the primary inter-thread communication mechanism is currently implemented with pipes, involving the system calls read() and write(). And Bernd Paysan, who implemented that, is a really good programmer; I am sure he would be able to wrap his head around the whole memory-model stuff and implement something much more efficient, but that would take time that he obviously prefers to spend on more productive things.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>