From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Memory ordering
Date: Sat, 16 Nov 2024 07:46:17 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Nov16.084617@mips.complang.tuwien.ac.at>
References: <vfono1$14l9r$1@dont-email.me> <vgm4vj$3d2as$1@dont-email.me> <vgm5cb$3d2as$3@dont-email.me> <YfxXO.384093$EEm7.56154@fx16.iad> <vh4530$2mar5$1@dont-email.me> <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com> <vh5t5b$312cl$2@dont-email.me> <5yqdnU9eL_Y_GKv6nZ2dnZfqn_GdnZ2d@supernews.com> <2024Nov15.082512@mips.complang.tuwien.ac.at> <vh7rlr$3fu9i$1@dont-email.me> <2024Nov15.182737@mips.complang.tuwien.ac.at> <vh8cbo$3j8c5$1@dont-email.me>

BGB <cr88192@gmail.com> writes:
>The tradeoff is more about implementation cost, performance, etc.

Yes. And the "etc." includes "ease of programming".

>Weak model:
>  Cheaper (and simpler) to implement;

Yes.

>  Performs better when there is no need to synchronize memory;

Not in general. For a cheap multiprocessor implementation, yes. A sophisticated implementation of sequential consistency can just storm ahead in that case and achieve the same performance; it just has to keep checkpoints around in case there is a need to synchronize memory.

>  Performs worse when there is need to synchronize memory;

With a cheap multiprocessor implementation, yes. In general, no: any sequentially consistent implementation is also an implementation of every weaker memory model, and the memory barriers become nops in that kind of implementation. OK, nops still have a cost, but it's very close to 0 on a modern CPU.

Another potential performance disadvantage of sequential consistency, even with a sophisticated implementation: if you have an algorithm that actually works correctly even when it gets stale data from a load (with some limits on the staleness), the sophisticated SC implementation will incur the latency of making the load non-stale, while that latency will not occur, or will be smaller, in a similarly sophisticated implementation of an appropriate weak consistency model. However, given that access to actually-shared memory is slow even on weakly consistent hardware, software usually takes measures to avoid having many such accesses, so that cost will usually be minuscule.

What you missed: the big cost of weak memory models and cheap hardware implementations of them is in the software:

* For correctness, the safe way is to insert a memory barrier between
  any two memory operations.

* For performance (on cheap implementations of weak memory models) you
  want to execute as few memory barriers as possible.

* You cannot use testing to find out whether you have enough (and the
  right) memory barriers.

That's not only because the involved threads may not be in the right state during testing to uncover the incorrectness, but also because the hardware used for testing may actually have stronger consistency than the memory model, so some kinds of bugs will never show up in testing on that hardware, even when the threads reach the right state. And testing is still the go-to solution for software people for finding errors (nowadays even glorified by continuous integration and modern fuzz-testing approaches).
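As a concrete illustration of the hardware effect, consider the classic message-passing litmus test, sketched here in C11 atomics (a minimal example, not from any particular program; the names and values are arbitrary, and the deliberately relaxed orderings are the bug):

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int data, flag;

void *producer(void *arg)
{
  atomic_store_explicit(&data, 42, memory_order_relaxed);
  /* BUG: should be memory_order_release, to order the data store
     before the flag store */
  atomic_store_explicit(&flag, 1, memory_order_relaxed);
  return NULL;
}

void *consumer(void *arg)
{
  /* BUG: should be memory_order_acquire, to order the data load
     after the flag load */
  while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
    ;                                /* spin until the flag is set */
  if (atomic_load_explicit(&data, memory_order_relaxed) != 42)
    printf("saw stale data\n");
  return NULL;
}

int main(void)
{
  for (long i = 0; i < 1000000; i++) {   /* repeated trials */
    pthread_t p, c;
    atomic_store(&data, 0);
    atomic_store(&flag, 0);
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
  }
  return 0;
}

Under the C memory model the consumer may see flag==1 and still read 0 from data, and ARM or POWER hardware can actually deliver that outcome. On AMD64, whose TSO hardware model reorders neither the two stores nor the two loads, "saw stale data" will (compiler reordering aside) never be printed, no matter how long you test. Change both relaxed orderings on flag to release/acquire and the outcome becomes impossible everywhere.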
The result is that a lot of software dealing with shared memory is incorrect because it does not have a memory barrier that it should have, or inefficient on cheap hardware with expensive memory barriers because it uses more memory barriers than the memory model requires. A program may even be incorrect in one place and have superfluous memory barriers in another.

Or programmers just don't do this stuff at all (as advocated by jseigh), and instead just write sequential programs, or use bottled solutions that are often a lot more expensive than superfluous memory barriers. E.g., in Gforth the primary inter-thread communication mechanism is currently implemented with pipes, involving the system calls read() and write(). And Bernd Paysan, who implemented that, is a really good programmer; I am sure he would be able to wrap his head around the whole memory-model stuff and implement something much more efficient, but that would take time that he obviously prefers to spend on more productive things.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>