Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jseigh <jseigh_es00@xemaps.com>
Newsgroups: comp.arch
Subject: Re: Memory ordering
Date: Tue, 3 Dec 2024 08:59:18 -0500
Organization: A noiseless patient Spider
Lines: 85
Message-ID: <vin2rp$3ofc$1@dont-email.me>
References: <vfono1$14l9r$1@dont-email.me> <vh4530$2mar5$1@dont-email.me> <-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com> <vh5t5b$312cl$2@dont-email.me> <5yqdnU9eL_Y_GKv6nZ2dnZfqn_GdnZ2d@supernews.com> <2024Nov15.082512@mips.complang.tuwien.ac.at> <vh7ak1$3cm56$1@dont-email.me> <20241115152459.00004c86@yahoo.com> <vh8bn7$3j6ql$1@dont-email.me> <vhb2dc$73fe$1@dont-email.me> <vhct2q$lk1b$2@dont-email.me> <2024Nov17.161752@mips.complang.tuwien.ac.at> <vhh16e$1lp5h$1@dont-email.me> <2024Dec3.100144@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 03 Dec 2024 14:59:21 +0100 (CET)
Injection-Info: dont-email.me; posting-host="e736505633d7c60fe756f6fe2691088a"; logging-data="123372"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/0yGN6PQtbuWNeqvU8OugU"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:+a2FTm+rbcX2DG/5IfBSVlCINrI=
Content-Language: en-US
In-Reply-To: <2024Dec3.100144@mips.complang.tuwien.ac.at>
Bytes: 5285

On 12/3/24 04:01, Anton Ertl wrote:
> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>> On 11/17/2024 7:17 AM, Anton Ertl wrote:
>>> jseigh <jseigh_es00@xemaps.com> writes:
>>>> Or maybe disable reordering or optimization altogether
>>>> for those target architectures.
>>>
>>> So you want to throw out the baby with the bathwater.
>>
>> No, keep the weak order systems and not throw them out wrt a system that
>> is 100% seq_cst? Perhaps? What am I missing here?
> Disabling optimization altogether costs a lot; e.g., look at
> <http://www.complang.tuwien.ac.at/anton/bentley.pdf>: if you compare
> the lines for clang-3.5 -O0 with clang-3.5 -O3, you see a factor >2.5
> for the tsp9 program. For gcc-5.2.0 the difference is even bigger.
>
> That's why jseigh and people like him (I have read that suggestion
> several times before) love to suggest disabling optimization
> altogether. It's a straw man that does not even need beating up. Of
> course they usually don't show results for the supposed benefits of
> the particular "optimization" they advocate (or the drawbacks of
> disabling it), and jseigh follows this pattern nicely.
>

That wasn't a serious suggestion.

The compiler is allowed to reorder code as long as it knows the
reordering can't be observed or detected. If there are places in the
code where it doesn't know this, it won't optimize across them, more
or less.

If you are writing code with concurrent shared data access, you need
to let the compiler know. One way is with locks. Another way, for
lock-free data structures, is with memory barriers. Even if you had
seq_cst hardware you would still need to tell the compiler, so seq_cst
hardware doesn't buy you any less effort from a programming point of
view.

If you are arguing that lock-free programming with memory barriers is
hard, so let's use locks for everything (disregarding that locks have
acquire/release semantics that the compiler has to be aware of and
programmers aren't always aware of), you might want to consider the
following performance timings on some stuff I've been playing with.

unsafe      53.344 nsecs (     0.000)      54.547 nsecs (     0.000)*
smr         53.828 nsecs (     0.484)      55.485 nsecs (     0.939)
smrlite     53.094 nsecs (     0.000)      54.329 nsecs (     0.000)
arc        306.674 nsecs (   253.330)     313.931 nsecs (   259.384)
rwlock     730.012 nsecs (   676.668)     830.340 nsecs (   775.793)
mutex    2,881.690 nsecs ( 2,828.346)   3,305.382 nsecs ( 3,250.835)

smr is smrproxy, something like user space rcu.
smrlite is smr w/o the thread_local access, so I have an idea how much
that adds to the overhead. arc is arcproxy, lock-free reference count
based deferred reclamation. rwlock and mutex are what their names
would suggest. unsafe is no synchronization, to get a base timing on
the reader loop body.

2nd col is per-loop read lock/unlock average cpu time
3rd col is with the unsafe time subtracted out
4th col is average elapsed time
5th col is with the unsafe time subtracted out

cpu time doesn't measure lock wait time, so elapsed time gives some
indication of that. Timings are for 8 reader threads and 1 writer
thread on a 4 core / 8 hw thread machine.

smrproxy is the version that doesn't need the seq_cst memory barrier,
so it is pretty fast (you are welcome). arc, rwlock, and mutex use
interlocked instructions, which cause cache thrashing. On top of
that, mutex will not scale well with the number of threads. rwlock
depends on how much write locking is going on; with few write updates,
it will look more like arc.

There are going to be applications where that 2 to 3+ orders of
magnitude difference in overhead is going to matter a lot.

Joe Seigh