Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Thomas Koenig <tkoenig@netcologne.de>
Newsgroups: comp.arch
Subject: Re: Faster div or 1/sqrt approximations (was: Continuations)
Date: Sat, 20 Jul 2024 21:58:59 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 101
Message-ID: <v7hbv3$3nb28$1@dont-email.me>
References: <v6tbki$3g9rg$1@dont-email.me> <47689j5gbdg2runh3t7oq2thodmfkalno6@4ax.com>
 <v71vqu$gomv$9@dont-email.me> <116d9j5651mtjmq4bkjaheuf0pgpu6p0m8@4ax.com>
 <f8c6c5b5863ecfc1ad45bb415f0d2b49@www.novabbs.org> <7u7e9j5dthm94vb2vdsugngjf1cafhu2i4@4ax.com>
 <0f7b4deb1761f4c485d1dc3b21eb7cb3@www.novabbs.org> <v78soj$1tn73$1@dont-email.me>
 <v7dsf2$3139m$1@dont-email.me> <277c774f1eb48be79cd148dfc25c4367@www.novabbs.org>
 <v7ei4f$34uc2$1@dont-email.me> <20240721002344.00001da7@yahoo.com>
Injection-Date: Sat, 20 Jul 2024 23:59:00 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="ccdff6c1e7e8e7cd4872288a041e1c0d";
 logging-data="3910728"; mail-complaints-to="abuse@eternal-september.org";
 posting-account="U2FsdGVkX18EV0EnvhnuWzCQY69K3GpcnCG6zJhklp8="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:BWF1+WAiYoi+FZleCB0ZhJbbmeU=
Bytes: 5156

Michael S <already5chosen@yahoo.com> wrote:
> On Fri, 19 Jul 2024 20:25:51 -0000 (UTC)
> Thomas Koenig <tkoenig@netcologne.de> wrote:
>
>> MitchAlsup1 <mitchalsup@aol.com> wrote:
>>
>> > I, personally, have found many Newton-Raphson iterators that
>> > converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.
>>
>> I can well believe that.
>>
>> It is interesting to see what different architectures offer for
>> faster reciprocals.
>>
>> POWER has fre and fres (double and single version) for approximate
>> division, which are accurate to 1/256.  These operations are quite
>> fast, 4 to 7 cycles on POWER9, with up to 4 instructions per cycle
>> so obviously fully pipelined.
>> With 1/256 accuracy, this could
>> actually be the original Quake algorithm (or its modification)
>> with a single Newton step, but this is of course much better in
>> hardware where exponent handling can be much simplified (and
>> done only once).
>>
>> x86_64 has rcpss, accurate to 1/6144, with (looking at the
>> instruction tables) 6 cycles for newer architectures, with a
>> throughput of 1/4.
>
> It seems you looked at the wrong instruction table.

[Note I was not writing about inverse square root, I was writing
about inverse].

I have to admit to being almost terminally confused by Intel
generation names, so I am likely to mix up what is old and what
is new.

> Here are some not very modern x86-64 cores:
> Arch      Latency  Throughput (scalar/128b/256b)
> Zen3      3        2/2/1
> Skylake   4        1/1/1
> Ice Lake  4        1/1/1
> Power9    5-7      4/2/N/A

Power9 has it for 128-bit, but not for 256 bits (it doesn't have
those registers), and if I read the handbook correctly, that
would also be 4 operations in parallel.

>
>> So, if your business depends on calculating many inaccurate
>> square roots, fast, buy a POWER :-)
>>
>
> If you have enough independent rsqrt to do, all four processors
> have the same theoretical peak throughput, but x86 tends to have more
> cores and to run at a faster clock. And lower latency makes achieving
> peak throughput easier. Also, depending on target precision, the higher
> initial precision of the x86 estimate means that sometimes you can get
> away with 1 less NR iteration.
>
> Also, if what you really need is sqrt rather than rsqrt, then depending
> on how much inaccuracy you can accept, sometimes on modern x86
> calculating an accurate sqrt can be a better solution than calculating
> an approximation.
> It is less likely to be the case on POWER9.

[table reformatted, hope I got this right]

> Accurate sqrt (single precision)
> Zen3      14       0.20/0.200/0.200
> Skylake   12       0.33/0.333/0.167
> Ice Lake  12       0.33/0.333/0.167
> Power9    26       0.20/0.095/N/A
>
> Accurate sqrt (double precision)
> Zen3      20       0.111/0.111/0.111
> Skylake   12       0.167/0.167/0.083
> Ice Lake  12       0.167/0.167/0.083
> Power9    36       0.111/0.067/N/A
>
>
>> Other architectures I have tried don't seem to have it.
>>
>
> Arm64 has it. It is called FRSQRTE.

Interesting that "gcc -O3 -ffast-math -mrecip" does not appear to use it.

>
>
>> Does it make sense?  Well, if you want to calculate lots of Arrhenius
>> equations, you don't need full accuracy and (like in Mitch's case)
>> exp has become as fast as division, then it could actually make a
>> lot of sense.  It is still possible to add Newton steps afterwards,
>> which is what gcc does if you add -mrecip -ffast-math.
>
> I don't know about POWER, but on x86 I wouldn't do it.
> I'd either use plain division, which on modern cores is quite fast,
> or use NR to calculate the normal reciprocal. x86 provides an initial
> estimate for that too (RCPSS).

Note that I was talking about the inverse in the first place.
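
To illustrate the "1 less NR iteration" point for the plain reciprocal, here is a minimal sketch in Python. It does not call fre, fres or RCPSS; `simulated_estimate` is a hypothetical stand-in that perturbs the exact reciprocal by a chosen relative error, and the accuracy figures 1/256 and 1/6144 are simply taken from the thread as quoted. The Newton step y <- y*(2 - x*y) turns a relative error e into e^2 in exact arithmetic, so the sketch counts how many steps are needed to get below a target, here 2^-24 as an assumed proxy for single precision. Rounding in real single-precision hardware is ignored.

```python
def simulated_estimate(x, rel_err):
    """Hypothetical stand-in for a hardware reciprocal estimate:
    the exact 1/x, deliberately off by the relative error rel_err."""
    return (1.0 / x) * (1.0 + rel_err)


def refine_reciprocal(x, estimate, steps=1):
    """Apply Newton-Raphson steps y <- y * (2 - x*y) to an estimate of 1/x."""
    y = estimate
    for _ in range(steps):
        y = y * (2.0 - x * y)
    return y


def relative_error(x, y):
    """Relative error of y as an approximation of 1/x, i.e. |x*y - 1|."""
    return abs(x * y - 1.0)


def steps_needed(x, rel_err, target):
    """Count Newton steps until the relative error drops to target or below."""
    y = simulated_estimate(x, rel_err)
    steps = 0
    while relative_error(x, y) > target:
        y = refine_reciprocal(x, y)
        steps += 1
    return steps


if __name__ == "__main__":
    target = 2.0 ** -24  # assumed target, roughly single precision
    for name, err in (("1/256 estimate", 1 / 256), ("1/6144 estimate", 1 / 6144)):
        print(name, "needs", steps_needed(3.0, err, target), "Newton step(s)")
```

Under these assumptions the 1/256 estimate needs two steps (1/256 squared is about 1.5e-5, still above 2^-24) and the 1/6144 estimate needs one (about 2.6e-8). Whether that saving survives on real hardware depends on the actual error bound of the instruction and on rounding in each step.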