Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Thomas Koenig <tkoenig@netcologne.de>
Newsgroups: comp.arch
Subject: Re: Faster div or 1/sqrt approximations (was: Continuations)
Date: Sat, 20 Jul 2024 21:58:59 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 101
Message-ID: <v7hbv3$3nb28$1@dont-email.me>
References: <v6tbki$3g9rg$1@dont-email.me>
 <47689j5gbdg2runh3t7oq2thodmfkalno6@4ax.com> <v71vqu$gomv$9@dont-email.me>
 <116d9j5651mtjmq4bkjaheuf0pgpu6p0m8@4ax.com>
 <f8c6c5b5863ecfc1ad45bb415f0d2b49@www.novabbs.org>
 <7u7e9j5dthm94vb2vdsugngjf1cafhu2i4@4ax.com>
 <0f7b4deb1761f4c485d1dc3b21eb7cb3@www.novabbs.org>
 <v78soj$1tn73$1@dont-email.me> <v7dsf2$3139m$1@dont-email.me>
 <277c774f1eb48be79cd148dfc25c4367@www.novabbs.org>
 <v7ei4f$34uc2$1@dont-email.me> <20240721002344.00001da7@yahoo.com>
Injection-Date: Sat, 20 Jul 2024 23:59:00 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="ccdff6c1e7e8e7cd4872288a041e1c0d";
	logging-data="3910728"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18EV0EnvhnuWzCQY69K3GpcnCG6zJhklp8="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:BWF1+WAiYoi+FZleCB0ZhJbbmeU=
Bytes: 5156

Michael S <already5chosen@yahoo.com> wrote:
> On Fri, 19 Jul 2024 20:25:51 -0000 (UTC)
> Thomas Koenig <tkoenig@netcologne.de> wrote:
>
>> MitchAlsup1 <mitchalsup@aol.com> wrote:
>> 
>> > I, personally, have found many Newton-Raphson iterators that
>> > converge faster using 1/SQRT(x) than using the SQRT(x) equivalent.  
>> 
>> I can well believe that.
>> 
>> It is interesting to see what different architectures offer for
>> faster reciprocals.
>> 
>> POWER has fre and fres (double and single version) for approximate
>> division, which are accurate to 1/256.  These operations are quite
>> fast, 4 to 7 cycles on POWER9, with up to 4 instructions per cycle
>> so obviously fully pipelined.  With 1/256 accuracy, this could
>> actually be the original Quake algorithm (or its modification)
>> with a single Newton step, but this is of course much better in
>> hardware where exponent handling can be much simplified (and
>> done only once).
>> 
>> x86_64 has rcpss, accurate to 1/6144, with (looking at the
>> instruction tables) a latency of 6 cycles on newer architectures
>> and a throughput of 1/4.
>
> It seems, you looked at the wrong instruction table.

[Note I was not writing about inverse square root, I was writing
about inverse].

I have to admit to being almost terminally confused by Intel
generation names, so I am likely to mix up what is old and what
is new.

> Here are some not very modern x86-64 cores:
> Arch     Latency Throughput (scalar/128b/256b)
> Zen3      3       2/2/1
> Skylake   4       1/1/1
> Ice Lake  4       1/1/1
> Power9    5-7     4/2/N/A

Power9 has it for 128-bit, but not for 256 bits (it doesn't have
those registers), and if I read the handbook correctly, that
would also be 4 operations in parallel.

>
>> So, if your business depends on calculating many inaccurate
>> square roots, fast, buy a POWER :-)
>> 
>
> If you have enough independent rsqrt operations to do, all four
> processors have the same theoretical peak throughput, but x86 tends to
> have more cores and to run at a faster clock. And lower latency makes
> achieving peak throughput easier. Also, depending on target precision,
> the higher initial precision of the x86 estimate means that sometimes
> you can get away with one fewer NR iteration.
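
To put rough numbers on the iteration count: a Newton step roughly
squares the relative error of the estimate, so the number of steps
depends on how many good bits you start with.  The sketch below is a
back-of-the-envelope model only.  The name nr_steps_needed and the
factor c are my own inventions for illustration, and the model ignores
rounding in the arithmetic itself.  For the reciprocal step c is 1,
for the rsqrt step it is about 1.5.

```c
/* Hypothetical helper: count Newton steps until the modelled relative
 * error drops to 2^-target_bits, assuming e_next = c * e * e.
 * Returns -1 if the model does not get there within 16 steps. */
int nr_steps_needed(double rel_err, double c, int target_bits)
{
    double target = 1.0;
    for (int i = 0; i < target_bits; i++)
        target *= 0.5;                    /* target = 2^-target_bits */

    for (int steps = 0; steps <= 16; steps++) {
        if (rel_err <= target)
            return steps;
        rel_err = c * rel_err * rel_err;  /* one quadratic step */
    }
    return -1;
}
```

With the accuracies quoted earlier and c = 1, starting from 1/256 this
model needs two steps to reach 24 bits, while starting from 1/6144 it
needs one.  For 53 bits both starting points need three.
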
>
> Also, if what you really need is sqrt rather than rsqrt, then depending
> on how much inaccuracy you can accept, sometimes on modern x86
> calculating an accurate sqrt can be a better solution than calculating
> an approximation. It is less likely to be the case on POWER9.

[table reformatted, hope I got this right]

> Accurate sqrt (single precision)
> Zen3      14      0.20/0.200/0.200
> SkyLake   12      0.33/0.333/0.167
> Ice Lake  12      0.33/0.333/0.167
> Power9    26      0.20/0.095/N/A
>
> Accurate sqrt (double precision)
> Zen3      20      0.111/0.111/0.111
> Skylake   12      0.167/0.167/0.083
> Ice Lake  12      0.167/0.167/0.083
> Power9    36      0.111/0.067/N/A
>
>
>> Other architectures I have tried don't seem to have it.
>> 
>
> Arm64 has it. It is called FRSQRTE.

Interesting that "gcc -O3 -ffast-math -mrecip" does not
appear to use it.

>
>
>> Does it make sense? Well, if you want to calculate lots of Arrhenius
>> equations, you don't need full accuracy and (like in Mitch's case)
>> exp has become as fast as division, then it could actually make a
>> lot of sense.  It is still possible to add Newton steps afterwards,
>> which is what gcc does if you add -mrecip -ffast-math.
>
> I don't know about POWER, but on x86 I wouldn't do it.
> I'd either use plain division, which on modern cores is quite fast,
> or use NR to calculate a normal reciprocal. x86 provides an initial
> estimate for that too (RCPSS).

Note that I was talking about the inverse in the first place.
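
For reference, the refinement for the inverse is the Newton step
x' = x*(2 - a*x), which turns a relative error e into e*e.  Below is a
small C sketch of it.  The function crude_recip is a hypothetical
stand-in for a hardware estimate such as fre or rcpss: it simply
truncates the exact quotient to 8 fraction bits, so its relative error
stays below 1/256.  It is not how any of those instructions actually
work, and the other function names are made up as well.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hardware reciprocal estimate:
 * compute 1/a and keep only the top 8 fraction bits, which gives a
 * relative error below 1/256.  Assumes a is finite, non-zero and
 * that 1/a is a normal number. */
static float crude_recip(float a)
{
    float r = 1.0f / a;
    uint32_t bits;
    memcpy(&bits, &r, sizeof bits);
    bits &= 0xFFFF8000u;            /* clear the low 15 fraction bits */
    memcpy(&r, &bits, sizeof r);
    return r;
}

/* One Newton-Raphson step for 1/a. */
static float nr_recip_step(float a, float x)
{
    return x * (2.0f - a * x);
}

/* Estimate followed by two steps, which should be enough for roughly
 * single precision when starting from 8 good bits. */
static float recip_two_steps(float a)
{
    float x = crude_recip(a);
    x = nr_recip_step(a, x);
    return nr_recip_step(a, x);
}
```

In hardware the estimate replaces the division in crude_recip, and
only the multiply and subtract of the Newton step remain.
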