Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Terje Mathisen <terje.mathisen@tmsw.no>
Newsgroups: comp.arch
Subject: Re: Making Lemonade (Floating-point format changes)
Date: Mon, 20 May 2024 14:22:00 +0200
Organization: A noiseless patient Spider
Lines: 87
Message-ID: <v2ff99$3vq7q$1@dont-email.me>
References: <abe04jhkngt2uun1e7ict8vmf1fq8p7rnm@4ax.com>
 <memo.20240512203459.16164W@jgd.cix.co.uk> <v1rab7$2vt3u$1@dont-email.me>
 <20240513151647.0000403f@yahoo.com> <v1to2h$3km86$1@dont-email.me>
 <20240514221659.00001094@yahoo.com> <v234nr$12p27$1@dont-email.me>
 <20240516001628.00001031@yahoo.com> <v2cn4l$3bpov$1@dont-email.me>
 <v2d9sv$3fda0$1@dont-email.me> <20240519203403.00003e9b@yahoo.com>
 <v2etr0$3s9r0$1@dont-email.me> <20240520113045.000050c5@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 20 May 2024 14:22:02 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="2442f757fe0e90d7c629db09088092de";
 logging-data="4188410"; mail-complaints-to="abuse@eternal-september.org";
 posting-account="U2FsdGVkX19rlcHyKulCh452cKZzTyYIKk78Hr7sWFcIqBNGtcQXqw=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0 SeaMonkey/2.53.18.2
Cancel-Lock: sha1:mDMmz/RaDGwMj4ZILTeqxtqKqSU=
In-Reply-To: <20240520113045.000050c5@yahoo.com>
Bytes: 5394

Michael S wrote:
> On Mon, 20 May 2024 09:24:16 +0200
> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>
>> Michael S wrote:
>>> On Sun, 19 May 2024 18:37:51 +0200
>>> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>>>
>>>> Thomas Koenig wrote:
>>>>> So, I did some more measurements on the POWER9 machine, and it
>>>>> came to around 18 cycles per FMA. Compared to the 13 cycles for
>>>>> the FMA instruction, this actually sounds reasonable.
>>>>>
>>>>> The big problem appears to be that, in this particular
>>>>> implementation, multiplication is not pipelined, but done
>>>>> piecewise by addition. This can be explained by the fact that
>>>>> this is mostly a decimal unit, with the 128-bit QP just added as
>>>>> an afterthought, and decimal multiplication does not happen all
>>>>> that often.
>>>>>
>>>>> A fully pipelined FMA unit capable of 128-bit arithmetic would be
>>>>> an entirely different beast; I would expect a throughput of 1 per
>>>>> cycle and a latency of (maybe) one cycle more than 64-bit FMA.
>>>>>
>>>> The FMA normalizer has to handle a maximally bad cancellation, so
>>>> it needs to be around 350 bits wide. Mitch knows of course, but I'm
>>>> guessing that this could at least be close to needing an extra
>>>> cycle on its own and/or heroic hardware?
>>>>
>>>> Terje
>>>>
>>>
>>> Why so wide?
>>> Assuming that subnormal multiplier inputs are normalized before
>>
>> They are not; this is part of what you do to make subnormal numbers
>> exactly the same speed as normal inputs.
>>
>> Terje
>>
>
> 1. I am not sure that "the same speed" is a worthy goal even for
> binary64 (for binary32 it is).
> 2. It certainly does not sound like a worthy goal for binary128,
> where the probability of encountering sub-normal inputs in real user
> code, rather than in test vectors, is lower than for DP by another
> order of magnitude.
> 3. Even if, for reasons unclear to me, it is considered the goal, it
> can be achieved by the introduction of one more pipeline stage
> everywhere.
> Since we are discussing a high-latency design akin to POWER9, the
> relative cost of another stage would be lower. BTW, according to the
> POWER9 manual, even for SP/DP FMA the latency is not constant. It
> varies from 5 to 7.
>
> So, IMHO, what you do to handle sub-normal inputs should depend on
> what ends up smaller or faster, not on some abstract principles. For
> a less important unit, like binary128, 'smaller' would likely take
> precedence over 'faster'. It's possible that you'll end up not doing
> pre-normalization, but the reason for it would be different from
> 'same speed'.
>
> Besides, pre-normalization vs. wider post-normalization are not the
> only available choices. When the multiplier is naturally segmented
> into 57-bit sections, there exists, for example, the option of
> pre-normalization by full sections. It looks very simple on the front
> and saves quite a lot of shifter width on the back.
>
> But the best option is probably the one described in the above post
> by Mitch. If I understood his post correctly, he suggests having two
> alignment stages: one after the multiplication and another one after
> the add/sub. The shift count for the first stage is calculated from
> the inputs in parallel with the multiplication. The first alignment
> stage does not try to achieve perfect normalization, but it does
> enough to cut the width of the following adder from 3N to 2N+eps.

I do agree with Mitch's suggestion: Allow subnormal inputs but do the
partial muls from the top and move the normalization starting point
down for each all-zero input block. In an extreme case (subnormal x
subnormal) this would allow you to discard a lot of partial products.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
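
PS: A quick toy model of the block-skipping idea, in case anyone wants
to play with the numbers. This is only a Python sketch, not any actual
hardware pipeline; the 57-bit section width is simply borrowed from
Michael's example above, so binary128's 113-bit significand splits
into two sections here:

# Toy model: multiply two binary128 significands section by section,
# skipping any partial product whose input section is all zero.
# (Sketch only -- arbitrary-precision ints stand in for the hardware.)

SECTION = 57                         # section width, borrowed from the example
PREC    = 113                        # binary128 significand incl. hidden bit
NSEC    = -(-PREC // SECTION)        # sections per operand (2 here)
MASK    = (1 << SECTION) - 1

def sections(sig):
    """Split an integer significand into NSEC sections, LSB section first."""
    return [(sig >> (i * SECTION)) & MASK for i in range(NSEC)]

def sectioned_mul(a, b):
    """Return (full product, number of section products actually issued)."""
    product, issued = 0, 0
    for i, asec in enumerate(sections(a)):
        if asec == 0:                # all-zero section: nothing to generate
            continue
        for j, bsec in enumerate(sections(b)):
            if bsec == 0:
                continue
            product += (asec * bsec) << ((i + j) * SECTION)
            issued  += 1
    return product, issued

if __name__ == "__main__":
    normal    = (1 << (PREC - 1)) | 12345    # hidden bit set
    subnormal = 0x3FFF                       # top section all zero
    for a, b in [(normal, normal), (subnormal, normal), (subnormal, subnormal)]:
        p, n = sectioned_mul(a, b)
        assert p == a * b                    # exact double-width product
        print(f"{n} of {NSEC * NSEC} section products issued")

For subnormal x subnormal only one of the four section products is
non-zero, and those all-zero top sections are exactly what lets the
normalization starting point move down.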