From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: binary128 implementation
Date: Thu, 23 May 2024 21:39:12 +0000
Organization: Rocksolid Light
Message-ID: <c5bd7e224b89fb92ab3bcf20a4ec05b3@www.novabbs.org>

BGB-Alt wrote:

> On 5/20/2024 7:28 AM, Terje Mathisen wrote:
>> Anton Ertl wrote:
>>> Michael S <already5chosen@yahoo.com> writes:
>>>> On Sun, 19 May 2024 18:37:51 +0200
>>>> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>>>>> The FMA normalizer has to handle a maximally bad cancellation, so it
>>>>> needs to be around 350 bits wide. Mitch knows of course but I'm
>>>>> guessing that this could at least be close to needing an extra cycle
>>>>> on its own and/or heroic hardware?
>>>>>
>>>>> Terje
>>>>
>>>> Why so wide?
>>>> Assuming that subnormal multiplier inputs are normalized before
>>>> multiplication, the product of multiplication is 226 bits
>>>
>>> The product of the mantissa multiplication is at most 226 bits even if
>>> you don't normalize subnormal numbers. For cancellation to play a
>>> role the addend has to be close in absolute value and have the
>>> opposite sign as the product, so at most one additional bit comes into
>>> play for that case (for something like the product being
>>> 0111111... and the addend being -10000000...).
>>
>> This is the part of Mitch's explanation that I have never been able to
>> totally grok. I do think you could get away with fewer bits, but only
>> if you can collapse the extra mantissa bits into sticky while aligning
>> the product with the addend. If that takes too long, or it turns out
>> to be easier/faster in hardware to simply work with a much wider
>> mantissa, then I'll accept that.
>>
>> I don't think I've ever seen Mitch make a mistake on anything like
>> this!
>>
> It is a mystery, though it seems like maybe Binary128 FMA could be done
> in software via an internal 384-bit intermediate?...
>
> My thinking is, say, 112*112, padded by 2 bits (so 114 bits), leads to
> 228 bits. If one adds another 116 bits (for maximal FADD), this comes
> to 344.

Maximal product with minimal augend::

    pppppppp-pppppppp-aaaaaaaa

Maximal augend with minimal product::

    aaaaaaaa-pppppppp-pppppppp

So the way one builds HW is to have the augend shifter cover the whole
4× length and place the product in the middle::

    max                               min
    aaaaaaaa-aaaaaaaa-aaaaaaaa-aaaaaaaa
             pppppppp-pppppppp

The output of the product is still in carry-save form and the augend is
in pure binary, so the adder is 3-input for 2×-width. This generates a
carry into the high-order incrementor. So one has a sticky generator
for the right-hand-side augend, and an incrementor for the left-hand-side
augend.
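The widths being traded around here are easy to re-derive; a quick sketch (Python, variable names illustrative) restating the 226-bit product, BGB's 344-bit software estimate, and the 4×-minus-2 hardware figure from the binary128 field widths:

```python
# Width bookkeeping for binary128 FMA, restating the post's arithmetic.
# Field widths per IEEE 754 binary128; variable names are made up here.

FRAC = 112                   # stored fraction bits
SIG = FRAC + 1               # 113-bit significand with the hidden bit
PRODUCT = 2 * SIG            # full product of two significands: 226 bits

# BGB's software estimate: 112x112 padded by 2 bits (114), doubled (228),
# plus another 116 bits for a maximal FADD alignment:
BGB_ESTIMATE = 2 * (FRAC + 2) + 116    # 344

# Mitch's hardware intermediate: 4x the fraction width, minus 2 bits:
HW_INTERMEDIATE = 4 * FRAC - 2         # 446

print(PRODUCT, BGB_ESTIMATE, HW_INTERMEDIATE)   # 226 344 446
```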
When doing high-speed denormals one cannot count on the left-hand side
of the product to have HoBs set, with standard ramifications (imagine a
denormal product and a denormal augend, and you want the right answer.)

Any way you cook it, you have a 4× wide intermediate (minus 2 bits,
IIRC): 4×112 = 448, -2 = 446. There is a reason these things are not
standard at this point of technology. Could you do it (IEEE accuracy)
with less HW--yes, but only if you allow certain special cases to take
more cycles in calculation.

At a certain point (a point made by Terje) it is easier to implement
with wide integer calculations, 128+128 and/or 128*128, along with
double-width shifts, inserts, and extracts.

IEEE did not make these things any easier by having the 2×-std-width
fraction be 2×+3 bits long, requiring 8 multiplications with minimal HW
instead of 4 multiplications. On the other hand, IBM did us no favors
with Hex FP either (keeping the exponent size the same and having 2×+8
bits of fraction.)

> In this case, 384 bits would be because my "_BitInt" support code pads
> things to a multiple of 128 bits (for integer types larger than 256
> bits).
>
> It isn't fast, but I am not against having Binary128 being slower,
> since if one is using Binary128 ("long double" or "__float128" in this
> case), it is likely the case that precision is more a priority than
> speed.
>
> Though, as of yet, there is no Binary128 FMA operation (in the software
> runtime). Could potentially add this in theory.
>
> I guess, maybe also possible could be whether to add the
> FADDX/FMULX/FMACX instructions in a form where they are allowed, but
> will be turned into runtime traps (would likely route them through the
> TLB Miss ISR, which thus far has ended up as a catch-all for this sort
> of thing...).
>
> Though, likely more efficient would still be "just use the runtime
> calls".

>> Terje
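The "wide integer calculations" route is easy to prototype with arbitrary-precision integers. A hedged sketch follows -- this is not BGB's runtime; the (sign, exp, sig) unpacked format, the function name, and the unbounded exponent range are all assumptions made here for illustration -- of an exact round-to-nearest-even FMA on unpacked significands:

```python
def fma_sig(a, b, c, prec=113):
    """Exact FMA on unpacked floats: each of a, b, c is (sign, exp, sig),
    representing (-1)**sign * sig * 2**exp, with sig either 0 or holding
    exactly `prec` bits (hidden bit included).  Returns a*b + c rounded
    to `prec` bits, round-to-nearest-even.  Exponents are unbounded here,
    so there is no overflow/underflow/denormal handling as real binary128
    would need.  Illustrative sketch, not a production routine."""
    (sa, ea, ma), (sb, eb, mb), (sc, ec, mc) = a, b, c
    # Exact signed product (2*prec bits) and signed addend.
    pv, pe = (-1) ** (sa ^ sb) * ma * mb, ea + eb
    cv = (-1) ** sc * mc
    # Align both onto the smaller exponent and add exactly -- this is the
    # full-width intermediate the thread is discussing.
    e = min(pe, ec)
    total = (pv << (pe - e)) + (cv << (ec - e))
    if total == 0:
        return (0, 0, 0)
    s, m = (1, -total) if total < 0 else (0, total)
    # Normalize m * 2**e to `prec` significant bits, round-to-nearest-even.
    shift = m.bit_length() - prec
    if shift <= 0:
        return (s, e + shift, m << -shift)      # exact; widen to prec bits
    q, low = m >> shift, m & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if low > half or (low == half and (q & 1)):
        q += 1
        if q.bit_length() > prec:               # rounding carried out: renormalize
            q >>= 1
            shift += 1
    return (s, e + shift, q)

# Example at 4-bit precision: 1.5 * 2.5 + 0.25 = 4.0, i.e. (0, -1, 8).
print(fma_sig((0, -3, 12), (0, -2, 10), (0, -5, 8), prec=4))  # (0, -1, 8)
```

Because Python integers are unbounded, the maximally bad cancellation case falls out for free; the hardware discussion above is about bounding exactly this intermediate.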