From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: binary128 implementation
Date: Thu, 23 May 2024 21:39:12 +0000
Organization: Rocksolid Light
Message-ID: <c5bd7e224b89fb92ab3bcf20a4ec05b3@www.novabbs.org>

BGB-Alt wrote:

> On 5/20/2024 7:28 AM, Terje Mathisen wrote:
>> Anton Ertl wrote:
>>> Michael S <already5chosen@yahoo.com> writes:
>>>> On Sun, 19 May 2024 18:37:51 +0200
>>>> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>>>>> The FMA normalizer has to handle a maximally bad cancellation, so it
>>>>> needs to be around 350 bits wide. Mitch knows of course but I'm
>>>>> guessing that this could at least be close to needing an extra cycle
>>>>> on its own and/or heroic hardware?
>>>>>
>>>>> Terje
>>>>
>>>> Why so wide?
>>>> Assuming that subnormal multiplier inputs are normalized before
>>>> multiplication, the product of multiplication is 226 bits
>>>
>>> The product of the mantissa multiplication is at most 226 bits even if
>>> you don't normalize subnormal numbers. For cancellation to play a
>>> role the addend has to be close in absolute value and have the
>>> opposite sign as the product, so at most one additional bit comes into
>>> play for that case (for something like the product being
>>> 0111111... and the addend being -10000000...).
>>
>> This is the part of Mitch's explanation that I have never been able to
>> totally grok. I do think you could get away with fewer bits, but only
>> if you can collapse the extra mantissa bits into sticky while aligning
>> the product with the addend. If that takes too long, or it turns out
>> to be easier/faster in hardware to simply work with a much wider
>> mantissa, then I'll accept that.
>>
>> I don't think I've ever seen Mitch make a mistake on anything like
>> this!
>>
> It is a mystery, though it seems like maybe Binary128 FMA could be done
> in software via an internal 384-bit intermediate?...
>
> My thinking is, say, 112*112, padded by 2 bits (so 114 bits), leads to
> 228 bits. If one adds another 116 bits (for maximal FADD), this comes
> to 344.

Maximal product with minimal augend::

    pppppppp-pppppppp-aaaaaaaa

Maximal augend with minimal product::

    aaaaaaaa-pppppppp-pppppppp

So the way one builds HW is to have the augend shifter cover the whole
4× length and place the product in the middle::

    max                               min
    aaaaaaaa-aaaaaaaa-aaaaaaaa-aaaaaaaa
             pppppppp-pppppppp

The output of the product is still in carry-save form and the augend is
in pure binary, so the adder is 3-input for 2×-width. This generates a
carry into the high-order incrementor. So one has a sticky generator
for the right-hand-side augend, and an incrementor for the left-hand-side
augend.
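The widths being traded around here are easy to re-derive; a quick sketch (Python, variable names illustrative) restating the 226-bit product, BGB's 344-bit software estimate, and the 4×-minus-2 hardware figure from the binary128 field widths:

```python
# Width bookkeeping for binary128 FMA, restating the post's arithmetic.
# Field widths per IEEE 754 binary128; variable names are made up here.

FRAC = 112                   # stored fraction bits
SIG = FRAC + 1               # 113-bit significand with the hidden bit
PRODUCT = 2 * SIG            # full product of two significands: 226 bits

# BGB's software estimate: 112x112 padded by 2 bits (114), doubled (228),
# plus another 116 bits for a maximal FADD alignment:
BGB_ESTIMATE = 2 * (FRAC + 2) + 116    # 344

# Mitch's hardware intermediate: 4x the fraction width, minus 2 bits:
HW_INTERMEDIATE = 4 * FRAC - 2         # 446

print(PRODUCT, BGB_ESTIMATE, HW_INTERMEDIATE)   # 226 344 446
```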
When doing high-speed denormals one cannot count on the left-hand side
of the product to have HoBs set, with standard ramifications (imagine a
denormal product and a denormal augend, and you want the right answer.)

Any way you cook it, you have a 4× wide intermediate (minus 2 bits,
IIRC): 4×112 = 448, -2 = 446. There is a reason these things are not
standard at this point of technology. Could you do it (IEEE accuracy)
with less HW--yes, but only if you allow certain special cases to take
more cycles in calculation.

At a certain point (a point made by Terje) it is easier to implement
with wide integer calculations, 128+128 and/or 128*128, along with
double-width shifts, inserts, and extracts.

IEEE did not make these things any easier by having the 2×-std-width
fraction be 2×+3 bits long, requiring 8 multiplications with minimal HW
instead of 4 multiplications. On the other hand, IBM did us no favors
with Hex FP either (keeping the exponent size the same and having 2×+8
bits of fraction.)

> In this case, 384 bits would be because my "_BitInt" support code pads
> things to a multiple of 128 bits (for integer types larger than 256
> bits).
>
> It isn't fast, but I am not against having Binary128 being slower,
> since if one is using Binary128 ("long double" or "__float128" in this
> case), it is likely the case that precision is more a priority than
> speed.
>
> Though, as of yet, there is no Binary128 FMA operation (in the software
> runtime). Could potentially add this in theory.
>
> I guess, maybe also possible could be whether to add the
> FADDX/FMULX/FMACX instructions in a form where they are allowed, but
> will be turned into runtime traps (would likely route them through the
> TLB Miss ISR, which thus far has ended up as a catch-all for this sort
> of thing...).
>
> Though, likely more efficient would still be "just use the runtime
> calls".

>> Terje
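The "wide integer calculations" route is easy to prototype with arbitrary-precision integers. A hedged sketch follows -- this is not BGB's runtime; the (sign, exp, sig) unpacked format, the function name, and the unbounded exponent range are all assumptions made here for illustration -- of an exact round-to-nearest-even FMA on unpacked significands:

```python
def fma_sig(a, b, c, prec=113):
    """Exact FMA on unpacked floats: each of a, b, c is (sign, exp, sig),
    representing (-1)**sign * sig * 2**exp, with sig either 0 or holding
    exactly `prec` bits (hidden bit included).  Returns a*b + c rounded
    to `prec` bits, round-to-nearest-even.  Exponents are unbounded here,
    so there is no overflow/underflow/denormal handling as real binary128
    would need.  Illustrative sketch, not a production routine."""
    (sa, ea, ma), (sb, eb, mb), (sc, ec, mc) = a, b, c
    # Exact signed product (2*prec bits) and signed addend.
    pv, pe = (-1) ** (sa ^ sb) * ma * mb, ea + eb
    cv = (-1) ** sc * mc
    # Align both onto the smaller exponent and add exactly -- this is the
    # full-width intermediate the thread is discussing.
    e = min(pe, ec)
    total = (pv << (pe - e)) + (cv << (ec - e))
    if total == 0:
        return (0, 0, 0)
    s, m = (1, -total) if total < 0 else (0, total)
    # Normalize m * 2**e to `prec` significant bits, round-to-nearest-even.
    shift = m.bit_length() - prec
    if shift <= 0:
        return (s, e + shift, m << -shift)      # exact; widen to prec bits
    q, low = m >> shift, m & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    if low > half or (low == half and (q & 1)):
        q += 1
        if q.bit_length() > prec:               # rounding carried out: renormalize
            q >>= 1
            shift += 1
    return (s, e + shift, q)

# Example at 4-bit precision: 1.5 * 2.5 + 0.25 = 4.0, i.e. (0, -1, 8).
print(fma_sig((0, -3, 12), (0, -2, 10), (0, -5, 8), prec=4))  # (0, -1, 8)
```

Because Python integers are unbounded, the maximally bad cancellation case falls out for free; the hardware discussion above is about bounding exactly this intermediate.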