From: Terje Mathisen <terje.mathisen@tmsw.no>
Newsgroups: comp.arch
Subject: Re: Making Lemonade (Floating-point format changes)
Date: Mon, 20 May 2024 21:17:15 +0200
Organization: A noiseless patient Spider
Message-ID: <v2g7js$4vi9$1@dont-email.me>
References: <abe04jhkngt2uun1e7ict8vmf1fq8p7rnm@4ax.com> <memo.20240512203459.16164W@jgd.cix.co.uk> <v1rab7$2vt3u$1@dont-email.me> <20240513151647.0000403f@yahoo.com> <v1to2h$3km86$1@dont-email.me> <20240514221659.00001094@yahoo.com> <v234nr$12p27$1@dont-email.me> <20240516001628.00001031@yahoo.com> <v2cn4l$3bpov$1@dont-email.me> <v2d9sv$3fda0$1@dont-email.me> <20240519203403.00003e9b@yahoo.com> <v2etr0$3s9r0$1@dont-email.me> <20240520113045.000050c5@yahoo.com> <v2ff99$3vq7q$1@dont-email.me> <20240520153630.00000b5a@yahoo.com>
In-Reply-To: <20240520153630.00000b5a@yahoo.com>
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0 SeaMonkey/2.53.18.2

Michael S wrote:
> On Mon, 20 May 2024 14:22:00 +0200
> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>
>> Michael S wrote:
>>> On Mon, 20 May 2024 09:24:16 +0200
>>> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>>>
>>>> Michael S wrote:
>>>>> On Sun, 19 May 2024 18:37:51 +0200
>>>>> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>>>>>
>>>>>> Thomas Koenig wrote:
>>>>>>> So, I did some more measurements on the POWER9 machine, and it
>>>>>>> came to around 18 cycles per FMA. Compared to the 13 cycles for
>>>>>>> the FMA instruction, this actually sounds reasonable.
>>>>>>>
>>>>>>> The big problem appears to be that, in this particular
>>>>>>> implementation, multiplication is not pipelined, but done
>>>>>>> piecewise by addition. This can be explained by the fact that
>>>>>>> this is mostly a decimal unit, with the 128-bit QP just added as
>>>>>>> an afterthought, and decimal multiplication does not happen all
>>>>>>> that often.
>>>>>>>
>>>>>>> A fully pipelined FMA unit capable of 128-bit arithmetic would
>>>>>>> be an entirely different beast; I would expect a throughput of
>>>>>>> 1 per cycle and a latency of (maybe) one cycle more than 64-bit
>>>>>>> FMA.
>>>>>>
>>>>>> The FMA normalizer has to handle a maximally bad cancellation, so
>>>>>> it needs to be around 350 bits wide. Mitch knows of course, but
>>>>>> I'm guessing that this could at least be close to needing an
>>>>>> extra cycle on its own and/or heroic hardware?
>>>>>>
>>>>>> Terje
>>>>>>
>>>>>
>>>>> Why so wide?
>>>>> Assuming that subnormal multiplier inputs are normalized before
>>>>
>>>> They are not; this is part of what you do to make subnormal numbers
>>>> exactly the same speed as normal inputs.
>>>>
>>>> Terje
>>>>
>>>
>>> 1. I am not sure that "the same speed" is a worthy goal even for
>>> binary64 (for binary32 it is).
>>> 2. It certainly does not sound like a worthy goal for binary128,
>>> where the probability of encountering subnormal inputs in real user
>>> code, rather than in test vectors, is lower than for DP by another
>>> order of magnitude.
>>> 3. Even if, for reasons unclear to me, it is considered the goal, it
>>> can be achieved by introducing one more pipeline stage everywhere.
>>> Since we are discussing a high-latency design akin to POWER9, the
>>> relative cost of another stage would be lower. BTW, according to the
>>> POWER9 manual, even for SP/DP FMA the latency is not constant; it
>>> varies from 5 to 7.
>>>
>>> So, IMHO, what you do to handle subnormal inputs should depend on
>>> what ends up smaller or faster, not on some abstract principles.
>>> For a less important unit, like binary128, 'smaller' would likely
>>> take relative precedence over 'faster'. It's possible that you'll
>>> end up not doing pre-normalization, but the reason for it would be
>>> different from 'same speed'.
>>>
>>> Besides, pre-normalization vs. wider post-normalization are not the
>>> only available choices. When the multiplier is naturally segmented
>>> into 57-bit sections, there exists, for example, an option of
>>> pre-normalization by a full section. It looks very simple on the
>>> front and saves quite a lot of shifter width on the back.
>>>
>>> But the best option is probably the one described in the above post
>>> by Mitch. If I understood his post correctly, he suggests having two
>>> alignment stages: one after the multiplication and another one after
>>> the add/sub. The shift count for the first stage is calculated from
>>> the inputs in parallel with the multiplication. The first alignment
>>> stage does not try to achieve a perfect normalization, but it does
>>> enough to cut the width of the following adder from 3N to 2N+eps.
>>
>> I do agree with Mitch's suggestion: Allow subnormal inputs, but do
>> the partial muls from the top and move the normalization starting
>> point down for each all-zero input block.
>>
>> In an extreme case (subnormal x subnormal) this would allow you to
>> discard a lot of partial products.
>>
>> Terje
>>
>
> For subnormal x subnormal you don't need the result of the
> multiplication at all. All you need to know is whether it's zero or
> not, and its sign.
> Even that is needed only in non-default rounding modes and for the
> inexact flag in the default mode.

Yeah, mea culpa! I did correct that particular brain fart a few minutes
later in my subsequent post: the multiplication cannot deliver a result
far below the subnormal limit, since nothing is representable down
there. As you note, it is only when using RoundToPlus (or Minus)
Infinity that an arbitrarily small product can still produce a non-zero
result.

Terje
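As a quick illustration of that last point (an editor's sketch, not
from the original thread, and using binary64 rather than binary128
because it is easier to demo): the product of two subnormals lies far
below the subnormal range, so it rounds to zero in the default mode but
to the smallest subnormal under round-toward-+infinity, with the
inexact flag raised either way. The file name and build flags below are
illustrative.

/* Sketch: subnormal x subnormal under two rounding modes.
 * Build with something like: gcc -std=c11 -frounding-math demo.c -lm
 * (volatile is used to keep the compiler from constant-folding the
 * products under the default rounding mode).
 */
#include <stdio.h>
#include <float.h>
#include <fenv.h>

int main(void)
{
    volatile double a = 4.0 * DBL_TRUE_MIN;  /* a small positive subnormal */
    volatile double b = 2.0 * DBL_TRUE_MIN;  /* another one */

    feclearexcept(FE_ALL_EXCEPT);
    volatile double p_nearest = a * b;       /* default mode: rounds to 0.0 */
    int inexact_nearest = fetestexcept(FE_INEXACT) != 0;

    fesetround(FE_UPWARD);
    feclearexcept(FE_ALL_EXCEPT);
    volatile double p_upward = a * b;        /* rounds up to DBL_TRUE_MIN */
    int inexact_upward = fetestexcept(FE_INEXACT) != 0;
    fesetround(FE_TONEAREST);

    printf("round-to-nearest: %g (inexact=%d)\n", (double)p_nearest, inexact_nearest);
    printf("round-upward    : %g (inexact=%d)\n", (double)p_upward, inexact_upward);
    return 0;
}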
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"