Article <20240520113045.000050c5@yahoo.com>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Michael S <already5chosen@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Making Lemonade (Floating-point format changes)
Date: Mon, 20 May 2024 11:30:45 +0300
Organization: A noiseless patient Spider
Lines: 74
Message-ID: <20240520113045.000050c5@yahoo.com>
References: <abe04jhkngt2uun1e7ict8vmf1fq8p7rnm@4ax.com>
	<memo.20240512203459.16164W@jgd.cix.co.uk>
	<v1rab7$2vt3u$1@dont-email.me>
	<20240513151647.0000403f@yahoo.com>
	<v1to2h$3km86$1@dont-email.me>
	<20240514221659.00001094@yahoo.com>
	<v234nr$12p27$1@dont-email.me>
	<20240516001628.00001031@yahoo.com>
	<v2cn4l$3bpov$1@dont-email.me>
	<v2d9sv$3fda0$1@dont-email.me>
	<20240519203403.00003e9b@yahoo.com>
	<v2etr0$3s9r0$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 20 May 2024 10:30:36 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="7bfcfeaa9dd4fa5ca6e8a5579daf5a00";
	logging-data="4081932"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18mX+oiMMtk6t+ZhVDnLktTBcsKOEZLgoY="
Cancel-Lock: sha1:J7Ikpee0ab2WgsffHQLZuXMCwQk=
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
Bytes: 4758

On Mon, 20 May 2024 09:24:16 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

> Michael S wrote:
> > On Sun, 19 May 2024 18:37:51 +0200
> > Terje Mathisen <terje.mathisen@tmsw.no> wrote:
> >   
> >> Thomas Koenig wrote:  
> >>> So, I did some more measurements on the POWER9 machine, and it
> >>> came to around 18 cycles per FMA.  Compared to the 13 cycles for
> >>> the FMA instruction, this actually sounds reasonable.
> >>>
> >>> The big problem appears to be that, in this particular
> >>> implementation, multiplication is not pipelined, but done by
> >>> piecewise by addition.  This can be explained by the fact that
> >>> this is mostly a decimal unit, with the 128-bit QP just added as
> >>> an afterthought, and decimal multiplication does not happen all
> >>> that often.
> >>>
> >>> A fully pipelined FMA unit capable of 128-bit arithmetic would be
> >>> an entirely different beast, I would expect a throughput of 1 per
> >>> cycle and a latency of (maybe) one cycle more than 64-bit FMA.
> >>>      
> >> The FMA normalizer has to handle a maximally bad cancellation, so
> >> it needs to be around 350 bits wide. Mitch knows of course but I'm
> >> guessing that this could at least be close to needing an extra
> >> cycle on its own and/or heroic hardware?
> >>
> >> Terje
> >>  
> > 
> > Why so wide?
> > Assuming that subnormal multiplier inputs are normalized before  
> 
> They are not, this is part of what you do to make subnormal numbers 
> exactly the same speed as normal inputs.
> 
> Terje
> 

1. I am not sure that "the same speed" is a worthy goal even for
binary64 (for binary32 it is).
2. It certainly does not sound like a worthy goal for binary128,
where the probability of encountering sub-normal inputs in real user
code, rather than in test vectors, is lower than for DP by another
order of magnitude.
3. Even if, for reasons unclear to me, it is considered a goal, it can
be achieved by introducing one more pipeline stage everywhere.
Since we are discussing a high-latency design akin to POWER9, the
relative cost of another stage would be lower. BTW, according to the
POWER9 manual, even for SP/DP FMA the latency is not constant. It
varies from 5 to 7 cycles.

So, IMHO, what you do to handle sub-normal inputs should depend on what
ends up smaller or faster, not on some abstract principle. For a less
important unit, like binary128, 'smaller' would likely take relative
precedence over 'faster'. It's possible that you'll end up not doing
pre-normalization, but the reason for that would be something other
than 'same speed'.

Besides, pre-normalization and wider post-normalization are not the
only available choices. When the multiplier is naturally segmented into
57-bit sections, there is, for example, the option of pre-normalizing
by a full section. It looks very simple on the front end and saves
quite a lot of shifter width on the back end.
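
A minimal sketch of pre-normalization by full section, in Python for
brevity. It assumes the 113-bit binary128 significand sits in a 114-bit
field made of two 57-bit sections; the field layout and the function
names are illustrative assumptions, not taken from any real design.

```python
SECTION = 57          # section width mentioned above
FIELD = 2 * SECTION   # assumed 114-bit field holding the 113-bit significand

def leading_zeros(x, width):
    """Leading zero bits of x when viewed as a width-bit field."""
    return width - x.bit_length()

def prenormalize_by_section(sig):
    """Shift sig left by a whole number of sections (here 0 or 1).

    Returns (shifted significand, shift in bits); the caller would
    subtract the shift from the exponent.
    """
    if sig == 0:
        return 0, 0
    sections = leading_zeros(sig, FIELD) // SECTION
    shift = sections * SECTION
    return sig << shift, shift
```

In this sketch every nonzero operand comes out with fewer than 57
leading zeros, and the front end only needs a two-way choice per
operand rather than a full shifter.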

But the best option is probably the one described in Mitch's post
above. If I understood his post correctly, he suggests having two
alignment stages: one after the multiplication and another after the
add/sub. The shift count for the first stage is calculated from the
inputs in parallel with the multiplication. The first alignment stage
does not try to achieve perfect normalization, but it does enough to
cut the width of the following adder from 3N to 2N+eps.
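
A rough sketch of the first, coarse alignment stage, in Python. This is
an illustration under stated assumptions, not Mitch's actual design:
significands are N = 113 bits, the product lives in a 2N-bit field, and
the function names are hypothetical. The point is that the sum of the
inputs' leading-zero counts is known before the product is, and it is
within one bit of the product's own leading-zero count.

```python
N = 113  # binary128 significand width, hidden bit included

def leading_zeros(x, width):
    """Leading zero bits of x when viewed as a width-bit field."""
    return width - x.bit_length()

def coarse_align_product(a, b):
    """Multiply two nonzero N-bit significands, then shift coarsely.

    The shift count depends only on the inputs, so hardware could
    compute it while the multiplier is still running.
    """
    shift = leading_zeros(a, N) + leading_zeros(b, N)
    product = a * b
    return product << shift, shift
```

The aligned product still fits in 2N bits and has either zero or one
leading zero, so the normalization is deliberately imperfect. How that
translates into the 2N+eps adder width depends on the addend alignment,
which this sketch does not model.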