Article <15cba985b2d2443f4e5a06b6d050d623@www.novabbs.org>

Path: ...!weretis.net!feeder9.news.weretis.net!i2pn.org!i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Making Lemonade (Floating-point format changes)
Date: Sun, 19 May 2024 20:52:03 +0000
Organization: Rocksolid Light
Message-ID: <15cba985b2d2443f4e5a06b6d050d623@www.novabbs.org>
References: <abe04jhkngt2uun1e7ict8vmf1fq8p7rnm@4ax.com> <memo.20240512203459.16164W@jgd.cix.co.uk> <v1rab7$2vt3u$1@dont-email.me> <20240513151647.0000403f@yahoo.com> <v1to2h$3km86$1@dont-email.me> <20240514221659.00001094@yahoo.com> <v234nr$12p27$1@dont-email.me> <20240516001628.00001031@yahoo.com> <v2cn4l$3bpov$1@dont-email.me> <v2d9sv$3fda0$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
	logging-data="1614325"; mail-complaints-to="usenet@i2pn2.org";
	posting-account="65wTazMNTleAJDh/pRqmKE7ADni/0wesT78+pyiDW8A";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
X-Rslight-Site: $2y$10$jmrjyMkz9E3xMuDKuqJzdew5Eb5Ru1e.q/Nb5fpbNxcLvx3AeR3ze
X-Spam-Checker-Version: SpamAssassin 4.0.0
Bytes: 2573
Lines: 30

Terje Mathisen wrote:

> Thomas Koenig wrote:
>> So, I did some more measurements on the POWER9 machine, and it came
>> to around 18 cycles per FMA.  Compared to the 13 cycles for the
>> FMA instruction, this actually sounds reasonable.
>> 
>> The big problem appears to be that, in this particular
>> implementation, multiplication is not pipelined, but done
>> piecewise by addition.  This can be explained by the fact that
>> this is mostly a decimal unit, with the 128-bit QP just added as
>> an afterthought, and decimal multiplication does not happen all
>> that often.
>> 
>> A fully pipelined FMA unit capable of 128-bit arithmetic would be
>> an entirely different beast; I would expect a throughput of 1 per
>> cycle and a latency of (maybe) one cycle more than 64-bit FMA.
>> 
> The FMA normalizer has to handle a maximally bad cancellation, so it
> needs to be around 350 bits wide. Mitch knows of course but I'm guessing
> that this could at least be close to needing an extra cycle on its own
> and/or heroic hardware?
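
For scale, the "around 350 bits" figure can be checked with a little
arithmetic. This is only a sketch, and it assumes the textbook FMA
arrangement in which the augend is aligned somewhere inside a window
of 3p+2 bits (p = significand precision including the hidden bit).
The helper name fma_window_bits and the 2 extra positions are
assumptions for illustration, not a description of any particular
implementation.

```python
def fma_window_bits(p, extra=2):
    """Width of the alignment/normalization window for a p-bit FMA.

    p     -- significand precision in bits, hidden bit included
    extra -- spare positions between augend and product; 2 is the
             textbook value and is an assumption here
    """
    return 3 * p + extra


for name, p in (("binary64", 53), ("binary128", 113)):
    print(name, fma_window_bits(p))
# binary64 161
# binary128 341
```

So 128-bit FMA lands at roughly 340 bits before any implementation
slop, which is in line with the estimate quoted above.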

If you organize the multiplications and accumulations from most
significance towards least significance, this wide effect is pipelined
away, because you initialize the accumulation with the augend and check
for zero as multiplies fall out of the tree.
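
A toy software model of that ordering, purely as a sketch: it works
on plain integers, it is not the hardware, and the function name
msb_first_fma, the chunk_bits parameter and the trace format are all
made up for illustration. The accumulator starts out holding the
augend, chunks of the multiplier are consumed most significant first,
and after every step the work still to come can only add a bounded
amount, so the upper part of the result settles early.

```python
def msb_first_fma(a, b, c, chunk_bits=8):
    """Return (a*b + c, trace) using most-significant-first accumulation.

    a, b -- non-negative integer significands
    c    -- signed integer augend, assumed already aligned to the
            scale of the product (alignment is not modelled here)
    """
    if a < 0 or b < 0:
        raise ValueError("significands must be non-negative")
    n_chunks = max(1, -(-b.bit_length() // chunk_bits))  # ceiling division
    mask = (1 << chunk_bits) - 1

    acc = c  # the accumulation is initialized with the augend
    trace = []
    for i in reversed(range(n_chunks)):  # most significant chunk first
        shift = i * chunk_bits
        digit = (b >> shift) & mask
        acc += (a * digit) << shift
        # The chunks still to come can add at most a << shift in total.
        trace.append((shift, acc, a << shift))
    return acc, trace


# Near-total cancellation: the augend almost exactly cancels the product.
a = (1 << 112) | 12345
b = (1 << 112) | 67890
c = -(a * b) + 5
result, trace = msb_first_fma(a, b, c)
print(result)  # 5
```

In the trace, each entry gives (shift, partial sum, bound on what is
still to come), so the final result always lies between the partial
sum and the partial sum plus the bound. That is the property that lets
the cancellation be detected step by step instead of by one very wide
operation at the end.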

> Terje