From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Making Lemonade (Floating-point format changes)
Date: Sun, 19 May 2024 20:52:03 +0000

Terje Mathisen wrote:

> Thomas Koenig wrote:
>> So, I did some more measurements on the POWER9 machine, and it came
>> to around 18 cycles per FMA. Compared to the 13 cycles for the
>> FMA instruction, this actually sounds reasonable.
>>
>> The big problem appears to be that, in this particular
>> implementation, multiplication is not pipelined, but done by
>> piecewise by addition. This can be explained by the fact that
>> this is mostly a decimal unit, with the 128-bit QP just added as
>> an afterthought, and decimal multiplication does not happen all
>> that often.
>>
>> A fully pipelined FMA unit capable of 128-bit arithmetic would be
>> an entirely different beast, I would expect a throughput of 1 per
>> cycle and a latency of (maybe) one cycle more than 64-bit FMA.
>>
> The FMA normalizer has to handle a maximally bad cancellation, so it
> needs to be around 350 bits wide. Mitch knows of course but I'm guessing
> that this could at least be close to needing an extra cycle on its own
> and/or heroic hardware?
If you organize the multiplications and accumulations from most
significant towards least significant, this width effect is pipelined
away: you initialize the accumulator with the augend and check for
zero as the multiplies fall out of the tree.

> Terje
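
To make the ordering concrete, here is a rough software analogy in Python. It is only a sketch of the arithmetic, not a description of any real FMA datapath: significands are plain integers, the addend is assumed to be already aligned to the product's scale, and the function name `msb_first_fma`, the 16-bit slice width, and the trace format are all made up for illustration. It shows that if the accumulator starts out holding the augend and the partial products arrive most significant first, then after each step the slices still to come can only add less than `a << (i * limb)`, so the upper bits (and any cancellation in them) are known long before the last slice arrives.

```python
def msb_first_fma(a: int, b: int, c: int, width: int = 113, limb: int = 16):
    """Compute a*b + c by consuming slices of b from most to least significant.

    a, b : non-negative integer significands (b must fit in `width` bits)
    c    : addend, already aligned to the product's scale (may be negative)
    Returns (total, trace), where trace holds (slice_index, running_sum)
    after each slice has been accumulated.
    """
    acc = c                              # accumulator starts out holding the augend
    trace = []
    n_slices = -(-width // limb)         # ceil(width / limb)
    mask = (1 << limb) - 1
    for i in reversed(range(n_slices)):  # most significant slice first
        piece = (b >> (i * limb)) & mask
        acc += (a * piece) << (i * limb)
        # Everything still to come is a * (low i*limb bits of b),
        # which is strictly less than a << (i * limb).
        trace.append((i, acc))
    return acc, trace


if __name__ == "__main__":
    a = (1 << 112) | 12345
    b = (1 << 112) | 67890
    c = 1 - a * b                        # near-total cancellation
    total, trace = msb_first_fma(a, b, c)
    for i, running in trace:
        print(i, abs(running).bit_length())
    print("result:", total)
```

In the cancellation case above the running sum's magnitude shrinks as slices arrive, which is the software shadow of checking for zero while the multiplies fall out of the tree.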