Article <v2dg6i$3go87$1@dont-email.me>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Making Lemonade (Floating-point format changes)
Date: Sun, 19 May 2024 13:25:19 -0500
Organization: A noiseless patient Spider
Lines: 61
Message-ID: <v2dg6i$3go87$1@dont-email.me>
References: <abe04jhkngt2uun1e7ict8vmf1fq8p7rnm@4ax.com>
 <memo.20240512203459.16164W@jgd.cix.co.uk> <v1rab7$2vt3u$1@dont-email.me>
 <20240513151647.0000403f@yahoo.com> <v1to2h$3km86$1@dont-email.me>
 <20240514221659.00001094@yahoo.com> <v234nr$12p27$1@dont-email.me>
 <20240516001628.00001031@yahoo.com> <v2cn4l$3bpov$1@dont-email.me>
 <v2d9sv$3fda0$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 19 May 2024 20:25:22 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="2bca96f36ce19308043bc1aac1248b10";
	logging-data="3694855"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19lH46Rt42zjdUbJTvtK6yPpFDOXpDsnS0="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:YeGW1iOS+p3C+9+ltgouvSrbJeo=
Content-Language: en-US
In-Reply-To: <v2d9sv$3fda0$1@dont-email.me>
Bytes: 3883

On 5/19/2024 11:37 AM, Terje Mathisen wrote:
> Thomas Koenig wrote:
>> So, I did some more measurements on the POWER9 machine, and it came
>> to around 18 cycles per FMA.  Compared to the 13 cycles for the
>> FMA instruction, this actually sounds reasonable.
>>
>> The big problem appears to be that, in this particular
>> implementation, multiplication is not pipelined, but done by
>> piecewise by addition.  This can be explained by the fact that
>> this is mostly a decimal unit, with the 128-bit QP just added as
>> an afterthought, and decimal multiplication does not happen all
>> that often.
>>
>> A fully pipelined FMA unit capable of 128-bit arithmetic would be
>> an entirely different beast, I would expect a throughput of 1 per
>> cycle and a latency of (maybe) one cycle more than 64-bit FMA.
>>
> The FMA normalizer has to handle a maximally bad cancellation, so it 
> needs to be around 350 bits wide. Mitch knows of course but I'm guessing 
> that this could at least be close to needing an extra cycle on its own 
> and/or heroic hardware?
> 

This sort of thing is part of what makes proper FMA hopelessly 
expensive. Granted, full (single-rounded) FMA also allows faking higher 
precision using SIMD vector operations, with math that does not work 
with double-rounded FMA instructions.

Well, and it is also an issue if one can "just barely" afford a single 
double-precision unit.

Though, the trick of having four 27-bit multipliers which combine into 
a virtual 54-bit multiplier seems like an interesting possibility. It 
is not great, though, as DSPs don't natively handle this size (and it 
would be too expensive to stretch them out with LUTs). Likely, one 
would need to build it from 34*34->68 bit multipliers (each costing 4 DSPs).
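
As a rough software model of the four-multiplier combiner (a sketch 
only; u128_t and mul54_from_27 are made-up names, and real hardware 
would sum the partial products in an adder tree rather than with 64-bit 
words):

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit result as two 64-bit halves (hypothetical helper type). */
typedef struct { uint64_t hi, lo; } u128_t;

/* 54x54 -> 108 bit multiply built from four 27x27 -> 54 bit products. */
static u128_t mul54_from_27(uint64_t a, uint64_t b)
{
    const uint64_t MASK27 = (UINT64_C(1) << 27) - 1;
    assert((a >> 54) == 0 && (b >> 54) == 0);  /* inputs must fit 54 bits */

    uint64_t a0 = a & MASK27, a1 = a >> 27;
    uint64_t b0 = b & MASK27, b1 = b >> 27;

    uint64_t p00 = a0 * b0;              /* weight 2^0  */
    uint64_t mid = a0 * b1 + a1 * b0;    /* weight 2^27, sum < 2^55 */
    uint64_t p11 = a1 * b1;              /* weight 2^54 */

    /* result = p00 + (mid << 27) + (p11 << 54), split at bit 64 */
    u128_t r;
    uint64_t mid_lo = mid << 27, mid_hi = mid >> 37;
    uint64_t p11_lo = p11 << 54, p11_hi = p11 >> 10;
    uint64_t carry;

    r.lo   = p00 + mid_lo;
    carry  = (r.lo < p00);               /* carry out of first add  */
    r.lo  += p11_lo;
    carry += (r.lo < p11_lo);            /* carry out of second add */
    r.hi   = mid_hi + p11_hi + carry;
    return r;
}
```

For example, mul54_from_27(3, 5) gives hi = 0 and lo = 15. The same 
splitting works with wider pieces (such as 34-bit ones); only the 
shift amounts change.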

In terms of DSP cost, it would be higher than the current solution:
   16 (4 multipliers * 4 DSPs) vs 6+4 (10).
But, possibly lower LUT cost (in both the Binary32 and Binary64 
multipliers, the shortfall is made up using smaller LUT-based multipliers).

Though, with the combiner option, one could make a case for, say, a:
   S.E15.F66.Z46 format (Z=zeroed/ignored).

Well, and/or accept the wonk of a Binary128 which produces 112 bits of 
mantissa but only uses the high 66 bits or so. Generally, though, this 
was worse for some things in some tests than one which simply zeroes 
the low-order bits.
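
For reference, simply zeroing the low-order bits amounts to something 
like the following on the bit image. This is a sketch assuming the usual 
Binary128 layout split into two 64-bit words, and truncation rather than 
rounding; b128_bits, Z_BITS and zero_low_fraction are made-up names:

```c
#include <stdint.h>

/* Binary128 bit image as two 64-bit words (hypothetical helper type).
   hi: sign (1) | exponent (15) | fraction bits 111..64 (48)
   lo: fraction bits 63..0                                      */
typedef struct { uint64_t hi, lo; } b128_bits;

#define Z_BITS 46  /* ignored low-order fraction bits: 112 - 66 */

/* Keep S.E15.F66 and clear the Z46 part (truncates, no rounding). */
static b128_bits zero_low_fraction(b128_bits x)
{
    x.lo &= ~((UINT64_C(1) << Z_BITS) - 1);
    return x;
}
```

The 66 kept fraction bits are the 48 in the high word plus the top 18 
of the low word.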

But, OTOH, 66*66->112 would allow for possible trickery to fake a full 
Binary128 FMUL in software as a multi-part process (when combined with a 
Binary128 FADD).

....

> Terje
>