Article <20240519162333.00006023@yahoo.com>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Michael S <already5chosen@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Making Lemonade (Floating-point format changes)
Date: Sun, 19 May 2024 16:23:33 +0300
Organization: A noiseless patient Spider
Lines: 66
Message-ID: <20240519162333.00006023@yahoo.com>
References: <abe04jhkngt2uun1e7ict8vmf1fq8p7rnm@4ax.com>
	<memo.20240512203459.16164W@jgd.cix.co.uk>
	<v1rab7$2vt3u$1@dont-email.me>
	<20240513151647.0000403f@yahoo.com>
	<v1to2h$3km86$1@dont-email.me>
	<20240514221659.00001094@yahoo.com>
	<v234nr$12p27$1@dont-email.me>
	<20240516001628.00001031@yahoo.com>
	<v2cn4l$3bpov$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 19 May 2024 15:23:25 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="7ca739330e1d452bcaf3fa4a81da6824";
	logging-data="3515959"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+uNzLVmdezbM5utZCcERLmZi6gw9uwxm0="
Cancel-Lock: sha1:XEdmMqP/nOIGUoxFr5/EtuzvUf0=
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
Bytes: 4127

On Sun, 19 May 2024 11:17:41 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:

> So, I did some more measurements on the POWER9 machine, and it came
> to around 18 cycles per FMA.  Compared to the 13 cycles for the
> FMA instruction, this actually sounds reasonable.
> 

I.e. your actual running frequency was 3700 MHz?

> The big problem appears to be that, in this particular
> implementation, multiplication is not pipelined, but done
> piecewise by addition.  This can be explained by the fact that
> this is mostly a decimal unit, with the 128-bit QP just added as
> an afterthought, and decimal multiplication does not happen all
> that often.
> 
> A fully pipelined FMA unit capable of 128-bit arithmetic would be
> an entirely different beast, I would expect a throughput of 1 per
> cycle and a latency of (maybe) one cycle more than 64-bit FMA.


There exists a middle ground between non-pipelined and fully pipelined
multiplier/FMA units. In fact, more than one middle ground.
Here is the mid-middle ground that I can imagine, not being a real
hardware guy:
1 - take a pair of existing VSU multipliers. By now they can do
53x53=>106bit unsigned multiplication. Enhance them to 57x57=>114bit
2 - during quad-precision FMA split the 113x113 multiplication into 4
pieces and run them through the pair of multipliers, two at a time.
That would produce all parts of the 226-bit product at a rate of 1
product per 2 clocks
3 - build adders just sufficient for the same throughput of 1 result
per 2 clocks.
Such a combined multiplier will have 2 clocks higher latency than the DP
multiplier.
After that we'll need matching alignment and addition/subtraction
blocks, but by doing them half-pipelined we can utilize the majority of
the existing dual-DP hardware and would need very little else, except
for control signals and probably a new feedback data path on the upper
side of the adder. All that could cost us another clock of latency over
DP FMA, but not necessarily so.
Bottom line: QP FMA with throughput of 1 result per 2 clocks and
latency of 8 or 9 clocks.
For POWER8, which has a less distributed VSU, such a modification would
be somewhat easier than for POWER9.
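
The splitting in steps 1-2 above can be sketched in software. This is
only an illustration under my own assumptions, not a description of any
real POWER hardware: the names LIMB_BITS, split and mul113 are
hypothetical, and the choice of which two partial products share a clock
is arbitrary. It shows that a 113-bit significand cut into a 56-bit high
part and a 57-bit low part needs exactly four 57x57 multiplications, and
that summing the shifted pieces gives the full product.

```python
LIMB_BITS = 57   # input width of the enhanced multiplier (step 1)
SIG_BITS = 113   # quad-precision significand width, hidden bit included


def split(x):
    """Split a 113-bit significand into (high 56 bits, low 57 bits)."""
    assert 0 <= x < (1 << SIG_BITS)
    return x >> LIMB_BITS, x & ((1 << LIMB_BITS) - 1)


def mul113(a, b):
    """Multiply two 113-bit significands using four 57x57 partial products."""
    a_hi, a_lo = split(a)
    b_hi, b_lo = split(b)

    # Assumed schedule: clock 1 issues two pieces on the two multipliers.
    p_ll = a_lo * b_lo
    p_hh = a_hi * b_hi
    # Clock 2 issues the two cross terms.
    p_lh = a_lo * b_hi
    p_hl = a_hi * b_lo

    # Every piece fits in the 114-bit output of a 57x57 multiplier.
    for piece in (p_ll, p_hh, p_lh, p_hl):
        assert piece < (1 << (2 * LIMB_BITS))

    # The adders of step 3: sum the pieces at their proper bit offsets.
    return p_ll + ((p_lh + p_hl) << LIMB_BITS) + (p_hh << (2 * LIMB_BITS))
```

With two multipliers and four pieces, the multiplier pair is busy for 2
clocks per product, which is where the "1 result per 2 clocks" figure
comes from.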


That's what I call a mid-middle ground.
Low-middle ground would be leaving the 53x53=>106bit multipliers
unmodified. The 113x113 multiplication is split into 9 pieces and a
product is delivered every 5 clocks.
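
The piece and clock counts can be checked with a few lines of
arithmetic. This is a rough sketch under my own assumptions: the
function name qp_multiply_clocks is hypothetical, and the model counts
only multiplier issue slots (each multiplier accepts one piece per
clock), ignoring the adders and any control overhead.

```python
from math import ceil


def qp_multiply_clocks(limb_bits, multipliers=2, sig_bits=113):
    """Return (pieces, clocks) for one sig_bits x sig_bits product.

    Assumes each operand is cut into limbs of at most limb_bits bits and
    that every multiplier starts one partial product per clock.
    """
    limbs = ceil(sig_bits / limb_bits)    # limbs per operand
    pieces = limbs * limbs                # partial products needed
    clocks = ceil(pieces / multipliers)   # issue slots on the multipliers
    return pieces, clocks


# Unmodified 53-bit multipliers: 3 limbs per operand, 9 pieces, 5 clocks.
low_middle = qp_multiply_clocks(53)
# Multipliers enhanced to 57 bits: 2 limbs per operand, 4 pieces, 2 clocks.
mid_middle = qp_multiply_clocks(57)
```

The jump from 4 to 9 pieces happens because 113 does not fit into two
53-bit limbs, so each operand needs a third, nearly empty limb.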

High-middle ground is enhancing both VSU pipes and using them to
process two QP FMAs simultaneously, for a combined throughput equivalent
to that of a fully pipelined unit.

Another possible high-middle ground is, again, enhancing both VSU pipes
and using them together on a single QP FMA. That would potentially be
best for latency, but does not fit well into the philosophy of the
POWER9 design, which tries to minimize high-speed interaction between
various pipes.