Article <v234nr$12p27$1@dont-email.me>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Thomas Koenig <tkoenig@netcologne.de>
Newsgroups: comp.arch
Subject: Re: Making Lemonade (Floating-point format changes)
Date: Wed, 15 May 2024 20:08:27 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 29
Message-ID: <v234nr$12p27$1@dont-email.me>
References: <abe04jhkngt2uun1e7ict8vmf1fq8p7rnm@4ax.com>
 <memo.20240512203459.16164W@jgd.cix.co.uk> <v1rab7$2vt3u$1@dont-email.me>
 <20240513151647.0000403f@yahoo.com> <v1to2h$3km86$1@dont-email.me>
 <20240514221659.00001094@yahoo.com>
Injection-Date: Wed, 15 May 2024 22:08:28 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="1d1f0c878087e3f61d225c268691d60c";
	logging-data="1139783"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19l+gFcxCSvDcU4K42JQxHjBV0UI6NWgzI="
User-Agent: slrn/1.0.3 (Linux)
Cancel-Lock: sha1:PD5Zh0FkkUGvGFqWjy/V0TEBdPA=
Bytes: 2305

Michael S <already5chosen@yahoo.com> schrieb:

> IIRC, you reported something like 200 (or 300?) MFLOPS for your matrix
> multiplication benchmark running on a single POWER9 core.

Just reran the tests, it gave me somewhere around 405-410 MFlops
on a POWER9 machine running at 2.2 GHz (or so /proc/cpuinfo says).
This is with the standard gfortran matmul routine.

> I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
> GHz) using my plug-in replacements for gcc __multf3/__addtf3

Scaled to frequency, the hardware implementation on POWER is then
better by a factor of around four.  Not too bad, actually.
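
The "factor of around four" can be checked by dividing each figure by
its clock. This is only a rough sketch: it takes the midpoint of the
405-410 MFlops range, assumes the clocks really were 2.2 GHz and
3.6 GHz, and the helper name mflops_per_ghz is made up for
illustration.

```python
def mflops_per_ghz(mflops, ghz):
    """Normalize a throughput figure by the clock frequency."""
    return mflops / ghz

# Figures quoted in this thread; 407.5 is the midpoint of 405-410 (assumption).
power9 = mflops_per_ghz(407.5, 2.2)   # hardware binary128 on POWER9
epyc3 = mflops_per_ghz(150.0, 3.6)    # software __multf3/__addtf3 replacements

ratio = power9 / epyc3
print(f"POWER9: {power9:.1f} MFlops/GHz")
print(f"EPYC3:  {epyc3:.1f} MFlops/GHz")
print(f"ratio:  {ratio:.2f}")
```

That comes out at a bit above four, consistent with the estimate.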

[..]
>> I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
>> with one result per cycle, POWER10 has 12 to 13 cycles with two
>> results per cycle.
>
> So, the bottleneck is somewhere else. Maybe multiplication?

I messed up the name of the instruction. What I meant was xsmaddqp
(just trips off the tongue, doesn't it?), which on POWER9 actually
has a throughput of 1/13 per cycle, a big, fat instruction,
obviously.  On POWER10, this actually got worse, with throughput
dropping to 1/18 per cycle, with a latency of 25 cycles.  Hm,
apparently somebody didn't think it was all that important :-(
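
For a rough feel of what 1/13 per cycle means, here is a
back-of-the-envelope ceiling. It is only a sketch under stated
assumptions: each xsmaddqp counts as two flops (one multiply, one
add), a single pipe issues them, and the clock really is the 2.2 GHz
that /proc/cpuinfo reports. The function name peak_mflops is
hypothetical.

```python
def peak_mflops(clock_ghz, cycles_per_instr, flops_per_instr=2):
    """Throughput-bound ceiling in MFlops for a single instruction stream.

    Assumes one instruction completes every `cycles_per_instr` cycles
    and each counts as `flops_per_instr` floating-point operations.
    """
    return clock_ghz * 1000.0 / cycles_per_instr * flops_per_instr

# POWER9: xsmaddqp throughput of 1/13 per cycle, clock assumed to be 2.2 GHz.
p9_peak = peak_mflops(2.2, 13)
print(f"POWER9 ceiling: {p9_peak:.1f} MFlops")
```

Under these assumptions the ceiling lands somewhat below the measured
405-410 MFlops, so at least one of them presumably does not hold as
stated (the reported clock being the obvious suspect).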