Article <20240516001628.00001031@yahoo.com>

Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Michael S <already5chosen@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Making Lemonade (Floating-point format changes)
Date: Thu, 16 May 2024 00:16:28 +0300
Organization: A noiseless patient Spider
Lines: 76
Message-ID: <20240516001628.00001031@yahoo.com>
References: <abe04jhkngt2uun1e7ict8vmf1fq8p7rnm@4ax.com>
	<memo.20240512203459.16164W@jgd.cix.co.uk>
	<v1rab7$2vt3u$1@dont-email.me>
	<20240513151647.0000403f@yahoo.com>
	<v1to2h$3km86$1@dont-email.me>
	<20240514221659.00001094@yahoo.com>
	<v234nr$12p27$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 15 May 2024 23:16:33 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="5bb281324a08f451b7dc707065bf0b6d";
	logging-data="1114521"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/1+Cgq8faS10o+Pl3ZZqKkU2Pxsotrd6w="
Cancel-Lock: sha1:Ok3MWih62Ceb9MfAsC7Y0nwvxeA=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
Bytes: 4289

On Wed, 15 May 2024 20:08:27 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:

> Michael S <already5chosen@yahoo.com> schrieb:
> 
> > IIRC, you reported something like 200 (or 300?) MFLOPS for your
> > matrix multiplication benchmark running on a single POWER9 core.  
> 

Not too bad. Not too good, either.

> Just reran the tests, it gave me somewhere around 405-410 MFlops
> on a POWER9 machine running at 2.2 GHz (or so /proc/cpuinfo says).
> This is with the standard gfortran matmul routine.
> 

I don't think that nowadays /proc/cpuinfo has any relationship to
actual frequency. Most likely with a single core active even the
cheapest POWER9 SKU runs at 3.8 GHz.
If there is no ready-made utility, you can measure it yourself with a
latency-bound loop. Just don't forget that on POWER9 all simple integer
opcodes have latency=2.
If there are any difficulties, I can help.
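
A rough sketch of the arithmetic behind such a measurement, assuming the
timed loop is a chain of dependent integer operations (the loop itself
would have to be written in C or assembly so that nothing gets optimized
away). If n dependent operations with a known latency take t seconds,
the clock is roughly n * latency / t. The function name and the sample
numbers are hypothetical, not measurements.

```python
def estimate_clock_hz(dependent_ops: int, latency_cycles: int, elapsed_s: float) -> float:
    """Estimate core frequency from a timed latency-bound dependency chain.

    Assumes the loop is purely latency-bound: every operation waits for
    the result of the previous one, and nothing else limits the loop.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return dependent_ops * latency_cycles / elapsed_s


# Hypothetical numbers: 1e9 dependent adds with latency 2 (as stated for
# POWER9 simple integer opcodes) finishing in 0.5 s.
clock_hz = estimate_clock_hz(1_000_000_000, 2, 0.5)
print(f"{clock_hz / 1e9:.2f} GHz")
```

Forgetting the latency=2 factor would make the estimate come out at half
the real clock, which is why it matters here.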

> > I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
> > GHz) using my plug-in replacements for gcc __multf3/__addtf3  
> 
> Scaled to frequency, the hardware implementation on POWER is then
> better by a factor of around four.  Not too bad, actually.
>

If my guess about frequency is correct, then it's more like a factor of
2.6. Of that, a factor of approximately 1.3 has to be attributed to the
bad libgcc ABI.
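
The per-clock comparison can be checked with a few lines. This sketch
uses only figures from this thread: 410 MFLOPS on POWER9 (upper end of
the reported range), 150 MFLOPS on EPYC3 at 3.6 GHz, and either the
reported 2.2 GHz or the guessed 3.8 GHz for POWER9. The helper name is
made up for illustration, and the 1.3 ABI factor is the estimate given
above, not a measurement.

```python
def mflops_per_ghz(mflops: float, ghz: float) -> float:
    """Throughput normalized by clock frequency."""
    return mflops / ghz


epyc3 = mflops_per_ghz(150.0, 3.6)

# POWER9 at the frequency /proc/cpuinfo reports vs. the guessed one.
ratio_reported = mflops_per_ghz(410.0, 2.2) / epyc3
ratio_guessed = mflops_per_ghz(410.0, 3.8) / epyc3

# Remove the part attributed to the libgcc ABI (estimated at about 1.3x).
ratio_without_abi = ratio_guessed / 1.3

print(f"at 2.2 GHz: {ratio_reported:.2f}x")              # 4.47x
print(f"at 3.8 GHz: {ratio_guessed:.2f}x")               # 2.59x
print(f"at 3.8 GHz, ABI removed: {ratio_without_abi:.2f}x")  # 1.99x
```
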
[O.T.]
BTW, on ARM64 the libgcc ABI for __multf3/__addtf3 is similarly bad. The
only decent ABI for __multf3/__addtf3 that I encountered experimenting
on godbolt was for RV64. But that is little consolation, considering the
huge performance gap between the best RV64 and not even the best, but
just a competent iAMD64 or ARM64.
[/O.T.]

Anyway, performance per clock is of limited interest. What matters is
absolute performance (sometimes throughput, sometimes latency) and
performance per watt.
I would guess that, using SMT4, POWER9 can get over 80% of theoretical
throughput, but getting there would take multiplying either a really big
matrix or lots of medium ones.
On EPYC3, on the other hand, I don't expect measurable SMT gain. But
relative to POWER9, EPYC3 has more cores and much lower power
consumption per core.

> [..]
> >> I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
> >> with one result per cycle, POWER10 has 12 to 13 cycles with two
> >> results per cycle.  
> >
> > So, a bottleneck is somewhere else. May be, multiplication?  
> 
> I messed up the name of the instruction. What I meant was xsmaddqp
> (just trips off the tongue, doesn't it?), which on POWER9 actually
> has a throughput of 1/13 per cycle, a big, fat instruction,
> obviously.  On POWER10, this actually got worse, with performance
> dropping to 1/18 per cycle, with a latency of 25 cycles.  Hm,
> apparently somebody didn't think it was all that important,
> apparently :-(

Sounds like that.
Hopefully it's compensated by better power efficiency. And
unfortunately it's aggravated by lower cost-effectiveness. Or at least
that's what was claimed by a poster (luke.l ?) here.
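
For reference, a back-of-the-envelope peak for a single POWER9 core,
derived from the quoted xsmaddqp throughput of 1/13 per cycle. This is
only a sketch under assumptions not confirmed in the thread: that the
1/13 figure is the limit for the whole core, that each fused
multiply-add counts as 2 floating-point operations, and that the
benchmark counts flops the same way. The function name is hypothetical.

```python
def peak_mflops(clock_ghz: float, ops_per_cycle: float, flops_per_op: int = 2) -> float:
    """Peak MFLOPS for a unit issuing ops_per_cycle fused multiply-adds.

    Assumption: each multiply-add counts as flops_per_op (default 2) flops.
    """
    return clock_ghz * 1e3 * ops_per_cycle * flops_per_op


measured = 410.0  # MFLOPS, upper end of the figure reported in the thread

peak_at_guess = peak_mflops(3.8, 1 / 13)     # guessed single-core clock
peak_at_reported = peak_mflops(2.2, 1 / 13)  # clock from /proc/cpuinfo

print(f"peak at 3.8 GHz: {peak_at_guess:.1f} MFLOPS")     # 584.6
print(f"peak at 2.2 GHz: {peak_at_reported:.1f} MFLOPS")  # 338.5
print(f"measured / peak at 3.8 GHz: {measured / peak_at_guess:.0%}")  # 70%
```

Under these assumptions, the measured 405-410 MFLOPS would exceed the
peak computed for 2.2 GHz, which would fit the guess that the core
actually runs faster than /proc/cpuinfo says.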