Path: ...!feeds.phibee-telecom.net!3.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Michael S <already5chosen@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Making Lemonade (Floating-point format changes)
Date: Tue, 14 May 2024 22:19:25 +0300
Organization: A noiseless patient Spider
Lines: 74
Message-ID: <20240514221659.00001094@yahoo.com>
References: <abe04jhkngt2uun1e7ict8vmf1fq8p7rnm@4ax.com>
	<memo.20240512203459.16164W@jgd.cix.co.uk>
	<v1rab7$2vt3u$1@dont-email.me>
	<20240513151647.0000403f@yahoo.com>
	<v1to2h$3km86$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 14 May 2024 21:19:29 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="dcb64060f415e7fe23ed2e09ecf0fa94";
	logging-data="376727"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19bGwAiAAMhCjp4GkP97eP0+QV/bTpX6mY="
Cancel-Lock: sha1:wOiPbQ+Zp2tkIsI1PdV0M8XTIbo=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
Bytes: 4013

On Mon, 13 May 2024 19:01:37 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:

> Michael S <already5chosen@yahoo.com> schrieb:
> > On Sun, 12 May 2024 20:55:03 -0000 (UTC)
> > Thomas Koenig <tkoenig@netcologne.de> wrote:
> >  
> >> John Dallman <jgd@cix.co.uk> schrieb:  
> >> > In article <abe04jhkngt2uun1e7ict8vmf1fq8p7rnm@4ax.com>,
> >> > quadibloc@servername.invalid (John Savard) wrote:
> >> >    
> >> >> I'm not really sure such floating-point precision is useful, but
> >> >> I do remember some people telling me that higher float
> >> >> precision is indeed something to be desired.    
> >>   
> >> > I would be in favour of 128-bit being available.    
> >> 
> >> Me, too.  Solving tricky linear systems, or obtaining derivatives
> >> numerically (for example for Jacobians) eats up a _lot_ of
> >> precision bits, and double precision can sometimes run into
> >> trouble.
> >> 
> >> At least gcc and gfortran now support POWER's native 128-bit format
> >> in hardware.  On other systems, software emulation is used, which
> >> is of course much slower.
> >>   
> >
> > Much slower?
> > I think, at least for matrix multiplication, my emulation on modern
> > x86 was within a factor of 1.5x of your measurements on POWER9.  
> 
> I don't remember the exact timing, and it might be interesting to
> revisit that (also considering that the

IIRC, you reported something like 200 (or 300?) MFLOPS for your matrix
multiplication benchmark running on a single POWER9 core.

I got ~150 MFLOPS running on an EPYC3 at a relatively low frequency (3.6
GHz) using my plug-in replacements for gcc's __multf3/__addtf3, with the
level of support for FP exceptions and rounding modes that, according
to you, is sufficient for Fortran but, according to other GNU
maintainers, is insufficient for C. For matrix multiplication
implemented with vector APIs ('multiply vector by scalar' and 'add
vectors') on the same EPYC3 I got approximately 200 MFLOPS.
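
To make the 'vector API' formulation concrete, here is a minimal sketch of
a matrix multiply built only from 'multiply vector by scalar' and 'add
vectors'. It is an illustration, not the code that was benchmarked: the
names real_t, vec_mul_scalar, vec_add and matmul_vec are hypothetical, and
the fallback to long double where __float128 is unavailable is an
assumption made only so the sketch compiles everywhere.

```c
#include <stddef.h>

/* Assumption: use __float128 where the compiler provides it (gcc and clang
   define __SIZEOF_FLOAT128__ on such targets); otherwise fall back to
   long double so the sketch still builds. */
#ifdef __SIZEOF_FLOAT128__
typedef __float128 real_t;
#else
typedef long double real_t;
#endif

/* 'multiply vector by scalar': dst[i] = src[i] * s */
static void vec_mul_scalar(real_t *dst, const real_t *src, real_t s, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * s;
}

/* 'add vectors': acc[i] += x[i] */
static void vec_add(real_t *acc, const real_t *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        acc[i] += x[i];
}

/* C = A * B for row-major n x n matrices; tmp is scratch for n elements.
   Row i of C is accumulated as the sum over k of A[i][k] * (row k of B). */
static void matmul_vec(real_t *c, const real_t *a, const real_t *b,
                       real_t *tmp, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        real_t *crow = &c[i * n];
        for (size_t j = 0; j < n; j++)
            crow[j] = 0;
        for (size_t k = 0; k < n; k++) {
            vec_mul_scalar(tmp, &b[k * n], a[i * n + k], n);
            vec_add(crow, tmp, n);
        }
    }
}
```

The point of this shape is that each emulated operation is called in a
long loop over independent elements, so per-call overhead can be
amortized inside the vector routine instead of being paid on every
scalar __multf3/__addtf3 call.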

> gfortran code for matmul is
> not optimized for 128-bit float and might have blown cache sizes,

That's possible, but unlikely to make a major impact.
At 200 MFLOPS even L3 cache is not a bottleneck. And it's actually hard
to code matrix multiplication so poorly that at least half of the data
wouldn't come from L1D/L2. I took a look at the GFortran sources for
matmul; they are not that bad.
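
A back-of-envelope check of the bandwidth claim, as a sketch under
deliberately pessimistic assumptions that are not measurements: every
multiply-add loads two fresh 16-byte binary128 operands and nothing is
reused from registers. The function name worst_case_gbytes_per_sec is
hypothetical.

```c
/* Worst-case operand traffic in GB/s for a given MFLOPS rate.
   Assumptions (illustrative only): each multiply-add pair loads two
   16-byte operands and counts as 2 floating-point operations. */
static double worst_case_gbytes_per_sec(double mflops)
{
    const double bytes_per_operand = 16.0; /* binary128 */
    const double operands_per_pair = 2.0;  /* one element of A, one of B */
    const double flops_per_pair    = 2.0;  /* one multiply + one add */
    double bytes_per_flop =
        bytes_per_operand * operands_per_pair / flops_per_pair;
    return mflops * 1e6 * bytes_per_flop / 1e9;
}
```

Under these assumptions 200 MFLOPS needs about 3.2 GB/s of operand
traffic, a rate that the L3 of a modern x86 core should be able to supply
with room to spare.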

> plus it would be fair to compare compiler vs. compiler and assembler
> vs. assembler).
>

My routines are implemented in C and compiled with gcc.

> I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
> with one result per cycle, POWER10 has 12 to 13 cycles with two
> results per cycle.

So the bottleneck is somewhere else. Maybe multiplication?

> 
> What can your code get on x86_64?

See above.