From: Thomas Koenig <tkoenig@netcologne.de>
Newsgroups: comp.arch
Subject: Re: Making Lemonade (Floating-point format changes)
Date: Wed, 15 May 2024 20:08:27 -0000 (UTC)
Organization: A noiseless patient Spider
Message-ID: <v234nr$12p27$1@dont-email.me>
References: <abe04jhkngt2uun1e7ict8vmf1fq8p7rnm@4ax.com> <memo.20240512203459.16164W@jgd.cix.co.uk> <v1rab7$2vt3u$1@dont-email.me> <20240513151647.0000403f@yahoo.com> <v1to2h$3km86$1@dont-email.me> <20240514221659.00001094@yahoo.com>
User-Agent: slrn/1.0.3 (Linux)

Michael S <already5chosen@yahoo.com> wrote:
> IIRC, you reported something like 200 (or 300?) MFLOPS for your matrix
> multiplication benchmark running on a single POWER9 core.

I just reran the tests; they gave me somewhere around 405-410 MFLOPS
on a POWER9 machine running at 2.2 GHz (or so /proc/cpuinfo says).
This is with the standard gfortran matmul routine.

> I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
> GHz) using my plug-in replacements for gcc __multf3/__addtf3

Scaled to frequency, the hardware implementation on POWER is then
better by a factor of around four.  Not too bad, actually.

[..]

>> I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
>> with one result per cycle, POWER10 has 12 to 13 cycles with two
>> results per cycle.
>
> So, a bottleneck is somewhere else. May be, multiplication?

I messed up the name of the instruction.
What I meant was xsmaddqp (just trips off the tongue, doesn't it?),
which on POWER9 actually has a throughput of 1/13 per cycle.  A big,
fat instruction, obviously.  On POWER10, this actually got worse,
with throughput dropping to 1/18 per cycle and a latency of 25
cycles.  Hm, apparently somebody didn't think it was all that
important :-(
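
For anyone who wants to check the "scaled to frequency" comparison, here is a minimal sketch of the arithmetic. It only uses the figures quoted in this post (405-410 MFLOPS at 2.2 GHz on POWER9, about 150 MFLOPS at 3.6 GHz on EPYC3). The function and constant names are made up for illustration, and the clock values are taken at face value, which may not match the actual running frequency.

```python
# Figures quoted in the post; nothing here is newly measured.
POWER9_MFLOPS = (405.0, 410.0)  # gfortran matmul, hardware quad precision
POWER9_GHZ = 2.2                # as reported by /proc/cpuinfo
EPYC3_MFLOPS = 150.0            # software __multf3/__addtf3 replacements
EPYC3_GHZ = 3.6


def flops_per_cycle(mflops, ghz):
    """Return floating-point operations per clock cycle (MFLOPS / MHz)."""
    return mflops / (ghz * 1000.0)


def per_cycle_ratio():
    """Return (low, high) POWER9-to-EPYC3 ratio of flops per cycle."""
    epyc = flops_per_cycle(EPYC3_MFLOPS, EPYC3_GHZ)
    return tuple(flops_per_cycle(m, POWER9_GHZ) / epyc for m in POWER9_MFLOPS)


if __name__ == "__main__":
    low, high = per_cycle_ratio()
    print(f"POWER9 per-cycle advantage: {low:.2f}x to {high:.2f}x")
```

With these inputs the ratio comes out at roughly 4.4 to 4.5, which is consistent with "a factor of around four".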