From: Michael S <already5chosen@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Making Lemonade (Floating-point format changes)
Date: Thu, 16 May 2024 00:16:28 +0300
Organization: A noiseless patient Spider
Message-ID: <20240516001628.00001031@yahoo.com>
References: <abe04jhkngt2uun1e7ict8vmf1fq8p7rnm@4ax.com>
 <memo.20240512203459.16164W@jgd.cix.co.uk>
 <v1rab7$2vt3u$1@dont-email.me>
 <20240513151647.0000403f@yahoo.com>
 <v1to2h$3km86$1@dont-email.me>
 <20240514221659.00001094@yahoo.com>
 <v234nr$12p27$1@dont-email.me>

On Wed, 15 May 2024 20:08:27 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:

> Michael S <already5chosen@yahoo.com> wrote:
>
> > IIRC, you reported something like 200 (or 300?) MFLOPS for your
> > matrix multiplication benchmark running on a single POWER9 core.
> Not too bad. Not too good, either.
> Just reran the tests, it gave me somewhere around 405-410 MFlops
> on a POWER9 machine running at 2.2 GHz (or so /proc/cpuinfo says).
> This is with the standard gfortran matmul routine.
>

I don't think that /proc/cpuinfo nowadays bears any relationship to
the actual frequency. Most likely, with a single core active, even the
cheapest POWER9 SKU runs at 3.8 GHz.
If there is no ready-made utility, you can measure it yourself with a
latency-bound loop. Just don't forget that on POWER9 all simple integer
opcodes have a latency of 2 cycles.
If there are any difficulties, I can help.

> > I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
> > GHz) using my plug-in replacements for gcc __multf3/__addtf3
>
> Scaled to frequency, the hardware implementation on POWER is then
> better by a factor of around four. Not too bad, actually.
>

If my guess about the frequency is correct, then it's more like a
factor of 2.6, of which a factor of approximately 1.3 has to be
attributed to the bad libgcc ABI.

[O.T.]
BTW, on ARM64 the libgcc ABI for __multf3/__addtf3 is similarly bad.
The only decent ABI for __multf3/__addtf3 that I encountered while
experimenting on godbolt was for RV64. But that's little consolation
considering the huge performance gap between the best RV64 and not even
the best, but just a competent, iAMD64 or ARM64.
[/O.T.]

Anyway, performance per clock is of limited interest. What matters is
absolute performance (sometimes throughput, sometimes latency) and
performance per watt.
I would guess that, using SMT4, POWER9 can get over 80% of theoretical
throughput, but getting there would take multiplying either a really
big matrix or lots of medium ones.
On EPYC3, on the other hand, I don't expect a measurable SMT gain. But
relative to POWER9, EPYC3 has more cores and much lower power
consumption per core.

> [..]
> >> I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
> >> with one result per cycle, POWER10 has 12 to 13 cycles with two
> >> results per cycle.
> >
> > So, a bottleneck is somewhere else. May be, multiplication?
>
> I messed up the name of the instruction. What I meant was xsmaddqp
> (just trips off the tongue, doesn't it?), which on POWER9 actually
> has a throughput of 1/13 per cycle, a big, fat instruction,
> obviously. On POWER10, this actually got worse, with performance
> dropping to 1/18 per cycle, with a latency of 25 cycles. Hm,
> apparently somebody didn't think it was all that important,
> apparently :-(

Sounds like it. Hopefully it's compensated by better power efficiency.
And unfortunately it's aggravated by lower cost-effectiveness. Or at
least that's what was claimed by a poster (luke.l ?) here.
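
P.S. For the frequency measurement suggested above, here is a minimal
sketch of a latency-bound loop, not a polished utility. It assumes
GCC or Clang (for the inline-asm barrier and the noinline attribute)
and a POSIX clock_gettime(). The names dependent_chain, measure_core_hz
and OPS_PER_ITER are made up for this illustration. Each add depends on
the previous one, so run time is roughly (number of adds) x (add
latency in cycles) / frequency.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define OPS_PER_ITER 8  /* dependent adds per loop iteration */

/* One dependent add. The empty asm (a GCC/Clang extension) makes the
   compiler treat x as modified, so consecutive adds cannot be folded
   into a single "x += 8". */
#define DEP_ADD(x) do { (x) += 1; __asm__ volatile("" : "+r"(x)); } while (0)

/* Runs iters * OPS_PER_ITER adds, each waiting on the previous result.
   The loop counter is off the critical path, so the chain of adds
   should dominate the run time. */
__attribute__((noinline))
uint64_t dependent_chain(uint64_t x, uint64_t iters)
{
    for (uint64_t i = 0; i < iters; i++) {
        DEP_ADD(x); DEP_ADD(x); DEP_ADD(x); DEP_ADD(x);
        DEP_ADD(x); DEP_ADD(x); DEP_ADD(x); DEP_ADD(x);
    }
    return x;
}

/* Estimates the core clock in Hz. latency_cycles is the latency of a
   simple integer add on the target core and has to be supplied by the
   caller; it is an input assumption, not something this code measures. */
double measure_core_hz(uint64_t iters, double latency_cycles)
{
    struct timespec t0, t1;
    assert(iters > 0 && latency_cycles > 0.0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    volatile uint64_t sink = dependent_chain(0, iters);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;  /* keep the result alive */

    double secs = (double)(t1.tv_sec - t0.tv_sec)
                + (double)(t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return (double)iters * OPS_PER_ITER * latency_cycles / secs;
}
```

Compile with optimization (e.g. -O2), otherwise x goes through memory
on every add and the result is meaningless. On POWER9 one would call
something like measure_core_hz(1000000000, 2.0), per the latency
mentioned above; on most current x86 cores the second argument would be
1.0. Pinning the process to one core and running for a second or so
should give the clock time to ramp up.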