Subject : Re: Making Lemonade (Floating-point format changes)
From : tkoenig (at) *nospam* netcologne.de (Thomas Koenig)
Newsgroups : comp.arch
Date : 15. May 2024, 21:08:27
Organization : A noiseless patient Spider
Message-ID : <v234nr$12p27$1@dont-email.me>
References : 1 2 3 4 5 6
User-Agent : slrn/1.0.3 (Linux)
Michael S <already5chosen@yahoo.com> wrote:
> IIRC, you reported something like 200 (or 300?) MFLOPS for your matrix
> multiplication benchmark running on a single POWER9 core.
I just reran the tests; they gave me somewhere around 405-410 MFLOPS
on a POWER9 machine running at 2.2 GHz (or so /proc/cpuinfo says).
This is with the standard gfortran matmul routine.
> I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
> GHz) using my plug-in replacements for gcc __multf3/__addtf3
Scaled to frequency, the hardware implementation on POWER is then
better by a factor of around four. Not too bad, actually.
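
The "factor of around four" can be checked with a few lines of arithmetic. This is only a sketch using the numbers quoted above; taking 407.5 as the midpoint of the 405-410 range is an assumption, and so is trusting both reported clock rates.

```python
# Figures quoted in the post. The midpoint of 405-410 is an assumption.
power9_mflops = 407.5
power9_ghz = 2.2
epyc_mflops = 150.0
epyc_ghz = 3.6


def flops_per_cycle(mflops, ghz):
    """Convert MFLOPS at a given clock (GHz) to floating-point ops per cycle."""
    return mflops / (ghz * 1000.0)


p9 = flops_per_cycle(power9_mflops, power9_ghz)
epyc = flops_per_cycle(epyc_mflops, epyc_ghz)
ratio = p9 / epyc

print(f"POWER9: {p9:.3f} flops/cycle")   # 0.185
print(f"EPYC:   {epyc:.3f} flops/cycle")  # 0.042
print(f"ratio:  {ratio:.1f}")             # 4.4
```

So per clock cycle the hardware implementation comes out roughly 4.4 times ahead of the software routines, given these inputs.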
[..]
>> I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
>> with one result per cycle, POWER10 has 12 to 13 cycles with two
>> results per cycle.
>
> So, a bottleneck is somewhere else. Maybe multiplication?
I messed up the name of the instruction. What I meant was xsmaddqp
(just trips off the tongue, doesn't it?), which on POWER9 actually
has a throughput of 1/13 per cycle. A big, fat instruction,
obviously. On POWER10, this actually got worse, with throughput
dropping to 1/18 per cycle and a latency of 25 cycles. Hm,
apparently somebody didn't think it was all that important :-(
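
To put the xsmaddqp figures side by side, here is a rough sketch. The throughputs (1/13 and 1/18 per cycle) are the ones stated above; everything else is an assumption made for illustration: that one fused multiply-add counts as two floating-point operations, that a core has a single unit issuing the instruction, and that the clock is the 2.2 GHz reported by /proc/cpuinfo. The helper name peak_mflops is hypothetical.

```python
from fractions import Fraction

# xsmaddqp throughput in instructions per cycle, as stated in the post.
throughput = {"POWER9": Fraction(1, 13), "POWER10": Fraction(1, 18)}


def peak_mflops(per_cycle, ghz, flops_per_insn=2, units=1):
    """Upper bound on MFLOPS from instruction throughput alone.

    Assumptions (not from the post): an FMA counts as 2 flops and
    `units` identical execution units can issue the instruction.
    """
    return float(per_cycle) * flops_per_insn * units * ghz * 1000.0


# POWER10 throughput relative to POWER9, per cycle: (1/18) / (1/13) = 13/18.
relative = throughput["POWER10"] / throughput["POWER9"]
p9_bound = peak_mflops(throughput["POWER9"], ghz=2.2)

print(f"POWER10 / POWER9 per-cycle throughput: {float(relative):.2f}")  # 0.72
print(f"POWER9 bound at 2.2 GHz: {p9_bound:.0f} MFLOPS")                # 338
```

Per cycle, POWER10 manages about 72% of the POWER9 rate for this instruction. Under the assumptions listed, the POWER9 bound at 2.2 GHz is about 338 MFLOPS, which is below the 405-410 MFLOPS measured, so the real clock or the number of units is presumably different from what is assumed here.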