Subject : Re: Making Lemonade (Floating-point format changes)
From : already5chosen (at) *nospam* yahoo.com (Michael S)
Newsgroups : comp.arch
Date : 15. May 2024, 22:16:28
Organisation : A noiseless patient Spider
Message-ID : <20240516001628.00001031@yahoo.com>
References : 1 2 3 4 5 6 7
User-Agent : Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
On Wed, 15 May 2024 20:08:27 -0000 (UTC)
Thomas Koenig <
tkoenig@netcologne.de> wrote:
Michael S <already5chosen@yahoo.com> schrieb:
IIRC, you reported something like 200 (or 300?) MFLOPS for your
matrix multiplication benchmark running on a single POWER9 core.
Not too bad. Not too good, either.
Just reran the tests, it gave me somewhere around 405-410 MFlops
on a POWER9 machine running at 2.2 GHz (or so /proc/cpuinfo says).
This is with the standard gfortran matmul routine.
I don't think that nowadays /proc/cpuinfo has any relationship to the
actual frequency. Most likely, with a single core active, even the
cheapest POWER9 SKU runs at 3.8 GHz.
If there is no ready-made utility, you can measure it yourself with a
latency-bound loop. Just don't forget that on POWER9 all simple integer
opcodes have a latency of 2.
If there are any difficulties, I can help.
I got ~150 MFLOPS running on EPYC3 at relatively low frequency (3.6
GHz) using my plug-in replacements for gcc __multf3/__addtf3
Scaled to frequency, the hardware implementation on POWER is then
better by a factor of around four. Not too bad, actually.
If my guess about the frequency is correct, then it's more like a factor
of 2.6. Of that, a factor of approximately 1.3 has to be attributed to
the bad libgcc ABI.
[O.T.]
BTW, on ARM64 the libgcc ABI for __multf3/__addtf3 is similarly bad. The
only decent ABI for __multf3/__addtf3 that I encountered while
experimenting on godbolt was for RV64. But that's little consolation
considering the huge performance gap between the best RV64 and a not
even best, merely competent AMD64 or ARM64.
[/O.T.]
Anyway, performance per clock is of limited interest. What matters is
absolute performance (sometimes throughput, sometimes latency) and
performance per watt.
I would guess that using SMT4 POWER9 can get over 80% of its theoretical
throughput, but getting there would take either multiplying a really big
matrix or lots of medium-sized ones.
On EPYC3, on the other hand, I don't expect a measurable SMT gain. But
relative to POWER9, EPYC3 has more cores and much lower power
consumption per core.
[..]
I just looked it up - on POWER9, xsaddqp has 12 cycles of latency,
with one result per cycle, POWER10 has 12 to 13 cycles with two
results per cycle.
So the bottleneck is somewhere else. Maybe multiplication?
I messed up the name of the instruction. What I meant was xsmaddqp
(just trips off the tongue, doesn't it?), which on POWER9 actually
has a throughput of 1/13 per cycle; a big, fat instruction,
obviously. On POWER10, this actually got worse, with performance
dropping to 1/18 per cycle, with a latency of 25 cycles. Hm,
apparently somebody didn't think it was all that important :-(
Sounds like that.
Hopefully it's compensated for by better power efficiency. And
unfortunately it's aggravated by lower cost-effectiveness. Or at least
that's what was claimed by a poster (luke.l ?) here.