Michael S <already5chosen@yahoo.com> writes:
>On Sat, 08 Feb 2025 08:11:04 GMT
>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>Or by my own pasting mistake. I am still not sure whom to blame.
>The mistake was tiny - the absence of // at the beginning of one
>line, but enough to keep it from compiling. Trying it for a second
>time:
Now it's worse: it's quoted-printable. E.g.:
if (li >=3D len || li <=3D 0)
Some newsreaders can decode this, mine does not.
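For anyone whose software shows the escapes raw: quoted-printable
encodes '=' as '=3D' and marks soft line breaks with a trailing '=',
and decoding those two cases is all it takes here. A minimal sketch
(an illustration only, not a full RFC 2045 decoder):

#include <stdio.h>
#include <ctype.h>

/* Minimal quoted-printable decoder sketch: handles "=XX" hex escapes
   (e.g. "=3D" -> '=') and "=\n" soft line breaks; everything else is
   passed through unchanged. */
static int hexval(int c)
{
  return isdigit(c) ? c - '0' : toupper(c) - 'A' + 10;
}

int main(void)
{
  int c;
  while ((c = getchar()) != EOF) {
    if (c != '=') {
      putchar(c);
      continue;
    }
    int h1 = getchar();
    if (h1 == '\n' || h1 == EOF)      /* soft line break: drop it */
      continue;
    int h2 = getchar();
    if (isxdigit(h1) && h2 != EOF && isxdigit(h2))
      putchar(hexval(h1) * 16 + hexval(h2));
    else {                            /* not a valid escape: pass through */
      putchar('=');
      putchar(h1);
      if (h2 != EOF) putchar(h2);
    }
  }
  return 0;
}

Piping the garbled text through that recovers the line as
if (li >= len || li <= 0).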
>>First cycles (which eliminates worries about turbo modes) and
>>instructions, then usec/call.
>
>I don't understand that.
>For original code optimized by clang I'd expect 22,000 cycles and
>5.15 usec per call on Haswell. Your numbers don't even resemble
>anything like that.
My cycle numbers are for the whole program that calls keylocks()
100_000 times.
If you divide the cycles by 100000, you get 21954 for clang
keylocks1-256, which is what you expect.
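In case the shape of the measurement is unclear, here is a minimal
sketch of such a harness; keylocks() is replaced by a dummy stand-in,
and __rdtsc() only gives reference cycles, so the per-call core-cycle
numbers above would really come from the performance counters rather
than from this:

#include <stdio.h>
#include <x86intrin.h>   /* __rdtsc() */

#define CALLS 100000

/* Stand-in for the real keylocks() from earlier in the thread; any
   function with a stable result works for showing the harness. */
static long keylocks(void)
{
  volatile long x = 0;
  for (int i = 0; i < 1000; i++)
    x += i;
  return x;
}

int main(void)
{
  long result = 0;
  unsigned long long start = __rdtsc();
  for (int i = 0; i < CALLS; i++)
    result += keylocks();
  unsigned long long tsc = __rdtsc() - start;
  /* The TSC counts reference cycles, not core clock cycles; whatever
     counter you read, divide it by CALLS the same way to get the
     per-call figure (e.g. 2_195_400_000/100_000 = 21954). */
  printf("%llu reference cycles total, %llu per call (result %ld)\n",
         tsc, tsc / CALLS, result);
  return 0;
}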
>>             instructions
>>5_779_542_242  gcc   avx2  1
>>3_484_942_148  gcc   avx2  2   8
>>5_885_742_164  gcc   avx2  3   8
>>7_903_138_230  clang avx2  1
>>7_743_938_183  clang avx2  2   8?
>>3_625_338_104  clang avx2  3   8?
>>4_204_442_194  gcc   512   1
>>2_564_142_161  gcc   512   2   32
>>3_061_042_178  gcc   512   3   16
>>7_703_938_205  clang 512   1
>>3_402_238_102  clang 512   2   16?
>>3_320_455_741  clang 512   3   16?
>
>I don't understand these numbers either. For original clang, I'd
>expect 25,000 instructions per call.
clang keylocks1-256 performs 79031 instructions per call (divide the
number given by 100000 calls). If you want to see why that is, you
need to analyse the code produced by clang, which I did only for
select cases.
>Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
>Which could be due to differences in measurement methodology - I
>reported the median of 11 runs, you seem to report an average.
I just report one run with 100_000 calls and hope that the variation
is small:-) In my last refereed paper I used 30 runs and the median,
but I don't go to such lengths here; the cycles seem pretty
repeatable.
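Going from one run to a median would only take a few more lines; a
sketch with wall-clock time and 11 runs (the benchmark body here is a
placeholder, not the actual keylocks loop):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define RUNS 11

/* Placeholder benchmark body; in the real setup this would be the
   100_000-call loop around keylocks(). */
static void run_benchmark(void)
{
  volatile long x = 0;
  for (long i = 0; i < 10000000; i++)
    x += i;
}

static int cmp_double(const void *a, const void *b)
{
  double d = *(const double *)a - *(const double *)b;
  return (d > 0) - (d < 0);
}

int main(void)
{
  double t[RUNS];
  for (int r = 0; r < RUNS; r++) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    run_benchmark();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    t[r] = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  }
  qsort(t, RUNS, sizeof t[0], cmp_double);
  printf("median of %d runs: %.6f s (min %.6f, max %.6f)\n",
         RUNS, t[RUNS / 2], t[0], t[RUNS - 1]);
  return 0;
}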
>>On the Golden Cove of a Core i3-1315U (compared to the best result
>>by Terje Mathisen on a Core i7-1365U; the latter can run up to
>>5.2GHz according to Intel, whereas the former can supposedly run
>>up to 4.5GHz; I only ever measured at most 3.8GHz on our NUC, and
>>this time as well):
>
>I always thought that NUCs have better cooling than all but high-end
>laptops. Was I wrong? Such slowness is disappointing.
The cooling may be better or not; that does not come into play here,
as the CPU never reaches higher clocks even when it's cold. The
E-cores also stay 700MHz below their rated turbo speed, even when one
of them is the only loaded core. One theory I have is that an option
we set in the BIOS has the effect of limiting turbo speed, but it has
not been important enough to test.
>>5.25us Terje Mathisen's Rust code compiled by clang (best on the
>>       1365U)
>>4.93us clang keylocks1-256 on a 3.8GHz 1315U
>>4.17us gcc keylocks1-256 on a 3.8GHz 1315U
>>3.16us gcc keylocks2-256 on a 3.8GHz 1315U
>>2.38us clang keylocks2-512 on a 3.8GHz 1315U
>
>So, for the best-performing variant, the IPC of Golden Cove is
>identical to ancient Haswell?
Actually worse:
For clang keylocks2-512 Haswell has 3.73 IPC, Golden Cove 3.63.
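(IPC here is simply instructions divided by cycles from the same run;
e.g., taking the per-call numbers quoted above for clang
keylocks1-256, 79031 instructions over 21954 cycles is about 3.6,
assuming both counts come from the same machine.)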
>That's very disappointing. Haswell has a 4-wide front end, and the
>majority of AVX2 integer instructions are limited to a throughput of
>two per clock. Golden Cove has a 5+ wide front end, and nearly all
>AVX2 integer instructions have a throughput of three per clock.
>Could it be that clang introduced some sort of latency bottleneck?
As far as I looked into the code, I did not see such a bottleneck.
Also, Zen4 has significantly higher IPC on this variant (5.36 IPC for
clang keylocks2-256), and I expect that it would suffer from a general
latency bottleneck, too. Rocket Lake is also faster on this program
than Haswell and Golden Cove. It seems to be just that this program
rubs Golden Cove the wrong way.
>>I would have expected the clang keylocks1-256 to run slower,
>>because the compiler back-end is the same and the 1315U is slower.
>>Measuring cycles looks more relevant for this benchmark to me
>>than measuring time, especially on this core where AVX-512 is
>>disabled and there is no AVX slowdown.
>
>I prefer time, because in the end it's the only thing that matters.
True, and certainly, when stuff like AVX-512 license-based
downclocking or thermal or power limits come into play (and are
relevant for the measurement at hand), one has to go there. But then
you can only compare code running on the same kind of machine,
configured the same way. Or maybe just running on the same
machine:-). But then, the generality of the results is questionable.
- anton