On Sat, 08 Feb 2025 08:11:04 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
Or by my own pasting mistake. I am still not sure whom to blame.
The mistake was tiny, a missing // at the beginning of one line, but
enough to break compilation. Trying it for a second time:
if (li >= len || li <= 0)
> First cycles (which eliminates worries about turbo modes) and
> instructions, then usec/call.

I don't understand that.
For original code optimized by clang I'd expect 22,000 cycles and 5.15
usec per call on Haswell. Your numbers don't even resemble anything like
that.
> instructions
> 5_779_542_242 gcc   avx2 1
> 3_484_942_148 gcc   avx2 2 8
> 5_885_742_164 gcc   avx2 3 8
> 7_903_138_230 clang avx2 1
> 7_743_938_183 clang avx2 2 8?
> 3_625_338_104 clang avx2 3 8?
> 4_204_442_194 gcc   512  1
> 2_564_142_161 gcc   512  2 32
> 3_061_042_178 gcc   512  3 16
> 7_703_938_205 clang 512  1
> 3_402_238_102 clang 512  2 16?
> 3_320_455_741 clang 512  3 16?

I don't understand these numbers either. For original clang, I'd expect
25,000 instructions per call.
Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
Which could be due to differences in measurement methodology - I
reported the median of 11 runs, you seem to report the average.
> On the Golden Cove of a Core i3-1315U (compared to the best result by
> Terje Mathisen on a Core i7-1365U; the latter can run up to 5.2GHz
> according to Intel, whereas the former can supposedly run up to
> 4.5GHz; I only ever measured at most 3.8GHz on our NUC, and this time
> as well):

I always thought that NUCs have better cooling than all but high-end
laptops. Was I wrong? Such slowness is disappointing.
> 5.25us Terje Mathisen's Rust code compiled by clang (best on the 1365U)
> 4.93us clang keylocks1-256 on a 3.8GHz 1315U
> 4.17us gcc keylocks1-256 on a 3.8GHz 1315U
> 3.16us gcc keylocks2-256 on a 3.8GHz 1315U
> 2.38us clang keylocks2-512 on a 3.8GHz 1315U

So, for the best-performing variant, the IPC of Golden Cove is identical to
that of ancient Haswell?
That's very disappointing. Haswell has a 4-wide front
end and the majority of AVX2 integer instructions are limited to a throughput
of two per clock. Golden Cove has a 5+ wide front end and nearly all AVX2
integer instructions have a throughput of three per clock.
Could it be that clang introduced some sort of latency bottleneck?
> I would have expected the clang keylocks1-256 to run slower, because
> the compiler back-end is the same and the 1315U is slower. Measuring
> cycles looks more relevant for this benchmark to me than measuring
> time, especially on this core where AVX-512 is disabled and there is
> no AVX slowdown.

I prefer time, because in the end it's the only thing that matters.