Subject: Re: Computer architects leaving Intel...
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: 09 Sep 2024, 08:07:25
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Sep9.090725@mips.complang.tuwien.ac.at>
User-Agent: xrn 10.11
Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
>mitchalsup@aol.com (MitchAlsup1) writes:
>>So:
>>    #define memcpy memmove
>
>Incidentally, if one wants to do this, it's advisable to write
>
>    #undef memcpy
>
>before the #define of memcpy.
>
>>and move forward with life--for the 2 extra cycles memmove costs
>>it saves everyone long term grief.
Is it two extra cycles? Here are some data points from
<2017Sep23.174313@mips.complang.tuwien.ac.at>:
Haswell (Core i7-4790K), glibc 2.19
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
14 14 15 15 17 30 48 85 150 281 570 1370 memmove
15 16 13 16 19 32 48 86 161 327 631 1420 memcpy
Skylake (Core i5-6600K), glibc 2.19
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
14 14 14 14 15 27 43 77 147 305 573 1417 memmove
13 14 10 12 14 27 46 85 165 313 607 1350 memcpy
Zen (Ryzen 5 1600X), glibc 2.24
1 8 32 64 128 256 512 1K 2K 4K 8K 16K block size
16 16 16 17 32 43 66 107 177 328 601 1225 memmove
13 13 14 13 38 49 73 116 188 336 610 1233 memcpy
I don't see a consistent speedup of memcpy over memmove here.
However, when one uses memcpy(&var,ptr,8) or the like to perform an
unaligned access, gcc compiles this into a single load (or store)
instruction as long as memcpy is not redefined, but into much slower
code with the redefinition (i.e., when memmove is used instead of
memcpy).
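
For concreteness, the idiom in question looks like this (a minimal
sketch; the function name is my own):

  #include <stdint.h>
  #include <string.h>

  /* Read a 64-bit value from an address of unknown alignment.  gcc
     compiles the memcpy into a single 8-byte load; with memcpy
     redefined to memmove, an actual library call results.  */
  uint64_t load64(const void *ptr)
  {
    uint64_t var;
    memcpy(&var, ptr, 8);
    return var;
  }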

>Simply replacing memcpy() by memmove() of course will always
>work, but there might be negative consequences beyond a cost
>of 2 extra cycles -- for example, if a negative stride is
>better performing than a positive stride, but the nature
>of the compaction forces memmove() to always take the slower
>choice.
If the two memory blocks don't overlap, memmove() can use the fastest
stride. If the two memory blocks overlap, memcpy() as implemented in
glibc is a bad idea.
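
To illustrate the overlap case, consider shifting a buffer up by one
byte (a hypothetical example; what it prints depends entirely on the
memcpy implementation):

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
    char buf[] = "abcdefgh";
    /* dest and src overlap, so this memcpy is undefined behaviour.
       memmove(buf+1, buf, 7) would yield "aabcdefg"; a naive
       forward byte-by-byte copy yields "aaaaaaaa"; a block-oriented
       memcpy may yield something else again.  */
    memcpy(buf + 1, buf, 7);
    printf("%s\n", buf);
    return 0;
  }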
The way to go for memmove() is:

On hardware where the positive stride is faster:

  if ((uintptr_t)(dest-src) >= len)
    return memcpy_posstride(dest, src, len);
  else
    return memcpy_negstride(dest, src, len);

On hardware where the negative stride is faster:

  if ((uintptr_t)(src-dest) >= len)
    return memcpy_negstride(dest, src, len);
  else
    return memcpy_posstride(dest, src, len);
And I expect that my test is undefined behaviour, but most people
except the UB advocates should understand what I mean.
The benefit of this comparison over just comparing the addresses is
that the branch will have a much lower miss rate.
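
Spelled out as (mostly) standard C, the positive-stride variant might
look like the sketch below; my_memmove and the two byte-loop helpers
are stand-ins of my own, and the comparison subtracts uintptr_t
values rather than pointers to sidestep the undefined behaviour
mentioned above:

  #include <stddef.h>
  #include <stdint.h>

  /* Byte-at-a-time stand-ins for the real optimized copy routines. */
  static void *memcpy_posstride(void *dest, const void *src, size_t len)
  {
    unsigned char *d = dest;
    const unsigned char *s = src;
    for (size_t i = 0; i < len; i++)      /* low addresses first */
      d[i] = s[i];
    return dest;
  }

  static void *memcpy_negstride(void *dest, const void *src, size_t len)
  {
    unsigned char *d = dest;
    const unsigned char *s = src;
    for (size_t i = len; i > 0; i--)      /* high addresses first */
      d[i-1] = s[i-1];
    return dest;
  }

  /* Variant for hardware where the positive stride is faster. */
  void *my_memmove(void *dest, const void *src, size_t len)
  {
    /* If dest is not within [src, src+len), a positive-stride copy
       cannot overwrite source bytes it has yet to read; for
       dest < src the unsigned subtraction wraps around to a huge
       value, so a single comparison covers both safe cases.  */
    if ((uintptr_t)dest - (uintptr_t)src >= len)
      return memcpy_posstride(dest, src, len);
    return memcpy_negstride(dest, src, len);
  }

The negative-stride variant is symmetric: swap the operands of the
subtraction and the two helper calls.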
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined
behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>