Michael S <already5chosen@yahoo.com> writes:
>>I am sure that they were aware that this call instruction was
>>expensive, but they expected that it was worth the cost, and also
>>expected that implementors would reduce the cost to below what a
>>sequence of simpler instructions would cost (looking at REP MOVSB in
>>many generations of Intel and AMD CPUs, we see such expectations
>>disappointed; I have not measured recent generations, though).
>
>It depends on what you call "a sequence of simpler instructions".
>For R/E/CX above, say, a dozen, 'rep movsb' is faster than a simple
>non-unrolled loop of single-byte loads and stores on pretty much any
>Intel or AMD CPU since the dawn of time. If we are talking about this
>century, then, at least for Intel, I think we can claim that the same
>is true even relative to a simple loop of 32-bit loads and stores. If
>we replace a dozen with a hundred or three, it becomes true for a
>loop of 64-bit loads/stores as well.
>
>Or, maybe, in your book, 5KB of elaborate code that contains unrolled
>and non-unrolled loops of YMM, XMM, Rxx, Exx, and byte memory
>accesses is still considered 'a sequence of simpler instructions'?
>If that is the case, then I am not going to argue.
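
For concreteness, the two extremes under discussion look roughly like
this (a sketch in GNU C on x86-64; the function names and the
inline-asm wrapper are mine, not code from either of our posts, and
it is illustrative rather than tuned):

#include <stddef.h>

/* copy via REP MOVSB: destination in RDI, source in RSI, count in RCX */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
}

/* the simple non-unrolled loop of single-byte loads and stores */
static void copy_byte_loop(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
}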
My experiments were with the code in
<https://github.com/AntonErtl/move/>.
I posted performance results in
<2017Sep19.082137@mips.complang.tuwien.ac.at>
<2017Sep20.184358@mips.complang.tuwien.ac.at>
<2017Sep23.174313@mips.complang.tuwien.ac.at>
My routines were generally faster than rep movsb, except for pretty
large blocks (16KB).
The longest of the routines is ssememmove at 275 bytes.
I expect that an avx512memmove would be quite a bit smaller, thanks to
predication, but I have not yet written it, nor measured how it
performs.
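
The attraction of predication here is that short blocks and the tail
need no separate byte/word loops; something like the following (an
untested sketch using intrinsics, assuming AVX-512BW; the name is
mine) moves any 0..64-byte piece with one masked load and one masked
store:

#include <immintrin.h>
#include <stddef.h>

/* Copy n (0..64) bytes with a single masked load/store; masked-out
   bytes are neither read nor written, so nothing past the block is
   touched and no byte-granularity tail loop is needed. */
static void copy_tail_avx512(void *dst, const void *src, size_t n)
{
    /* mask with the low n bits set; avoid the undefined shift by 64 */
    __mmask64 k = (n >= 64) ? ~(__mmask64)0 : (((__mmask64)1 << n) - 1);
    __m512i v = _mm512_maskz_loadu_epi8(k, src);
    _mm512_mask_storeu_epi8(dst, k, v);
}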
- anton