Re: memcpy and friend (was: 80286 protected mode)

Subject: Re: memcpy and friend (was: 80286 protected mode)
From: david.brown (at) *nospam* hesbynett.no (David Brown)
Newsgroups: comp.arch
Date: 15 Oct 2024, 12:20:31
Organization: A noiseless patient Spider
Message-ID: <velj60$1lhfe$2@dont-email.me>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0
On 15/10/2024 12:12, Michael S wrote:
On Tue, 15 Oct 2024 10:53:30 +0200
David Brown <david.brown@hesbynett.no> wrote:
 
On 14/10/2024 18:08, Michael S wrote:
On Mon, 14 Oct 2024 17:19:40 +0200
David Brown <david.brown@hesbynett.no> wrote:
  
On 14/10/2024 16:40, Terje Mathisen wrote:
(I'm snipping for space - hopefully not too much.)

 
REP MOVSB on x86 does the canonical memcpy() operation, originally
by moving single bytes, and this was so slow that we also had REP
MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
REP MOVSQ on 64-bit cpus.
>
With a suitable chunk of logic, the basic MOVSB operation could in
fact handle any kinds of alignments and sizes, while doing the
actual transfer at maximum bus speeds, i.e. at least one cache
line/cycle for things already in $L1.
     
>
I agree on all of that.
>
I am quite happy with the argument that suitable hardware can do
these basic operations faster than a software loop or the x86 "rep"
instructions.
>
No, that's not true. And according to my understanding, that's not
what Terje wrote.
REP MOVSB _is_ an almost ideal instruction for memcpy (modulo minor
details - fixed registers for src, dest, len and the Direction flag in
PSW instead of being part of the opcode).
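To make the "fixed registers" remark concrete, here is a minimal sketch of a copy routine built on REP MOVSB. It assumes GCC or Clang extended inline assembly on x86-64 and falls back to memcpy() elsewhere; the name copy_rep_movsb is hypothetical and is not taken from any library.

```c
#include <stddef.h>
#include <string.h>

/* Copy n bytes from src to dst with REP MOVSB where available.
 * Assumes GCC/Clang extended inline assembly on x86-64; on any other
 * target this hypothetical helper simply calls memcpy(). */
void *copy_rep_movsb(void *dst, const void *src, size_t n)
{
    void *ret = dst;
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* The instruction hard-wires its operands: RDI = destination,
     * RSI = source, RCX = byte count. The usual x86-64 calling
     * conventions require the direction flag to be clear on function
     * entry, so the copy runs forwards. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    memcpy(dst, src, n);
#endif
    return ret;
}
```

The three "+D", "+S" and "+c" constraints are where the fixed register assignment shows up: the compiler has to move the arguments into exactly those registers, and all three are modified by the instruction.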
>
My understanding of what Terje wrote is that REP MOVSB /could/ be an
efficient solution if it were backed by a hardware block to run well
(i.e., transferring as many bytes per cycle as memory bus bandwidth
allows).  But REP MOVSB is /not/ efficient - and rather than making
it work faster, Intel introduced variants with wider fixed sizes.
>
Above a count of ~2000 bytes, REP MOVSB on the latest few generations of
Intel and AMD is very efficient.
OK.  That is news to me, and different from what I had thought.

One can construct a case where a software implementation is a little
faster in one or another selected benchmark, but typically at the cost
of being slower in other situations.
For smaller counts the story is different.
 
Could REP MOVSB realistically be improved to be as efficient as the
instructions in ARMv9, RISC-V, and Mitch's "MM" instruction?  I
don't know.  Intel and AMD have had many decades to do so, so I
assume it's not an easy improvement.
>
 You somehow assume that REP MOVSB would have to be improved.
That is certainly what I have been assuming.  I haven't investigated it myself in any way, I've merely inferred it from other posts.  So unless someone else provides more information, I'll take your word for it that at least for modern x86 devices and large copies, it's already about as efficient as it could be.

That remains to be seen.
It's quite likely that when (or 'if', in the case of My 66000) those
alternatives you mention hit silicon we will find out that REP MOVSB is
already better as it is, at least for memcpy(). For memmove(), esp.
for short memmove(), REP MOVSB is easier to beat, because it was not
designed with memmove() in mind.
 
REP MOVSW/D/Q were introduced because back then processors were
small and stupid. When your processor is big and smart you don't
need them any longer. REP MOVSB is sufficient.
The new Arm64 instructions that are hopefully coming next year are akin
to REP MOVSB rather than to MOVSW/D/Q.
Instructions for memmove, also defined by Arm and by Mitch, are the
next logical step. IMHO, the main gain here is not a measurable
improvement in performance, but a saving in code size when inlined.
>
Now, is all that a good idea?
>
That's a very important question.
>
I am not 100% convinced.
One can argue that streaming alignment hardware that is necessary
for 1st-class implementation of these instructions is useful not
only for memory copy.
So maybe it makes sense to expose this hardware in more generic
ways.
>
I believe that is the idea of "scalable vector" instructions as an
alternative philosophy to wide explicit SIMD registers.  My
expectation is that SVE implementations will be more effort in the
hardware than SIMD for any specific SIMD-friendly size point (i.e.,
power-of-two widths).  That usually corresponds to lower clock rates
and/or higher latency and more coordination from extra pipeline
stages.
>
But once you have SVE support in place, then memcpy() and memset()
are just examples of vector operations that you get almost for free
when you have hardware for vector MACs and other operations.
>
You don't seem to understand what the 'S' in SVE is.
Read more manuals. Read less marketing slides.
Or try to write and profile code that utilizes SVE - that would improve
your understanding more than anything else.
 
It means "scalable".  The idea is that the same binary code will use different stride sizes on different hardware - a bigger implementation of the core might have vector units handling wider strides than a smaller one.  Am I missing something?
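As a rough illustration of that reading of "scalable", here is a portable C model of a vector-length-agnostic copy loop. It is only a sketch: the parameter vl is a hypothetical stand-in for the hardware vector length in bytes, and nothing here uses real SVE intrinsics. The point is that the same loop runs unchanged for any vl, with the final iteration masking off the unused lanes.

```c
#include <stddef.h>

/* Portable model of a vector-length-agnostic copy loop. `vl` stands in
 * for the hardware vector length in bytes, which real scalable-vector
 * code would obtain from the hardware at run time; here it is a
 * hypothetical parameter. Returns the number of "vector" iterations. */
size_t copy_vla_model(unsigned char *dst, const unsigned char *src,
                      size_t n, size_t vl)
{
    size_t iterations = 0;

    if (vl == 0)
        return 0;               /* no lanes: nothing is copied */

    for (size_t i = 0; i < n; i += vl) {
        /* Models the predicate: only the lanes that still fall inside
         * the buffer are active in the last iteration. */
        size_t active = (n - i < vl) ? (n - i) : vl;

        for (size_t lane = 0; lane < active; lane++)
            dst[i + lane] = src[i + lane];
        iterations++;
    }
    return iterations;
}
```

Copying 100 bytes takes 7 iterations of this loop with vl = 16 and 2 iterations with vl = 64, without any change to the code, which is the property being described.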

Also, you don't seem to understand the issue at hand, which is exposing
hardware that aligns a *stream* of N+1 aligned loads, turning it into N
unaligned loads.
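A software model may help to show what such a stream aligner does. The sketch below assumes 64-bit words and little-endian byte numbering; the function name realign_stream is hypothetical. It performs n + 1 aligned word reads and produces the n words that begin off bytes into the stream, reusing each loaded word for two outputs.

```c
#include <stddef.h>
#include <stdint.h>

/* Software model of a stream aligner: reads n + 1 aligned 64-bit words
 * from `in` and produces the n words that start `off` bytes (0..7) into
 * the stream. The byte numbering assumes a little-endian machine. */
void realign_stream(uint64_t *out, const uint64_t *in, size_t n,
                    unsigned off)
{
    unsigned shift = 8u * (off & 7u);
    uint64_t lo = in[0];                /* first aligned load */

    for (size_t i = 0; i < n; i++) {
        uint64_t hi = in[i + 1];        /* one new aligned load per output */

        /* shift == 0 is special-cased because shifting a 64-bit value
         * by 64 is undefined behaviour in C. */
        out[i] = shift ? (lo >> shift) | (hi << (64u - shift)) : lo;
        lo = hi;                        /* becomes the next low half */
    }
}
```

In hardware the shift-and-merge step would be a fixed funnel shifter sitting behind the load port; the loop above only models the data flow, not the cost.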
In the absence of a 'load multiple' instruction, 128-bit SVE would help
you here no more than 128-bit NEON. Moreover, 512-bit SVE wouldn't help
enough, even ignoring the absence of any prospect of 512-bit SVE in
mainstream ARM64 cores.
Maybe, at the ISA level, SME is a better base for what is wanted.
But
  - SME would be quite bad for copy of small segments.
I would expect a certain amount of overhead, which will be a cost for small copies.

  - SME does not appear to get much love from Arm vendors other than Apple
If you say so.  My main interest is in microcontrollers, and I don't track all the details of larger devices.

  - SME blocks are expected to be implemented not in close proximity to
    the rest of the CPU core, which would make them problematic not just
    for copying small segments, but for medium-length segments (a few KB)
    as well.
 
That sounds like a poor design choice to me, but again I don't know the details.

Maybe via Load Multiple Register? It was present in Arm's A32/T32,
but didn't make it into ARM64. Or maybe there are even better
ways that I was not thinking about.
  
And I fully agree that these would be useful features
in general-purpose processors.
>
My only point of contention is that the existence or lack of such
instructions does not make any difference to whether or not you can
write a good implementation of memcpy() or memmove() in portable
standard C.
>
You are moving the goalposts.
>
No, my goalposts have been in the same place all the time.  Some
others have been kicking the ball at a completely different set of
goalposts, but I have kept the same point all along.
>
One does not need a "good implementation" in the sense you have in
mind.
>
Maybe not - but /that/ would be moving the goalposts.
>
All one needs is an implementation that the compiler's pattern-matching
logic unmistakably recognizes as memmove/memcpy. That is very
easily done in standard C. For memmove, I had shown how to do it in
one of the posts below. For memcpy it's very obvious, so no need to
show.
>
But that would /not/ be an efficient implementation of memmove() in
plain portable standard C.
>
What do I mean by an "efficient" implementation in fully portable
standard C?  There are two possible ways to think about that.  One is
that the operations on the abstract machine are efficient.  The other
is that the code is likely to result in efficient code over a wide
range of real-world compilers, options, and targets.
No, there is no need for a wide range of compilers or options.
There /is/ a wide range of compilers and options.  If one were to try to make an efficient portable standard C implementation of a function (whether or not it is a standard library function), then it needs to work on any of these compilers with any options as long as they are at least reasonably standards compliant, and it should be reasonably efficient on a large proportion of them.

The standard library (well, maybe I should say the "core of the standard
library"; there is no such thing in the C Standard, but the distinction
exists in many real-world implementations, in particular in gcc) is
compiled with one compiler and one set of options. Or, at most, several
selected sets of options that affect low-level code generation, but do
not affect high-level optimizations.
A range of targets is indeed desirable, but it does not have to be too
wide.
Of course.  That is the /whole/ point.  A C standard library is part of the implementation - it is tied to the compiler, options and target (as tightly or loosely as you want).  When writing a "memmove()" implementation, there is no requirement for it to be portable or limited to standard C - there is no requirement for it to be in C at all.  That is how we have functions like "memmove" at all, despite the fact that they cannot be implemented efficiently in portable standard C.

Besides, you forget that the arguments were about the theoretical
possibility of writing an efficient implementation of memmove() in
Standard C, not about the practicality of doing so.
I have not forgotten that at all.

My example achieves that target easily, and even exceeds it, because
it's obvious that the required pattern matching is not just theoretically
possible. Existing compilers are capable of handling much more complex
cases. They likely cannot handle this particular case, but only
because nobody cared to add a few dozen lines of code to the compiler's
logic.
Just to be clear - your example was this:
void *memmove( void *dest, const void *src, size_t count)
{
  if (count > 0) {
    char tmp[count];
    memcpy(tmp, src, count);
    memcpy(dest, tmp, count);
  }
  return dest;
}
Some existing compilers may recognize that pattern, others do not.  It is certainly true that it is /possible/ for compilers to recognize this pattern.  It is equally certain that virtually all existing C compilers and option combinations do not recognize it.  (Even ones that do, such as clang with -O2, generate a dozen instructions with a call to library memmove() in the middle).  By no conceivable stretch of the imagination is your "solution" here a good, efficient, portable and standard C implementation of memmove().  It may, of course, be a perfectly good implementation for a /specific/ compiler and /specific/ target.
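
For anyone who wants to check what their own compiler makes of this pattern, here is a small self-contained variant of the example. It is a sketch only: the function is renamed to the hypothetical memmove_via_tmp so that it does not clash with the library's memmove(), and it assumes a compiler that supports variable-length arrays, which are optional from C11 onwards.

```c
#include <stddef.h>
#include <string.h>

/* Same logic as the quoted example, under a hypothetical name so that
 * it does not clash with the library's own memmove(). It relies on a
 * variable-length array (optional from C11 onwards), so the whole
 * temporary copy lives on the stack. */
void *memmove_via_tmp(void *dest, const void *src, size_t count)
{
    if (count > 0) {
        char tmp[count];
        memcpy(tmp, src, count);    /* src and tmp never overlap */
        memcpy(dest, tmp, count);   /* tmp and dest never overlap */
    }
    return dest;
}
```

The result is correct for overlapping buffers: moving the first six bytes of "abcdefgh" two places to the right gives "ababcdef". Whether the two memcpy() calls and the temporary collapse into a single move depends entirely on the compiler and options, which can be checked by inspecting the generated assembly.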
