Newsportal USENET - memcpy and friend (was: 80286 protected mode)

On Tue, 15 Oct 2024 10:53:30 +0200
David Brown <david.brown@hesbynett.no> wrote:

On 14/10/2024 18:08, Michael S wrote:
On Mon, 14 Oct 2024 17:19:40 +0200
David Brown <david.brown@hesbynett.no> wrote:

On 14/10/2024 16:40, Terje Mathisen wrote:
David Brown wrote:
On 13/10/2024 21:21, Terje Mathisen wrote:
David Brown wrote:
On 10/10/2024 20:38, MitchAlsup1 wrote:
On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:

On 09/10/2024 23:37, MitchAlsup1 wrote:
On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:

On 09/10/2024 20:10, Thomas Koenig wrote:
David Brown <david.brown@hesbynett.no> schrieb:

When would you ever /need/ to compare pointers to
different objects?
For almost all C programmers, the answer is "never".
>
Sometimes, it is handy to encode certain conditions in
pointers, rather than having only a valid pointer or
NULL.Ã‚Â A compiler, for example, might want to store the
fact that an error occurred while parsing a subexpression
as a special pointer constant.
>
Compilers often have the unfair advantage, though, that
they can rely on what application programmers cannot,
their implementation details.Ã‚Â (Some do not, such as
f2c).
>
Standard library authors have the same superpowers, so that
they can
implement an efficient memmove() even though a pure
standard C programmer cannot (other than by simply calling
the standard library
memmove() function!).
>
This is more a symptom of bad ISA design/evolution than of
libc writers needing superpowers.
>
No, it is not.Ã‚Â It has absolutely /nothing/ to do with the
ISA.
>
For example, if ISA contains an MM instruction which is the
embodiment of memmove() then absolutely no heroics are needed
of desired in the libc call.

>
The existence of a dedicated assembly instruction does not let
you write an efficient memmove() in standard C.Â That's why I
said there was no connection between the two concepts.
>
For some targets, it can be helpful to write memmove() in
assembly or using inline assembly, rather than in non-portable
C (which is the common case).

Thus, it IS a symptom of ISA evolution that one has to rewrite
memmove() every time wider SIMD registers are available.
>
It is not that simple.
>
There can often be trade-offs between the speed of memmove()
and memcpy() on large transfers, and the overhead in setting
things up that is proportionally more costly for small
transfers.Â Often that can be eliminated when the compiler
optimises the functions inline - when the compiler knows the
size of the move/copy, it can optimise directly.
>
What you are missing here David is the fact that Mitch's MM is a
single instruction which does the entire memmove() operation,
and has the inside knowledge about cache (residency at level x?
width in bytes)/memory ranges/access rights/etc needed to do so
in a very close to optimal manner, for both short and long
transfers.
>
I am not missing that at all. And I agree that an advanced
hardware MM instruction could be a very efficient way to
implement both memcpy and memmove. (For my own kind of work,
I'd worry about such looping instructions causing an unbounded
increased in interrupt latency, but that too is solvable given
enough hardware effort.)
>
And I agree that once you have an "MM" (or similar) instruction,
you don't need to re-write the implementation for your memmove()
and memcpy() library functions for every new generation of
processors of a given target family.
>
What I /don't/ agree with is the claim that you /do/ need to keep
re-writing your implementations all the time. You will
/sometimes/ get benefits from doing so, but it is not as simple
as Mitch made out.
>
I.e. totally removing the need for compiler tricks or wide
register operations.
>
Also apropos the compiler library issue:
>
You start by teaching the compiler about the MM instruction, and
to recognize common patterns (just as most compilers already do
today), and then the memmove() calls will usually be inlined.

>
The original compile library issue was that it is impossible to
write an efficient memmove() implementation using pure portable
standard C. That is independent of any ISA, any specialist
instructions for memory moves, and any compiler optimisations.
And it is independent of the fact that some good compilers can
inline at least some calls to memcpy() and memmove() today, using
whatever instructions are most efficient for the target.
>
David, you and Mitch are among my most cherished writers here on
c.arch, I really don't think any of us really disagree, it is just
that we have been discussing two (mostly) orthogonal issues.
>
I agree. It's a "god dag mann, økseskaft" situation.
>
I have a huge respect for Mitch, his knowledge and experience, and
his willingness to share that freely with others. That's why I
have found this very frustrating.

>
a) memmove/memcpy are so important that people have been spending
a lot of time & effort trying to make it faster, with the
complication that in general it cannot be implemented in pure C
(which disallows direct comparison of arbitrary pointers).

>
Yes.
>
(Unlike memmov(), memcpy() can be implemented in standard C as a
simple byte-copy loop, without needing to compare pointers. But an
implementation that copies in larger blocks than a byte requires
implementation dependent behaviour to determine alignments, or it
must rely on unaligned accesses being allowed by the
implementation.)
b) Mitch have, like Andy ("Crazy") Glew many years before,
realized that if a cpu architecture actually has an instruction
designed to do this particular job, it behooves cpu architects to
make sure that it is in fact so fast that it obviates any need
for tricky coding to replace it.
>
Yes.

Ideally, it should be able to copy a single object, up to a cache
line in size, in the same or less time needed to do so manually
with a SIMD 512-bit load followed by a 512-bit store (both ops
masked to not touch anything it shouldn't)

>
Yes.

REP MOVSB on x86 does the canonical memcpy() operation, originally
by moving single bytes, and this was so slow that we also had REP
MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
REP MOVSQ on 64-bit cpus.
>
With a suitable chunk of logic, the basic MOVSB operation could in
fact handle any kinds of alignments and sizes, while doing the
actual transfer at maximum bus speeds, i.e. at least one cache
line/cycle for things already in $L1.

>
I agree on all of that.
>
I am quite happy with the argument that suitable hardware can do
these basic operations faster than a software loop or the x86 "rep"
instructions.

No, that's not true. And according to my understanding, that's not
what Terje wrote.
REP MOVSB _is_ almost ideal instruction for memcpy (modulo minor
details - fixed registers for src, dest, len and Direction flag in
PSW instead of being part of the opcode).

My understanding of what Terje wrote is that REP MOVSB /could/ be an
efficient solution if it were backed by a hardware block to run well
(i.e., transferring as many bytes per cycle as memory bus bandwidth
allows). But REP MOVSB is /not/ efficient - and rather than making
it work faster, Intel introduced variants with wider fixed sizes.
>

Above count of ~2000 byte REP MOVSB on few latest generations of Intel
and AMD is very efficient.
One can construct a case where software implementation is a little
faster in one or another selected benchmark, but typically at cost
of being slower in other situations.
For smaller counts a story is different.

Could REP MOVSB realistically be improved to be as efficient as the
instructions in ARMv9, RISC-V, and Mitch'es "MM" instruction? I
don't know. Intel and AMD have had many decades to do so, so I
assume it's not an easy improvement.
>

You somehow assume that REP MOVSB would have to be improved. That
remains to be seen.
It's quite likely that when (or 'if', in case of My 66000) those
alternatives you mention hit silicon we will find out that REP MOVSB is
already better as it is, at least for memcpy(). For memmove(), esp.
for short memmove(), REP MOVSB is easier to beat, because it was not
designed with memmove() in mind.

REP MOVSW/D/Q were introduced because back then processors were
small and stupid. When your processor is big and smart you don't
need them any longer. REP MOVSB is sufficient.
New Arm64 instruction that are hopefully coming next year are akin
to REP MOVSB rather than to MOVSW/D/Q.
Instructions for memmove, also defined by Arm and by Mitch, is the
next logical step. IMHO, the main gain here is not measurable
improvement in performance, but saving of code size when inlined.

Now, is all that a good idea?

That's a very important question.

I am not 100% convinced.
One can argue that streaming alignment hardware that is necessary
for 1st-class implementation of these instructions is useful not
only for memory copy.
So, may be, it makes sense to expose this hardware in more generic
ways.

I believe that is the idea of "scalable vector" instructions as an
alternative philosophy to wide explicit SIMD registers. My
expectation is that SVE implementations will be more effort in the
hardware than SIMD for any specific SIMD-friendly size point (i.e.,
power-of-two widths). That usually corresponds to lower clock rates
and/or higher latency and more coordination from extra pipeline
stages.

But once you have SVE support in place, then memcpy() and memset()
are just examples of vector operations that you get almost for free
when you have hardware for vector MACs and other operations.
>

You don't seem to understand what is 'S' in SVE.
Read more manuals. Read less marketing slides.
Or try to write and profile code that utilizes SVE - that would improve
your understanding more than anything else.

Also, you don't seem to understand an issue at hand, which is exposing
a hardware that aligns *stream* of N+1 aligned loads turning it into N
unaligned loads.
In absence of 'load multiple' instruction 128-bit SVE would help you
here no more than 128-bit NEON. More so, 512-bit SVE wouldn't help
enough, even ignoring absence of prospect of 512-bit SVE in mainstream
ARM64 cores.
May be, at ISA level, SME is a better base to what is wanted.
But
- SME would be quite bad for copy of small segments.
- SME does not appear to get much love by Arm vendors others than Apple
- SME blocks are expected to be implemented not in close proximity to
   the rest of the CPU core, which would make them problematic not just
   for copying small segment, but for medium-length segments (few KB)
   as well.

May be, via Load Multiple Register? It was present in Arm's A32/T32,
but didn't make it into ARM64. Or, may be, there are even better
ways that I was not thinking about.

And I fully agree that these would be useful features
in general-purpose processors.
>
My only point of contention is that the existence or lack of such
instructions does not make any difference to whether or not you can
write a good implementation of memcpy() or memmove() in portable
standard C.

You are moving a goalpost.

No, my goalposts have been in the same place all the time. Some
others have been kicking the ball at a completely different set of
goalposts, but I have kept the same point all along.

One does not need "good implementation" in a sense you have in
mind.

Maybe not - but /that/ would be moving the goalposts.

All one needs is an implementation that pattern matching logic of
compiler unmistakably recognizes as memove/memcpy. That is very
easily done in standard C. For memmove, I had shown how to do it in
one of the posts below. For memcpy its very obvious, so no need to
show.

But that would /not/ be an efficient implementation of memmove() in
plain portable standard C.

What do I mean by an "efficient" implementation in fully portable
standard C? There are two possible ways to think about that. One is
that the operations on the abstract machine are efficient. The other
is that the code is likely to result in efficient code over a wide
range of real-world compilers, options, and targets.

No, there is no need for wide range of compilers or option.
Standard library (well, may be, I should say "core of standard
library", there is no such thing in the C Standard, but distinctions
exists in many real world implementations, in particular, in gcc) is
compiled with one compiler and one set of options. Or, at most, several
selected sets of options that affect low level code generation, but do
not affect high level optimizations.
Range of targets is indeed desirable, but it does not have to be too
wide.

Besides, you forget that arguments were about theoretical possibility
of writing efficient implementation of memmove() in Standard C, not
about practicality of doing so.
My example achieves that target easily, and even exceeds it, because
it's obvious that required pattern matching is not just theoretically
possible. Existing compilers are capable to handle much more complex
cases. They likely can not handle this particular case, but only
because nobody cared to add few dozens lines of code to compiler's
logic.

And I think it
goes without saying that the implementation must not rely on any
implementation-defined behaviour or anything beyond the minimal
limits given in the C standards, and it must not introduce any new
real or potential UB.

Your "memmove()" implementation fails on several counts. It is
inefficient in the abstract machine - it copies everything twice
instead of once. It is inefficient in real-world implementations of
all sorts and countless targets - being efficient for some compilers
with some options on some targets (most of them hypothetical) does
/not/ qualify as an efficient implementation. And quite clearly it
risks causing failures from stack overflow in situations where the
user would normally expect memmove() to function safely (on
implementations other than those few that turn it into efficient
object code).

They would make it easier to write efficient
implementations of these standard library functions for targets
that had such instructions - but that would be
implementation-specific code. And that is one of the reasons that
C standard library implementations are tied to the specific
compiler and target, and the writers of these libraries have
"superpowers" and are not limited to standard C.

Date	Sujet	#	Auteur
1 Oct 24	Re: Whether something is RISC or not (Re: PDP-8 theology, not Concertina II Progress)	387	MitchAlsup1
1 Oct 24	Re: Whether something is RISC or not (Re: PDP-8 theology, not Concertina II Progress)	386	Thomas Koenig
1 Oct 24	Re: Whether something is RISC or not (Re: PDP-8 theology, not Concertina II Progress)	379	MitchAlsup1
2 Oct 24	Re: Whether something is RISC or not (Re: PDP-8 theology, not Concertina II Progress)	377	Brett
3 Oct 24	Re: Whether something is RISC or not (Re: PDP-8 theology, not Concertina II Progress)	376	Lawrence D'Oliveiro
3 Oct 24	Re: Whether something is RISC or not (Re: PDP-8 theology, not Concertina II Progress)	1	Brett
3 Oct 24	Re: Whether something is RISC or not (Re: PDP-8 theology, not Concertina II Progress)	1	Anton Ertl
3 Oct 24	Re: Whether something is RISC or not (Re: PDP-8 theology, not Concertina II Progress)	373	David Brown
3 Oct 24	Byte ordering (was: Whether something is RISC or not)	372	Anton Ertl
3 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	1	David Brown
3 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	369	Lawrence D'Oliveiro
4 Oct 24	Re: Byte ordering	1	Lynn Wheeler
4 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	365	David Brown
4 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	364	Anton Ertl
4 Oct 24	Re: Byte ordering	5	BGB
5 Oct 24	Re: Byte ordering	4	MitchAlsup1
5 Oct 24	Re: Byte ordering	2	BGB
5 Oct 24	Re: Byte ordering	1	Lawrence D'Oliveiro
5 Oct 24	Re: Byte ordering	1	Lawrence D'Oliveiro
5 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	13	Lawrence D'Oliveiro
5 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	12	Brett
5 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	11	Anton Ertl
5 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	10	Michael S
6 Oct 24	Re: Byte ordering	1	Terje Mathisen
6 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	8	Brett
7 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	7	Lawrence D'Oliveiro
7 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	6	Brett
7 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	5	Michael S
7 Oct 24	Re: Byte ordering	2	Stefan Monnier
7 Oct 24	Re: Byte ordering	1	Michael S
7 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	2	Lawrence D'Oliveiro
8 Oct 24	Re: Byte ordering	1	Terje Mathisen
6 Oct 24	Re: Byte ordering	345	David Brown
6 Oct 24	Re: Byte ordering	344	Anton Ertl
6 Oct 24	Re: Byte ordering	189	John Dallman
7 Oct 24	Re: Byte ordering	20	Lawrence D'Oliveiro
8 Oct 24	Re: Byte ordering	19	John Dallman
9 Oct 24	VMS/NT memory management (was: Byte ordering)	1	Stefan Monnier
15 Oct 24	Re: Byte ordering	2	Lawrence D'Oliveiro
15 Oct 24	Re: Byte ordering	1	MitchAlsup1
15 Oct 24	Re: Byte ordering	15	Lawrence D'Oliveiro
15 Oct 24	Re: Byte ordering	3	Michael S
15 Oct 24	Re: Byte ordering	1	John Dallman
18 Oct 24	Re: Byte ordering	1	Lawrence D'Oliveiro
15 Oct 24	Re: Byte ordering	9	John Dallman
16 Oct 24	Re: Byte ordering	7	George Neuner
16 Oct 24	Re: Byte ordering	6	Terje Mathisen
16 Oct 24	Re: Byte ordering	5	David Brown
17 Oct 24	Re: Byte ordering	2	George Neuner
17 Oct 24	Re: Byte ordering	1	David Brown
17 Oct 24	Re: clouds, not Byte ordering	2	John Levine
17 Oct 24	Re: clouds, not Byte ordering	1	David Brown
18 Oct 24	Re: Byte ordering	1	Lawrence D'Oliveiro
16 Oct 24	Re: Byte ordering	2	Paul A. Clayton
18 Oct 24	Re: Microkernels & Capabilities (was Re: Byte ordering)	1	Lawrence D'Oliveiro
7 Oct 24	80286 protected mode	168	Anton Ertl
7 Oct 24	Re: 80286 protected mode	5	Lars Poulsen
7 Oct 24	Re: 80286 protected mode	4	Terje Mathisen
7 Oct 24	Re: 80286 protected mode	1	Michael S
7 Oct 24	Re: 80286 protected mode	2	Lawrence D'Oliveiro
8 Oct 24	Re: 80286 protected mode	1	Terje Mathisen
7 Oct 24	Re: 80286 protected mode	3	Brett
7 Oct 24	Re: 80286 protected mode	2	Michael S
7 Oct 24	Re: 80286 protected mode	1	Brett
7 Oct 24	Re: 80286 protected mode	1	Lawrence D'Oliveiro
8 Oct 24	Re: 80286 protected mode	152	MitchAlsup1
8 Oct 24	Re: 80286 protected mode	4	Lawrence D'Oliveiro
8 Oct 24	Re: 80286 protected mode	3	MitchAlsup1
9 Oct 24	Re: 80286 protected mode	1	David Brown
15 Oct 24	Re: 80286 protected mode	1	Lawrence D'Oliveiro
8 Oct 24	Re: 80286 protected mode	147	Anton Ertl
8 Oct 24	Re: 80286 protected mode	1	Robert Finch
9 Oct 24	Re: 80286 protected mode	145	David Brown
9 Oct 24	Re: 80286 protected mode	79	MitchAlsup1
9 Oct 24	Re: 80286 protected mode	78	David Brown
9 Oct 24	Re: 80286 protected mode	77	Stephen Fuld
10 Oct 24	Re: 80286 protected mode	2	MitchAlsup1
10 Oct 24	Re: 80286 protected mode	1	David Brown
10 Oct 24	Re: 80286 protected mode	1	David Brown
11 Oct 24	Re: 80286 protected mode	73	Tim Rentsch
15 Oct 24	Re: 80286 protected mode	72	Stefan Monnier
15 Oct 24	Re: 80286 protected mode	30	MitchAlsup1
16 Oct 24	Re: 80286 protected mode	25	MitchAlsup1
16 Oct 24	Re: C and turtles, 80286 protected mode	13	John Levine
16 Oct 24	Re: C and turtles, 80286 protected mode	7	MitchAlsup1
16 Oct 24	Re: C and turtles, 80286 protected mode	6	John Levine
17 Oct 24	Re: C and turtles, 80286 protected mode	5	Thomas Koenig
20 Oct 24	Re: C and turtles, 80286 protected mode	4	Lawrence D'Oliveiro
20 Oct 24	Re: C and turtles, 80286 protected mode	3	George Neuner
22 Oct 24	Re: C and turtles, 80286 protected mode	2	Tim Rentsch
22 Oct 24	Re: C and turtles, 80286 protected mode	1	George Neuner
16 Oct 24	Re: C and turtles, 80286 protected mode	1	David Brown
16 Oct 24	Re: C and turtles, 80286 protected mode	4	Paul A. Clayton
17 Oct 24	Re: C and turtles, 80286 protected mode	1	David Brown
20 Oct 24	Re: C and turtles, 80286 protected mode	2	Lawrence D'Oliveiro
20 Oct 24	Re: C and turtles, 80286 protected mode	1	Paul A. Clayton
16 Oct 24	Re: 80286 protected mode	7	Thomas Koenig
16 Oct 24	Re: 80286 protected mode	2	MitchAlsup1
17 Oct 24	Re: 80286 protected mode	1	Tim Rentsch
17 Oct 24	Re: 80286 protected mode	4	Tim Rentsch
17 Oct 24	Re: fine points of dynamic memory allocation, not 80286 protected mode	3	John Levine
17 Oct 24	Re: 80286 protected mode	3	George Neuner
17 Oct 24	Re: 80286 protected mode	1	Tim Rentsch
16 Oct 24	Re: 80286 protected mode	3	David Brown
17 Oct 24	Re: 80286 protected mode	1	Tim Rentsch
16 Oct 24	Re: 80286 protected mode	41	David Brown
9 Oct 24	Re: 80286 protected mode	51	Thomas Koenig
13 Oct 24	Re: 80286 protected mode	14	Anton Ertl
8 Oct 24	Re: 80286 protected mode	6	John Levine
3 Jan 25	Re: Byte ordering	154	Waldek Hebisch
6 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	2	Michael S
3 Oct 24	Re: Byte ordering (was: Whether something is RISC or not)	1	John Dallman
2 Oct 24	Re: Whether something is RISC or not (Re: PDP-8 theology, not Concertina II Progress)	1	Thomas Koenig
2 Oct 24	Re: Whether something is RISC or not (Re: PDP-8 theology, not Concertina II Progress)	5	David Schultz
3 Oct 24	Re: Whether something is RISC or not (Re: PDP-8 theology, not Concertina II Progress)	1	Lawrence D'Oliveiro