Subject: Re: rep movsb vs. simpler instructions for memcpy/memmove
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: 14 Mar 2025, 14:18:37
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Mar14.141837@mips.complang.tuwien.ac.at>
References: 1 2 3 4 5 6 7 8 9 10 11
User-Agent: xrn 10.11
Michael S <already5chosen@yahoo.com> writes:
>On Thu, 13 Mar 2025 21:42:25 GMT
>scott@slp53.sl.home (Scott Lurndal) wrote:
>
>>Michael S <already5chosen@yahoo.com> writes:
>>>On Thu, 13 Mar 2025 19:35:33 +0000
>>>mitchalsup@aol.com (MitchAlsup1) wrote:
>>>>And they have "So Many" extra burdens, such as when from is MMI/O
>>>>space access and to is cache coherent, and all sorts of other self
>>>>imposed problems.
>>
>>This case is pretty useful in practice.
>>Although mostly done with DMA controllers in these modern times
>>to offload from the CPU.
>
>For up to a few hundred bytes it would be slower. For a few thousand
>bytes it could be faster at the transfer level, but the data ends up
>in the wrong place in the memory hierarchy, too far away from the
>ultimate consumer, so it is still slower from the "full job done"
>perspective. And the CPU time that you "saved" by offloading is almost
>always just uselessly wasted in an idle loop.
The usual case where "from" is memory-mapped I/O and "to" is
cache-coherent is when loading from an NVMe SSD. AFAIK this is
usually done with larger block sizes, because of the overhead of
setting up the DMA, and is usually done asynchronously.
As for the wrong level: The DMA engine transfers the data to the CPU
chip in any case: it contains all caches and the DRAM controller. It
might put the data in, e.g., L3 cache, marked dirty, for later
writeback to DRAM, and if a CPU accesses that memory soon, it will
only see the latency and bandwidth limits of L3.
I have certainly read about a project for high-speed network routing
where the network cards deliver the packets to L3, and the routing
software has to process each packet in an average of 70ns; if the
packets were delivered to DRAM, that speed would be impossible.
As for the "transfer level speed", I would not know why delivering to
DRAM should be faster than delivering to L3, L2, or L1. On the
contrary, it seems to me that delivering to DRAM is at least as slow
as the other variants.
In any case, that's not what most uses of memcpy(), memmove(), or rep
movsb, with their synchronous interfaces, are about.
>>>>Using MTRRs one can switch the kind of memory
>>>>to and from point in the middle of a REP MOVs.
>>>
>>>How exactly?
>>
>>The REP MOV straddles the boundary between two MTRRs.
>
>Crossing a boundary that way can typically be predicted far in
>advance, so it is not really a big problem.
It does not happen in practice, so making it fast or "optimal" by
using a prediction is not necessary.
- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined
behavior.'  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>