Subject: Re: "Mini" tags to reduce the number of op codes
From: paaronclayton (at) *nospam* gmail.com (Paul A. Clayton)
Newsgroups: comp.arch
Date: 21 Apr 2024, 00:19:53
Organization: A noiseless patient Spider
Message-ID: <v038qo$bmtm$3@dont-email.me>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.0
On 4/11/24 7:12 PM, MitchAlsup1 wrote:
> Scott Lurndal wrote:
>> [snip]
>> It seems to me that an offloaded DMA engine would be a far
>> better way to do memmove (over some threshold, perhaps a
>> cache line) without trashing the caches. Likewise memset.
> Effectively, that is what HW does. Even on the lower-end
> machines, the AGEN unit of the cache access pipeline is
> repeatedly cycled, and data is read and/or written. One can
> execute instructions not needing memory references while LDM,
> STM, ENTER, EXIT, MM, and MS are in progress.
> Moving this sequencer farther out would still require it to
> consume all L1 BW in any event (snooping) for memory
> consistency reasons.
> {Note: cache accesses are performed line-wide, not
> register-width wide.}
If the data was not in L1 cache, only its absence would need to be
determined by the DMA engine. A snoop filter, tag-inclusive L2/L3
probing, or similar mechanism could avoid L1 accesses. Even if the
source or destination for a memory copy was in L1, only one L1
access per cache line might be needed.
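As a toy sketch of that per-line decision (everything here is
invented for illustration: l1_may_have[] is a bit per line standing
in for a real snoop filter or inclusive-tag probe):

  /* Line-granular copy engine that probes L1 only for lines the
   * filter says might be cached there; other lines come from
   * L2/L3/memory without disturbing L1 at all. */
  #include <stddef.h>
  #include <string.h>

  #define LINE 64

  static unsigned char l1_may_have[1u << 20]; /* toy filter, 1/line */
  static unsigned long l1_probes;             /* line-wide L1 accesses */

  void dma_copy(char *dst, const char *src, size_t nlines,
                size_t first_line)
  {
      for (size_t i = 0; i < nlines; i++) {
          if (l1_may_have[first_line + i])
              l1_probes++;    /* at most one L1 access for this line */
          /* else: fetched from L2/L3/memory, L1 never consulted */
          memcpy(dst + i * LINE, src + i * LINE, LINE);
      }
  }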
I also wonder if the cache fill and/or spill mechanism might be
decoupled from the load/store path such that, if the cache had
enough banks/subarrays, some loads and stores could be done in
parallel with a cache fill or spill/external-read-without-eviction.
Tag checking would limit the utility of such decoupling, though
tags might also be banked or tag accesses flexibly scheduled (at
the cost of choosing a victim early for fills). Of course, if the
cache has such bandwidth available, why not make it available to
the core as well, even if it were rarely useful? (Perhaps higher
register bandwidth might be more difficult to provide than higher
cache bandwidth for banking-friendly patterns?)
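To make the banking constraint concrete, a minimal model (NBANKS
and the line-interleaved mapping are assumptions; tag-port
conflicts are ignored):

  /* Under line-interleaved banking a fill occupies one data bank
   * for a cycle, so an independent load can proceed in the same
   * cycle iff it maps to a different bank.  Tag checking is a
   * separate constraint not modeled here. */
  #include <stdbool.h>
  #include <stdint.h>

  #define LINE   64
  #define NBANKS 8

  static unsigned bank_of(uintptr_t addr)
  {
      return (addr / LINE) % NBANKS;   /* line-interleaved mapping */
  }

  static bool load_can_issue_with_fill(uintptr_t load_addr,
                                       uintptr_t fill_addr)
  {
      return bank_of(load_addr) != bank_of(fill_addr);
  }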
Deciding when to bypass cache seems difficult (for both software
developers and hardware). Overwriting cache lines within the same
memory copy is obviously silly. Filling a cache with a memory copy
is also suboptimal, but L1 hardware copy-on-write would probably
be too complicated even with page aligned copies. A copy from
cacheable memory to uncacheable memory (I/O) might be a strong
hint that the source should not be installed into L1 or L2 cache,
but I would guess that not installing the source would often be
the right choice.
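The software analogue of the bypass decision already exists in the
form of non-temporal stores; a sketch (the half-LLC threshold and
the LLC size are guesses, and tail/alignment handling is omitted):

  /* Copy normally when small; above a size threshold, use
   * non-temporal stores so the destination does not evict useful
   * cache lines.  Assumes dst is 16-byte aligned and n is a
   * multiple of 16 on the streaming path. */
  #include <emmintrin.h>   /* SSE2 intrinsics */
  #include <stddef.h>
  #include <string.h>

  #define LLC_BYTES (8UL * 1024 * 1024)   /* assumed LLC size */

  void copy_maybe_bypass(void *dst, const void *src, size_t n)
  {
      if (n < LLC_BYTES / 2) {            /* small: let caches work */
          memcpy(dst, src, n);
          return;
      }
      __m128i *d = (__m128i *)dst;
      const __m128i *s = (const __m128i *)src;
      for (size_t i = 0; i < n / 16; i++)
          _mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));
      _mm_sfence();                       /* order the NT stores */
  }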
I could also imagine a programmer wanting to use memory copy as a
prefetch *directive* for a large chunk of memory (by having source
and destination be the same). This idiom would be easy to detect
(the from and to base registers being the same), but it may be too
niche to be worth detecting (for most implementations).
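On the software side the idiom would just be a self-copy; note
that an optimizing compiler might elide the call entirely, which
is part of why it may be too fragile to rely on:

  /* "Prefetch this range" spelled as a copy onto itself.  Safe
   * because memmove permits overlap; whether hardware treats it
   * as a prefetch is entirely implementation-defined, and the
   * compiler may remove the call altogether. */
  #include <string.h>

  static inline void prefetch_range(void *p, size_t n)
  {
      memmove(p, p, n);   /* from == to: the detectable idiom */
  }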
(My 66000 might use an idiom with a prefetch instruction preceding
a memory move to indicate the cache level of the destination but
that only manages [some of] the difficulty of the hardware
choice.)
For memset, compression is also an obvious possibility. A memset
might not write any cache lines at all, but rather record the
address range and the set value, with hardware performing a
copy-on-access to materialize cache lines as they are touched.
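A sketch of the bookkeeping that implies (struct and function
names are invented; a real design would need multiple ranges and
interaction with coherence):

  /* Record the range and fill byte instead of writing lines; a
   * line is synthesized only when it is actually accessed. */
  #include <stdbool.h>
  #include <stdint.h>
  #include <string.h>

  #define LINE 64

  struct pending_memset {
      uintptr_t base, limit;   /* [base, limit) covered */
      uint8_t   value;         /* the fill byte */
      bool      valid;
  };

  /* On a demand access to addr: if the pending memset covers the
   * whole line, materialize it instead of reading memory. */
  static bool fill_from_pending(const struct pending_memset *pm,
                                uintptr_t addr, uint8_t buf[LINE])
  {
      uintptr_t line = addr & ~(uintptr_t)(LINE - 1);
      if (pm->valid && line >= pm->base && line + LINE <= pm->limit) {
          memset(buf, pm->value, LINE);   /* copy-on-access */
          return true;
      }
      return false;
  }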