Subject : Re: Another security vulnerability
From : paaronclayton (at) *nospam* gmail.com (Paul A. Clayton)
Newsgroups : comp.arch
Date : 29. Mar 2024, 02:51:51
Organisation : A noiseless patient Spider
Message-ID : <uu56rq$3u2ve$1@dont-email.me>
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.0
On 3/28/24 3:59 PM, MitchAlsup1 wrote:
> EricP wrote:
[snip]
>> (d) any new pointers are virtual addresses that have to start
>> back at the Load Store Queue for VA translation and forwarding
>> testing after applying (a), (b), and (c) above.
> This is the tidbit that prevents doing prefetches at/in the DRAM
> controller. The address so fetched needs translation!! And this
> requires dragging stuff over to the DRC (DRAM controller) that
> is not normally done.
With multiple memory channels having independent memory
controllers (a reasonable design, I suspect), a memory controller
may have to send the prefetch request to another memory
controller anyway. If the prefetch has to take a trip on the
on-chip network, a "minor side trip" for translation might not be
horrible (though it seems distasteful to me).
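
As a rough sketch of that ordering constraint (translate first,
then pick the channel), with every name below invented purely for
illustration:

/* Hypothetical sketch: a pointer-chase prefetch near memory must
 * translate the loaded virtual address before it can even know
 * which channel owns the target line. All names are invented. */
#include <stdint.h>

#define LINE_BYTES 64
#define N_CHANNELS 8    /* assumed power-of-two channel interleave */

extern uint64_t translate_va_to_pa(uint64_t va);  /* assumed TLB/walk service */
extern void send_prefetch_to_channel(unsigned ch, uint64_t line_pa);

void prefetch_loaded_pointer(uint64_t loaded_value /* a VIRTUAL address */)
{
    uint64_t pa = translate_va_to_pa(loaded_value);  /* the "side trip" */
    unsigned ch = (unsigned)((pa / LINE_BYTES) % N_CHANNELS);
    /* ch need not be the channel that supplied the pointer, so the
     * request may have to hop the on-chip network anyway. */
    send_prefetch_to_channel(ch, pa & ~(uint64_t)(LINE_BYTES - 1));
}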
With the Mill having translation at last-level-cache miss, such
prefetching may be more natural, *but* distributing the virtual
address translation hardware along with the memory controllers
seems challenging when one wants to minimize hops.
[snip]
>> BUT, as PIUMA proposes, we also allow the memory subsystem to
>> read and write individual aligned 8-byte values from DRAM,
>> rather than whole cache lines, so we only move the actual
>> 8-byte values we need.
> Busses on cores are reaching the stage where an entire cache
> line is transferred in 1 cycle. With such busses, why define
> anything smaller than a cache line?? {other than uncacheable
> accesses}
The Intel research chip was special-purpose, targeting
cache-unfriendly code. Reading 64 bytes when, 99% of the time, 56
of those bytes would go unused is rather wasteful (and having
more memory channels helps under high thread count).
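
To put rough numbers on that: with one useful 8-byte value per
64-byte line, only 8/64 = 12.5% of the transferred bytes are
used; the other 56/64 = 87.5% of DRAM bandwidth is waste, so
8-byte access offers up to an 8x effective-bandwidth gain for
such pointer-chasing workloads.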
However, even for a "general purpose" processor, "word"-granular
atomic operations could justify not having all data transfers be
cache line sized. (Such operations are rare compared with cache
line loads from memory or other caches, but a design might have
narrower connections for coherence, interrupts, etc. that could
be used for small data communication.)
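
For a concrete (if minimal) example of such a word-granular
operation, in C11:

/* A word-granular atomic: only these 8 bytes ever need to move.
 * With remote/at-memory atomics this could be serviced by a small
 * message rather than by migrating a 64-byte line between caches. */
#include <stdatomic.h>
#include <stdint.h>

_Atomic uint64_t counter;

void bump(void)
{
    atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}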
In-cache compression might also nudge the tradeoffs, since data
transmitted in compressed form gives higher effective bandwidth.
"Lossy compression", where the recipient does not care about much
of the data, would allow compression even when the data itself is
not compressible. For contiguous useful data, this is comparable
to a smaller cache line.
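
A minimal sketch of what such a "lossy" transfer format could
look like (the layout is invented for illustration, not any real
protocol):

/* Sketch: a per-word valid mask lets the sender ship only the words
 * the recipient asked for, even when the data is incompressible.
 * Wire cost: 1 mask byte + 8 bytes per wanted word, vs. 64 bytes. */
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_LINE 8

struct line_msg {
    uint8_t  valid;                  /* bit i set => word i present */
    uint64_t words[WORDS_PER_LINE];  /* only the valid words, packed */
};

static size_t pack_line(const uint64_t line[WORDS_PER_LINE],
                        uint8_t wanted, struct line_msg *out)
{
    size_t n = 0;
    out->valid = wanted;
    for (int i = 0; i < WORDS_PER_LINE; i++)
        if (wanted & (1u << i))
            out->words[n++] = line[i];
    return 1 + n * sizeof(uint64_t);   /* bytes on the wire */
}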