MitchAlsup1 wrote: It seems to me that once the core has identified an address and an offset
from that address contains another address (foo->next, foo->prev) that only those are prefetched. So this depends on placing next as the first
container in a structure and remains dependent on chasing next a lot more
often than chasing prev.
Otherwise, knowing a loaded value contains a pointer to a structure (or array)
one cannot predict what to prefetch unless one can assume the offset into the
struct (or array).
Right, this is the problem that these "data memory-dependent" prefetchers,
like the one described in that "Intel Programmable and Integrated Unified Memory
Architecture (PIUMA)" paper referenced by Paul Clayton, are trying to solve.
The pointer field to chase can be
(a) at a +/- offset from the current pointer's virtual address
(b) at a different offset for each iteration
(c) conditional on some other field at some other offset
and most important:
(d) any new pointers are virtual addresses that have to start back at
the Load Store Queue for VA translation and forwarding testing
after applying (a), (b) and (c) above.
(This is the tidbit that prevents doing prefetches at/in the DRAM controller.)
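A minimal sketch in C of why cases (a)-(c) defeat a pattern-matching prefetcher; the node type and field names here are hypothetical, invented for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node type illustrating (a)-(c): which link to chase next
 * depends on `kind` (case c), and the link can sit at a per-node signed
 * byte offset (cases a and b) rather than at one fixed position. */
struct tnode {
    int      kind;       /* (c): selects how the next link is found   */
    int16_t  link_off;   /* (a)/(b): signed byte offset of the link   */
    struct tnode *left;
    struct tnode *right;
};

/* Only this code knows where the next pointer lives; a hardware
 * prefetcher watching the load-address stream does not. */
static struct tnode *chase(struct tnode *n) {
    if (n->kind == 0)
        return n->left;  /* fixed-offset link */
    /* variable-offset link: a pointer stored link_off bytes into the node */
    return *(struct tnode **)((char *)n + n->link_off);
}
```

Here `chase()` is the software's traversal step; the prefetchers under discussion try to infer its effect from the address stream alone.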
Since each chased pointer starts back at the LSQ, the cost is no different
than an explicit Prefetch instruction, except without (a), (b) and (c)
having been applied first. (Latency cost is identical; instruction
issue/retire costs are lower.)
So I find the simplistic, blithe data-dependent auto prefetching
described above questionable. (K9 built a SW model of such a prefetcher.)
Now Note: If there were an instruction that loaded the value known to be
a pointer and prefetched based on the received pointer, then the prefetch
is architectural, not µArchitectural, and you are allowed to damage the
cache or TLB when/after the instruction retires.
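No mainstream ISA has that as a single opcode, but a software stand-in is to pair the load of the pointer with an explicit prefetch through it; this sketch uses the GCC/Clang `__builtin_prefetch` intrinsic and a made-up list type:

```c
#include <assert.h>
#include <stddef.h>

struct list { struct list *next; long payload; };

/* Two instructions instead of one, but the same architectural idea:
 * the pointer is loaded, then a prefetch is issued through it, so any
 * cache/TLB effects happen because the program asked for them. */
static long sum_with_prefetch(const struct list *p) {
    long s = 0;
    while (p) {
        if (p->next)
            __builtin_prefetch(p->next, /*rw=*/0, /*locality=*/3);
        s += p->payload;
        p = p->next;
    }
    return s;
}
```

Note that (a)-(c) above still apply: the program, not the hardware, computes which pointer to prefetch through.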
In the PIUMA case those pointers were to sparse data sets
so part of the problem was rolling over the cache, as well as
(and the PIUMA paper didn't mention this) the TLB.
After reading the PIUMA paper I had an idea for a small modification
to the PTE cache control bits to handle sparse data. The PTE's 3 CC bits
can specify that the upper page table levels are cached in the TLB but
the lower levels are not, because they would always roll over the TLB.
However, the non-TLB-cached PTEs may optionally still be cached
in L1 or L2, or not at all.
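One possible encoding of those 3 CC bits could look like the following sketch; the enum values and the bit position are my guesses for illustration, not the actual proposal:

```c
#include <assert.h>

/* Hypothetical encoding of a PTE's 3 cache-control (CC) bits: where,
 * if anywhere, PTEs from this level of the table may be cached. */
enum pte_cc {
    CC_TLB      = 0,  /* upper levels: cacheable in the TLB            */
    CC_L1       = 1,  /* not TLB-cached, but may be cached in L1       */
    CC_L2       = 2,  /* not TLB-cached, but may be cached in L2       */
    CC_UNCACHED = 3,  /* leaf PTEs over sparse data: never cached      */
};

/* Assume, for this sketch only, that CC occupies the PTE's low 3 bits. */
static unsigned long pte_set_cc(unsigned long pte, enum pte_cc cc) {
    return (pte & ~7ul) | (unsigned long)cc;
}
```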
This allows one to hold the top page table levels in the TLB,
the upper middle levels in L1, lower middle levels in L2,
and leaf PTEs and sparse code/data not cached at all.
(Given the 2-level TLBs currently in vogue, the first level might have
32-64 PTEs, while the second might have 2048 PTEs.)
BUT, as PIUMA proposes, we also allow the memory subsystem to read and write
individual aligned 8-byte values from DRAM, rather than whole cache lines,
so we only move the actual 8-byte values we need.
Also note that page table walks are also graph structure walks,
but chasing physical addresses at some simple calculated offsets.
These physical addresses might be cached in L1 or L2, so we can't
just chase these pointers in the memory controller but,
if one wants to do this, have to do so in the cache controller.
(Yes, this is why the K9 prefetcher was in the L2, where it had access.)
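The point can be made concrete with a software model of a radix-table walk, where each step is "add a few VA bits to a base, load, mask". The table shape below is roughly x86-64-like (4 levels, 9 index bits, 4 KB pages), and the fake physical memory is purely illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define LEVELS   4
#define IDX_BITS 9
#define IDX_MASK ((1u << IDX_BITS) - 1u)

/* Fake "physical memory": a few page-table pages, indexed by PA >> 12. */
static uint64_t mem[8][1u << IDX_BITS];
static uint64_t *table_at(uint64_t pa) { return mem[pa >> 12]; }

/* A page walk is a graph walk: at each level the next node's physical
 * address comes from a load at a simple calculated offset. */
static uint64_t walk(uint64_t root_pa, uint64_t va) {
    uint64_t pa = root_pa;
    for (int level = LEVELS - 1; level >= 0; level--) {
        unsigned idx = (unsigned)((va >> (12 + IDX_BITS * level)) & IDX_MASK);
        uint64_t pte = table_at(pa)[idx];
        if (!(pte & 1))           /* present bit clear: translation fault */
            return 0;
        pa = pte & ~0xFFFull;     /* physical base of next level, or page */
    }
    return pa | (va & 0xFFF);     /* final physical address */
}

/* Build one translation: VA 0 maps to a leaf page at PA 0x5000. */
static void map_va0(void) {
    mem[0][0] = 0x1000 | 1;
    mem[1][0] = 0x2000 | 1;
    mem[2][0] = 0x3000 | 1;
    mem[3][0] = 0x5000 | 1;
}
```

Each `table_at(pa)[idx]` load is exactly the kind of pointer chase being discussed, except the "pointers" are physical addresses, which is why the chase has to happen where cached copies are visible.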