Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
some programs the counters used for profiling the program result in
cache contention due to true or false sharing among threads.
The traditional software mitigation for that problem is to split the
counters into per-thread or per-core instances. But for heavily
multi-threaded programs running on machines with many cores, the
memory cost of this mitigation (one instance of every counter per
thread or core) is substantial.
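In software, that mitigation looks roughly like this (a minimal C
sketch; NTHREADS, the 64-byte line size, and the names are my
assumptions): each thread increments its own padded counter, so no
two threads ever write the same cache line, and only the (rare)
reader sums the instances.

```c
#include <assert.h>
#include <stdint.h>

enum { NTHREADS = 8, LINE = 64 };

/* One counter per thread, padded so each sits on its own cache line
   and increments by different threads never share a line. */
struct padded_counter {
    uint64_t count;
    char pad[LINE - sizeof(uint64_t)];
};

static _Alignas(LINE) struct padded_counter counters[NTHREADS];

static void count_event(int tid) { counters[tid].count++; }

/* Reads must sum all per-thread instances. */
static uint64_t counter_value(void) {
    uint64_t sum = 0;
    for (int i = 0; i < NTHREADS; i++)
        sum += counters[i].count;
    return sum;
}
```

The padding is exactly where the cost comes from: every counter costs
a full cache line per thread (or per core).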
So I thought up the idea of doing a similar thing in the caches (as a
hardware mechanism), and Rene Mueller indicated that he had been
thinking in the same direction, but his hardware people were not
interested.
In any case, here's how this might work. First, you need an
add-to-memory instruction that does not need to know anything about
the result (so the existing AMD64 add-to-memory instructions are not
enough, because they produce a flags result). Have cache consistency states
"accumulator64", "accumulator32", "accumulator16", "accumulator8" (in
addition to MOESI), which indicate that the cache line contains
naturally aligned 64-bit, 32-bit, 16-bit, or 8-bit counters
respectively. Not all of these states need to be supported. You also
add a state shared-plus-accumulators (SPA).
The architectural value of the data in all these cache lines is the
sum of the SPA line (or main memory) and all the accumulator lines.
An n-bit add-to-memory adds its increment to an accumulator-n line. If
there is no such line, one is allocated and initialized to zero,
and the add then stores the increment in the corresponding part. When
allocating the accumulator line, an existing line may be forced to
switch to SPA, and/or may be moved outwards in the cache hierarchy. But
if the add applies to some local line in exclusive or modified state,
it's probably better to just update that line without any accumulator
stuff.
If there is a read access to such memory, all the accumulator lines
are summed up and added to the SPA line (or the main-memory contents);
this is relatively expensive, so this whole scheme makes most sense if
the programmer can arrange for many additions relative to reads or
writes. The SPA line remains shared, so the read leaves its contents
and the contents of the accumulator lines unchanged.
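The add and read semantics can be modeled in software (a toy
single-word C model; the names and NCORES are my assumptions, and real
hardware would of course operate on whole cache lines). Allocation
and zero-initialization of accumulator lines is implicit in the
zero-initialized array:

```c
#include <assert.h>
#include <stdint.h>

enum { NCORES = 4 };

static uint64_t spa;            /* SPA line (or main memory) */
static uint64_t acc[NCORES];    /* per-core accumulator lines */

/* add-to-memory on core c: touches only that core's accumulator,
   so no line bounces between cores */
static void add_to_memory(int c, uint64_t inc) { acc[c] += inc; }

/* read: the architectural value is the sum of the SPA copy and all
   accumulators; the read leaves all of them unchanged */
static uint64_t arch_read(void) {
    uint64_t v = spa;
    for (int c = 0; c < NCORES; c++)
        v += acc[c];
    return v;
}
```

Note that only the read walks all the copies; the common-case add
stays entirely core-local.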
For writes, various options are possible; the catchall would be to add
all the accumulator lines for that address to one of the SPA lines of
that memory, overwrite the written bytes there, broadcast the new line
contents to the other SPA lines or invalidate them, and zero or
invalidate all the accumulator lines. Another option is to write the
value to one SPA copy (and invalidate the other SPA lines) and zero
the corresponding bytes in the accumulator lines; this only works if
there are no accumulators wider than the write.
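The catchall write can be sketched on a toy multi-word model (names,
NCORES, and NWORDS are my assumptions): folding the accumulators into
the SPA copy first is what keeps the other counters in the same line
at their correct architectural values when only part of the line is
overwritten.

```c
#include <assert.h>
#include <stdint.h>

enum { NCORES = 4, NWORDS = 8 };  /* one line = NWORDS counters */

static uint64_t spa[NWORDS];           /* SPA copy of the line */
static uint64_t acc[NCORES][NWORDS];   /* per-core accumulator lines */

static void add_to_memory(int c, int w, uint64_t inc) { acc[c][w] += inc; }

static uint64_t arch_read(int w) {
    uint64_t v = spa[w];
    for (int c = 0; c < NCORES; c++)
        v += acc[c][w];
    return v;
}

/* catchall write to word w: fold all accumulator lines into the SPA
   line (preserving the other words' architectural values), overwrite
   the written word, and zero the accumulator lines */
static void arch_write(int w, uint64_t val) {
    for (int c = 0; c < NCORES; c++)
        for (int i = 0; i < NWORDS; i++) {
            spa[i] += acc[c][i];
            acc[c][i] = 0;
        }
    spa[w] = val;
}
```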
You will typically support the accumulator states only at L1 and maybe
L2; if an accumulator line gets cool enough to be evicted from those
levels, it can be added to the SPA line or to main memory.
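In the same kind of toy model (names and NCORES again my assumptions),
eviction is just a fold of one accumulator copy into the SPA copy; the
architectural value is unchanged and the other accumulators are
untouched.

```c
#include <assert.h>
#include <stdint.h>

enum { NCORES = 4 };

static uint64_t spa;            /* SPA line or main memory */
static uint64_t acc[NCORES];    /* per-core accumulator lines */

static void add_to_memory(int c, uint64_t inc) { acc[c] += inc; }

static uint64_t arch_read(void) {
    uint64_t v = spa;
    for (int c = 0; c < NCORES; c++)
        v += acc[c];
    return v;
}

/* evict core c's cool accumulator line: add it to the SPA copy (or
   main memory) and drop it; arch_read() gives the same value after */
static void evict_accumulator(int c) {
    spa += acc[c];
    acc[c] = 0;
}
```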
How do we enter these states? Given that every common architecture
would need a new instruction for this (the flag-free add-to-memory),
executing that instruction on a shared cache line, or on a line that
another core holds in modified or exclusive state, would be a hint
that using these states is a good idea.
This is all a rather elaborate mechanism. Are counters in
multi-threaded programs used enough (and read rarely enough) to
justify the cost of implementing it? For the HotSpot application, the
eventual answer was that they live with the cost of cache contention
for the programs that have that problem. After some minutes the hot
parts of the program are optimized, and cache contention is no longer
a problem. But maybe there are other applications that do more
long-term accumulation and would benefit.
- anton
-- 'Anyone trying for "industrial quality" ISA should avoid undefined behavior.' Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>