Subject: Re: Microarchitectural support for counting
From: anton (at) *nospam* mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Date: 26. Dec 2024, 15:56:30
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Dec26.155630@mips.complang.tuwien.ac.at>
References: 1 2
User-Agent: xrn 10.11

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
> On 10/3/2024 7:00 AM, Anton Ertl wrote:
>> Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
>> in the HotSpot Virtual Machine" at MPLR 2024. He reported that for
>> some programs the counters used for profiling the program result in
>> cache contention due to true or false sharing among threads.
>>
>> The traditional software mitigation for that problem is to split the
>> counters into per-thread or per-core instances. But for heavily
>> multi-threaded programs running on machines with many cores the cost
>> of this mitigation is substantial.
>> ...
>> For the HotSpot application, the
>> eventual answer was that they live with the cost of cache contention
>> for the programs that have that problem. After some minutes the hot
>> parts of the program are optimized, and cache contention is no longer
>> a problem.
>> ...
>
> If the per-thread counters are properly padded to an L2 cache line and
> properly aligned on cache line boundaries, well, they should not cause
> false sharing with other cache lines... Right?

Sure, that's what the first sentence of the second paragraph you cited
(and which I cited again) is about. Next, read the next sentence.

Maybe I should give an example (fully made up on the spot, read the
paper for real numbers): If HotSpot uses, on average, one counter per
conditional branch, and assuming a conditional branch every 10 static
instructions (each having, say, 4 bytes), with 1MB of generated code
and 8 bytes per counter, that's 200KB of counters. But these counters
are shared between all threads, so for code running on many cores you
get true and false sharing.
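
As a quick check of the made-up estimate above, the arithmetic can be
sketched in a few lines of Python. The function name and parameters are
illustrative only (nothing here comes from HotSpot or the paper), and
decimal units (1MB = 1,000,000 bytes) are assumed so that the result
matches the round 200KB figure.

```python
def shared_counter_bytes(code_bytes, bytes_per_insn, insns_per_branch,
                         bytes_per_counter):
    """Counter storage, assuming one counter per conditional branch."""
    static_insns = code_bytes // bytes_per_insn        # 250,000 instructions
    cond_branches = static_insns // insns_per_branch   # 25,000 branches
    return cond_branches * bytes_per_counter


# Made-up inputs from the example: 1MB of code, 4-byte instructions,
# one conditional branch per 10 static instructions, 8-byte counters.
counter_bytes = shared_counter_bytes(1_000_000, 4, 10, 8)
print(counter_bytes / 1000, "KB")   # 200.0 KB
```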

As mentioned, the usual mitigation is per-core counters. With a
256-core machine, we now have 51.2MB of counters (256 * 200KB) for 1MB
of executable code. Now this is Java, so there might be quite a bit
more executable code and correspondingly more counters. They
eventually decided that the benefit of reduced cache coherence traffic
is not worth that cost (or the cost of a hardware mechanism), as
described in the last paragraph, from which I cited the important
parts.

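
The per-core figure follows from multiplying the shared footprint by
the core count. Below is a minimal Python sketch, assuming the 200KB
shared estimate and 1MB of code from the example and decimal units;
the function and constant names are illustrative, not taken from
HotSpot.

```python
def per_core_counter_bytes(shared_counter_bytes, cores):
    """Total counter storage when every core gets a private copy."""
    return shared_counter_bytes * cores


CODE_BYTES = 1_000_000    # 1MB of generated code (made-up example value)
SHARED_BYTES = 200_000    # 200KB of shared counters (made-up example value)

total = per_core_counter_bytes(SHARED_BYTES, cores=256)
print(total / 1_000_000, "MB of counters")               # 51.2 MB of counters
print(total / CODE_BYTES, "counter bytes per code byte")
```

Under these assumptions the counters outweigh the code they profile by
a factor of about 51, which is the cost weighed against the reduced
cache coherence traffic.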
- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>