Path: ...!weretis.net!feeder9.news.weretis.net!news.quux.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Microarchitectural support for counting
Date: Thu, 26 Dec 2024 14:56:30 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 45
Message-ID: <2024Dec26.155630@mips.complang.tuwien.ac.at>
References: <2024Oct3.160055@mips.complang.tuwien.ac.at> <vkjan3$2u92h$1@dont-email.me>
Injection-Date: Thu, 26 Dec 2024 16:17:33 +0100 (CET)
Injection-Info: dont-email.me; posting-host="fe94df56a285672d25a16cc71e6c80f2";
	logging-data="3188085"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18N/jDSVB1C5TJD1+UdVyAZ"
Cancel-Lock: sha1:l5ToZbaja6IB+iQAbBkJt5l7QEY=
X-newsreader: xrn 10.11
Bytes: 3304

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>On 10/3/2024 7:00 AM, Anton Ertl wrote:
>> Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
>> in the HotSpot Virtual Machine" at MPLR 2024.  He reported that for
>> some programs the counters used for profiling the program result in
>> cache contention due to true or false sharing among threads.
>> 
>> The traditional software mitigation for that problem is to split the
>> counters into per-thread or per-core instances.  But for heavily
>> multi-threaded programs running on machines with many cores the cost
>> of this mitigation is substantial.
....
>> For the HotSpot application, the
>> eventual answer was that they live with the cost of cache contention
>> for the programs that have that problem.  After some minutes the hot
>> parts of the program are optimized, and cache contention is no longer
>> a problem.
....
>If the per-thread counters are properly padded to an L2 cache line and 
>properly aligned on cache line boundaries, well, they should not cause 
>false sharing with other cache lines... Right?

Sure, that's what the first sentence of the second paragraph you cited
(and which I cited again) is about.  Now read the sentence after it.

Maybe I should give an example (fully made up on the spot, read the
paper for real numbers): If HotSpot uses, on average, one counter per
conditional branch, and assuming a conditional branch every 10 static
instructions (each having, say, 4 bytes), with 1MB of generated code
and 8 bytes per counter, that's 200KB of counters.  But these counters
are shared among all threads, so for code running on many cores you
get true and false sharing.
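
The arithmetic behind the 200KB figure can be sketched as follows.
This is only a back-of-envelope check using the made-up numbers above
and decimal units (1MB = 10^6 bytes); the constant and function names
are illustrative and not taken from HotSpot or the paper.

```python
# Back-of-envelope check of the made-up example (decimal units).
CODE_BYTES = 1_000_000     # 1MB of generated code
BYTES_PER_INSN = 4         # assumed size of one instruction
INSNS_PER_BRANCH = 10      # one conditional branch per 10 static instructions
BYTES_PER_COUNTER = 8      # one counter per conditional branch


def shared_counter_bytes(code_bytes=CODE_BYTES):
    """Bytes of profiling counters when all threads share one set."""
    insns = code_bytes // BYTES_PER_INSN       # 250,000 instructions
    branches = insns // INSNS_PER_BRANCH       # 25,000 conditional branches
    return branches * BYTES_PER_COUNTER        # 8 bytes each


print(shared_counter_bytes())  # 200000, i.e. 200KB
```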

As mentioned, the usual mitigation is per-core counters.  With a
256-core machine, we now have 51.2MB of counters for 1MB of executable
code.  Now this is Java, so there might be quite a bit more executable
code and correspondingly more counters.  They eventually decided that
the benefit of reduced cache coherence traffic is not worth that cost
(or the cost of a hardware mechanism), as described in the last
paragraph, from which I cited the important parts.
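
To make the scaling explicit, here is a small sketch of the per-core
figure.  It assumes the same made-up numbers (200KB of counters for
1MB of code, 256 cores, decimal units) and that each core simply gets
its own unpadded copy of the counters; the names are illustrative only.

```python
# Per-core replication of the counters from the made-up example.
CODE_BYTES = 1_000_000            # 1MB of generated code
SHARED_COUNTER_BYTES = 200_000    # 200KB of counters in the shared case
CORES = 256                       # assumed machine size


def per_core_counter_bytes(shared_bytes, cores):
    """Total counter memory when every core gets its own copy."""
    return shared_bytes * cores


total = per_core_counter_bytes(SHARED_COUNTER_BYTES, CORES)
print(total)                # 51200000, i.e. 51.2MB
print(total / CODE_BYTES)   # 51.2 bytes of counters per byte of code
```

The cost grows linearly with the core count, so the counters end up
far larger than the code they profile.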

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>