From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Microarchitectural support for counting
Date: Thu, 26 Dec 2024 14:56:30 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Dec26.155630@mips.complang.tuwien.ac.at>
References: <2024Oct3.160055@mips.complang.tuwien.ac.at> <vkjan3$2u92h$1@dont-email.me>

"Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>On 10/3/2024 7:00 AM, Anton Ertl wrote:
>> Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
>> in the HotSpot Virtual Machine" at MPLR 2024.  He reported that for
>> some programs the counters used for profiling the program result in
>> cache contention due to true or false sharing among threads.
>>
>> The traditional software mitigation for that problem is to split the
>> counters into per-thread or per-core instances.  But for heavily
>> multi-threaded programs running on machines with many cores the cost
>> of this mitigation is substantial.
....
>> For the HotSpot application, the
>> eventual answer was that they live with the cost of cache contention
>> for the programs that have that problem.  After some minutes the hot
>> parts of the program are optimized, and cache contention is no longer
>> a problem.
....
>If the per-thread counters are properly padded to an L2 cache line and
>properly aligned on cache line boundaries, well, they should not cause
>false sharing with other cache lines... Right?

Sure, that's what the first sentence of the second paragraph you cited
(and which I cited again) is about.  Next, read the next sentence.

Maybe I should give an example (fully made up on the spot, read the
paper for real numbers):

If HotSpot uses, on average, one counter per conditional branch, and
assuming a conditional branch every 10 static instructions (each
taking, say, 4 bytes), then 1MB of generated code contains about
25,000 conditional branches; at 8 bytes per counter, that's 200KB of
counters.  But these counters are shared between all threads, so for
code running on many cores you get true and false sharing.

As mentioned, the usual mitigation is per-core counters.  With a
256-core machine, we now have 256 x 200KB = 51.2MB of counters for 1MB
of executable code.  Now this is Java, so there might be quite a bit
more executable code and correspondingly more counters.

They eventually decided that the benefit of reduced cache coherence
traffic is not worth that cost (or the cost of a hardware mechanism),
as described in the last paragraph, from which I cited the important
parts.

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
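
The back-of-the-envelope numbers in the example can be checked with a few lines of Python. This is only a sketch using the made-up parameters from the post, not figures from the paper; the function name `counter_bytes` and the decimal convention (1MB = 1,000,000 bytes) are assumptions chosen so the result matches the 200KB and 51.2MB quoted above.

```python
def counter_bytes(code_bytes, bytes_per_insn=4, insns_per_branch=10,
                  bytes_per_counter=8):
    """Bytes of profiling counters for one shared set of counters.

    Assumes one counter per conditional branch and one conditional
    branch every `insns_per_branch` static instructions (hypothetical
    parameters taken from the made-up example in the post).
    """
    branches = code_bytes // (bytes_per_insn * insns_per_branch)
    return branches * bytes_per_counter


CODE_BYTES = 1_000_000   # 1MB of generated code (decimal MB, an assumption)
CORES = 256              # core count used in the post's example

shared = counter_bytes(CODE_BYTES)   # one set of counters shared by all threads
per_core = shared * CORES            # one private set of counters per core

print(f"shared counters:   {shared / 1e3:.1f} KB")
print(f"per-core counters: {per_core / 1e6:.1f} MB "
      f"({per_core // CODE_BYTES}x the size of the code)")
```

Varying `CORES` or `CODE_BYTES` shows how the per-core mitigation scales linearly in both the core count and the amount of generated code, which is the cost the post is pointing at.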