Path: ...!weretis.net!feeder9.news.weretis.net!news.quux.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jseigh <jseigh_es00@xemaps.com>
Newsgroups: comp.arch
Subject: Re: Microarchitectural support for counting
Date: Sat, 28 Dec 2024 07:20:17 -0500
Organization: A noiseless patient Spider
Lines: 41
Message-ID: <vkoqe1$aepr$1@dont-email.me>
References: <2024Oct3.160055@mips.complang.tuwien.ac.at>
 <vkmjtf$3mf98$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 28 Dec 2024 13:20:17 +0100 (CET)
Injection-Info: dont-email.me; posting-host="de1fb852e40e1b01f12326cb43be8f77";
	logging-data="342843"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX191TcDeyWSPoDQ2EwFQTpRg"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:yXunMaP//72dbC+emxSys2j7+tI=
In-Reply-To: <vkmjtf$3mf98$1@dont-email.me>
Content-Language: en-US
Bytes: 2782

On 12/27/24 11:16, jseigh wrote:
> On 10/3/24 10:00, Anton Ertl wrote:
>> Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
>> in the HotSpot Virtual Machine" at MPLR 2024.  He reported that for
>> some programs the counters used for profiling the program result in
>> cache contention due to true or false sharing among threads.
>>
>> The traditional software mitigation for that problem is to split the
>> counters into per-thread or per-core instances.  But for heavily
>> multi-threaded programs running on machines with many cores the cost
>> of this mitigation is substantial.
>>
> 
> For profiling, do we really need accurate counters?  They just need to
> be statistically accurate I would think.
> 
> Instead of incrementing a counter, just store a non-zero immediate into
> a zero initialized byte array at a per "counter" index.   There's no
> rmw data dependency, just a store, so it should have little impact on
> the pipeline.
> 
> A profiling thread loops thru the byte array, incrementing an actual
> counter when it sees a non-zero byte, and resets the byte to zero.  You
> could use vector ops to process the array.
> 
> If the stores were fast enough, you could do 2 or more stores at
> hashed indices, different hash for each store. Sort of a counting
> Bloom filter.  The effective count would be the minimum of the
> hashed counts.
> 
> No idea how feasible this would be though.
> 

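Probably not feasible.  The polling frequency wouldn't be high enough:
each byte can record at most one hit per polling pass, so a hot counter
would be badly undercounted.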
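
To make the polling problem concrete, here is a rough single-threaded
simulation of the flag-byte scheme from the quoted text.  It is only a
sketch: the names (flags, counts, record, poll) and the hit rates are
made up for illustration, and real code would have the stores and the
poller running on different threads.

```python
# Single-threaded sketch of the flag-byte scheme (all names hypothetical).
flags = bytearray(4)   # zero-initialized, one byte per "counter" index
counts = [0] * 4       # the actual counters, touched only by the poller


def record(idx):
    """Profiled code path: a plain store, no read-modify-write."""
    flags[idx] = 1


def poll():
    """Profiling thread: count each non-zero byte once, then reset it."""
    for i, flag in enumerate(flags):
        if flag:
            counts[i] += 1
            flags[i] = 0


# Assumed workload: between two polls, site 0 fires 1000 times
# and site 1 fires once.
for _ in range(10):
    for _ in range(1000):
        record(0)
    record(1)
    poll()

print(counts)  # [10, 10, 0, 0]
```

Both sites end up with the same count even though one is 1000 times
hotter, because a byte saturates after the first store in a polling
interval.  The counts only stay proportional if the poller runs faster
than the hottest site fires.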


If the problem is the number of counters, then counting Bloom filters
might be worth looking into, assuming the overhead of incrementing
the counts isn't a problem.
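
For reference, a minimal sketch of the counting Bloom filter idea with
the minimum-of-hashed-counts estimate mentioned in the quoted text.
The class name, table size, and the use of SHA-256 as the hash are
assumptions for illustration only; a profiler would use something far
cheaper than a cryptographic hash.

```python
import hashlib


class CountingBloom:
    """Shared table of counts indexed by several hashes of the key."""

    def __init__(self, slots=64, hashes=2):
        self.slots = slots
        self.hashes = hashes
        self.table = [0] * slots

    def _indices(self, key):
        # One index per hash function; the seed prefix stands in for
        # "a different hash for each store".
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.slots

    def add(self, key):
        for i in self._indices(key):
            self.table[i] += 1

    def estimate(self, key):
        # Collisions can only inflate a slot, so the minimum over the
        # key's slots is the tightest available upper bound.
        return min(self.table[i] for i in self._indices(key))


cb = CountingBloom()
for _ in range(5):
    cb.add("site_a")
for _ in range(2):
    cb.add("site_b")

print(cb.estimate("site_a"), cb.estimate("site_b"))
```

The estimate never undercounts; it overcounts only when every slot for
a key is shared with other keys.  Note that each add is still a
read-modify-write on shared memory, which is the overhead caveat above.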

Joe Seigh