Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: jseigh <jseigh_es00@xemaps.com>
Newsgroups: comp.arch
Subject: Re: Microarchitectural support for counting
Date: Fri, 27 Dec 2024 11:16:47 -0500
Organization: A noiseless patient Spider
Lines: 36
Message-ID: <vkmjtf$3mf98$1@dont-email.me>
References: <2024Oct3.160055@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 27 Dec 2024 17:16:48 +0100 (CET)
Injection-Info: dont-email.me; posting-host="580d0988e850bbad86caf69699f1ac25";
	logging-data="3882280"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/y+FbCqQIgBB0ZXQirVm/F"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:/R1WyvdTznA/vbhPCLy4wOZMSG4=
In-Reply-To: <2024Oct3.160055@mips.complang.tuwien.ac.at>
Content-Language: en-US
Bytes: 2383

On 10/3/24 10:00, Anton Ertl wrote:
> Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
> in the HotSpot Virtual Machine" at MPLR 2024.  He reported that for
> some programs the counters used for profiling the program result in
> cache contention due to true or false sharing among threads.
> 
> The traditional software mitigation for that problem is to split the
> counters into per-thread or per-core instances.  But for heavily
> multi-threaded programs running on machines with many cores the cost
> of this mitigation is substantial.
> 

For profiling, do we really need exact counters?  They just need to
be statistically accurate, I would think.

Instead of incrementing a counter, just store a non-zero immediate into
a zero-initialized byte array at a per-"counter" index.  There's no
RMW data dependency, just a store, so it should have little impact on
the pipeline.
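
A minimal sketch of the store side in C, to make the idea concrete.  The
names NUM_SITES, hit_flags and mark_site are made up for illustration, and
the relaxed atomic store is an assumption on my part: it is there so the
concurrent access is well defined in C11, and on common targets it should
compile to an ordinary byte store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SITES 1024  /* hypothetical number of profiled sites */

/* Zero-initialized (static storage): one flag byte per "counter" index. */
static _Atomic uint8_t hit_flags[NUM_SITES];

/* Instrumentation stub: a single store, no load and no read-modify-write. */
static inline void mark_site(size_t site)
{
    assert(site < NUM_SITES);
    atomic_store_explicit(&hit_flags[site], 1, memory_order_relaxed);
}
```

Marking the same site many times between two scans leaves the byte at 1,
which is where the "statistical" part comes from.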

A profiling thread loops through the byte array, incrementing an actual
counter when it sees a non-zero byte, and resets the byte to zero.  You
could use vector ops to process the array.
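
The scanning side could look roughly like this.  It is a scalar sketch
only, with hypothetical names (hit_flags, sample_counts, sweep_once); a
vectorized version would test many bytes per load.  The resulting count
is the number of sweeps in which a site was seen, not the number of
executions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SITES 1024  /* hypothetical number of profiled sites */

static _Atomic uint8_t hit_flags[NUM_SITES];   /* written by program threads */
static uint64_t sample_counts[NUM_SITES];      /* touched only by the profiler */

/* Instrumentation stub used by the profiled code: one plain store. */
static inline void mark_site(size_t site)
{
    assert(site < NUM_SITES);
    atomic_store_explicit(&hit_flags[site], 1, memory_order_relaxed);
}

/* One pass of the profiling thread.  Returns how many flags were set. */
static size_t sweep_once(void)
{
    size_t seen = 0;
    for (size_t i = 0; i < NUM_SITES; i++) {
        /* Load first so that untouched bytes are never written. */
        if (atomic_load_explicit(&hit_flags[i], memory_order_relaxed) != 0) {
            /* A store landing between the load and this reset is lost,
               which is acceptable for a statistical profile. */
            atomic_store_explicit(&hit_flags[i], 0, memory_order_relaxed);
            sample_counts[i]++;
            seen++;
        }
    }
    return seen;
}
```

The sweep interval sets the resolution: a site hit a million times between
two sweeps still adds only one to its counter.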

If the stores were fast enough, you could do 2 or more stores at
hashed indices, with a different hash for each store.  Sort of a
counting Bloom filter.  The effective count would be the minimum of the
hashed counts.
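
A rough sketch of the hashed variant, assuming two rows with one
multiplicative hash each.  The row layout, the table size, the
multiplier constants and all the names here are arbitrary choices for
illustration, not something I have measured.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define ROWS       2                    /* one row per hash function */
#define LOG2_SLOTS 8
#define SLOTS      (1u << LOG2_SLOTS)

static_assert(LOG2_SLOTS > 0 && LOG2_SLOTS < 32, "shift must stay in range");

/* Arbitrary odd multipliers, one per row (illustrative choice). */
static const uint32_t row_mult[ROWS] = { 2654435761u, 2246822519u };

static _Atomic uint8_t flags[ROWS][SLOTS];  /* written by program threads */
static uint64_t counts[ROWS][SLOTS];        /* touched only by the profiler */

/* Multiplicative hash: keep the top LOG2_SLOTS bits of the product. */
static inline uint32_t slot_for(unsigned row, uint32_t site)
{
    return (uint32_t)(site * row_mult[row]) >> (32 - LOG2_SLOTS);
}

/* Instrumentation: one plain store per row, each at a different hash. */
static inline void mark_site(uint32_t site)
{
    for (unsigned r = 0; r < ROWS; r++)
        atomic_store_explicit(&flags[r][slot_for(r, site)], 1,
                              memory_order_relaxed);
}

/* Profiling thread: fold set flags into the per-slot counts. */
static void sweep_once(void)
{
    for (unsigned r = 0; r < ROWS; r++)
        for (size_t s = 0; s < SLOTS; s++)
            if (atomic_load_explicit(&flags[r][s], memory_order_relaxed)) {
                atomic_store_explicit(&flags[r][s], 0, memory_order_relaxed);
                counts[r][s]++;
            }
}

/* Effective count for a site: the minimum over its hashed slots. */
static uint64_t estimate(uint32_t site)
{
    uint64_t best = UINT64_MAX;
    for (unsigned r = 0; r < ROWS; r++) {
        uint64_t c = counts[r][slot_for(r, site)];
        if (c < best)
            best = c;
    }
    return best;
}
```

Collisions can only push a slot's count up, so taking the minimum picks
the slot least polluted by other sites.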

No idea how feasible this would be though.

Joe Seigh