Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Microarchitectural support for counting
Date: Thu, 03 Oct 2024 14:00:55 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 75
Message-ID: <2024Oct3.160055@mips.complang.tuwien.ac.at>
Injection-Date: Thu, 03 Oct 2024 17:37:07 +0200 (CEST)
X-newsreader: xrn 10.11

Two weeks ago Rene Mueller presented the paper "The Cost of Profiling
in the HotSpot Virtual Machine" at MPLR 2024.  He reported that for
some programs the counters used for profiling the program result in
cache contention due to true or false sharing among threads.

The traditional software mitigation for this problem is to split the
counters into per-thread or per-core instances.  But for heavily
multi-threaded programs running on machines with many cores, the cost
of this mitigation is substantial.

So I thought up the idea of doing a similar thing in the caches (as a
hardware mechanism), and Rene Mueller indicated that he had been
thinking in the same direction, but his hardware people were not
interested.  In any case, here's how this might work.

First, you need an add-to-memory instruction that does not need to
know anything about the result (so the existing AMD64 instruction is
not enough, because it produces a flags result).

Have cache-consistency states "accumulator64", "accumulator32",
"accumulator16", and "accumulator8" (in addition to MOESI), which
indicate that the cache line contains naturally aligned 64-bit,
32-bit, 16-bit, or 8-bit counters, respectively.  Not all of these
states need to be supported.
You also add a state shared-plus-accumulators (SPA).  The
architectural value of the values in all these cache lines is the sum
of the SPA line (or main memory) plus all the accumulator lines.

An n-bit add-to-memory adds its increment to an accumulator-n line.
If there is no such line, one is allocated, initialized to zero, and
the add then stores the increment in the corresponding part.  When
allocating the accumulator line, an existing line may be forced to
switch to SPA, and/or may be moved outwards in the cache hierarchy.
But if the add applies to a local line in exclusive or modified state,
it's probably better to just update that line without any accumulator
machinery.

If there is a read access to such memory, all the accumulator lines
are summed up and added to the SPA line (or main memory); this is
relatively expensive, so this whole scheme makes most sense if the
programmer can arrange for many additions per read or write.  The SPA
line is shared, so we keep its contents and the contents of the
accumulator lines unchanged.

For writes, various options are possible.  The catch-all is to add all
the accumulator lines for that address to one of the SPA lines of that
memory, overwrite the memory there, broadcast the new line contents to
the other SPA lines or invalidate them, and zero or invalidate all the
accumulator lines.  Another option is to write the value to one SPA
copy (invalidating the other SPA lines) and zero the corresponding
bytes in the accumulator lines; this only works if there are no
accumulators wider than the write.

You will typically support the accumulator states only at L1 and maybe
L2; if an accumulator line gets cool enough to be evicted from there,
it can be added to the SPA line or to main memory.

How do we enter these states?
Given that every common architecture needs new instructions for using
this feature, the use of these instructions on a shared cache line, or
on a line that is modified or exclusive in another core, would be a
hint that entering these states is a good idea.

This is all a rather elaborate mechanism.  Are counters in
multi-threaded programs used enough (and read rarely enough) to
justify the cost of implementing it?  For the HotSpot application, the
eventual answer was that they live with the cost of cache contention
for the programs that have that problem: after some minutes the hot
parts of the program are optimized, and cache contention is no longer
a problem.  But maybe there are other applications that do more
long-term accumulating and would benefit.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined
behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>