From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Arguments for a sane ISA 6-years later
Date: Fri, 26 Jul 2024 15:46:00 -0500
Organization: A noiseless patient Spider
Message-ID: <v811ub$309dk$1@dont-email.me>
References: <b5d4a172469485e9799de44f5f120c73@www.novabbs.org>
 <v7ubd4$2e8dr$1@dont-email.me> <v7uc71$2ec3f$1@dont-email.me>
 <2024Jul26.190007@mips.complang.tuwien.ac.at>
In-Reply-To: <2024Jul26.190007@mips.complang.tuwien.ac.at>

On 7/26/2024 12:00 PM, Anton Ertl wrote:
> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>> On 7/25/2024 1:09 PM, BGB wrote:
>>> At least with a weak model, software knows that if it doesn't go
>>> through the rituals, the memory will be stale.
>
> There is no guarantee of staleness, only a lack of stronger ordering
> guarantees.
>

Probably true enough.

The idea here was more about needing to go through the rituals to get
not-stale data, rather than about any ability to ensure that data is
stale (not usually a thing).

Granted, if one had deterministically known-stale memory, it is
probable that software would find some way to become dependent on it.

>> The weak model is ideal for me. I know how to program for it
>
> And the fact that this model is so hard to use that few others know
> how to program for it makes it ideal for you.
It is not "that" bad, but one needs to know when to evict stuff from
the caches.

In my case, mostly for sake of sanity, ended up switching mostly to
modulo indexing for everything, as while hashed indexing is generally
more efficient (due to being less negatively affected by the tendency
to align things), it is less well behaved. And, for example, hashed
indexing in the L1 cache effectively breaks the ability to double-map
pages and have changes to one mapping be visible in the others.

Though, it does lead to a questionable practice (when using
direct-mapped caches) of "knocking" or "sniping" cache lines: one can
calculate another address based on the first address, where accessing
this other address will knock the first out of the caches.

If one uses modulo for both, then adding 512K or 1MB to an address,
and accessing this location, is enough to knock a line out of both the
L1 and L2 caches (with the MMU disabled).

Have added an experimental set-associative cache between the L1 and L2
caches, but initially had to figure out ways to work around its
natural tendency to break the ability to use these access patterns
(essentially adding some more rules to the mix).

Though, one possible option (in software) is simply to try to knock
more addresses in an attempt to defeat the cache. Say, try to knock
something out of cache with the set-associative thing:

  MOV.Q (R4, 0x20000), R18   //knock out of L1
  MOV.Q (R4, 0x40000), R16
  MOV.Q (R4, 0x50000), R17
  MOV.Q (R4, 0x60000), R18
  MOV.Q (R4, 0x70000), R19   //at this point, it is knocked out to L2
  MOV.Q (R4, 0x100000), R16  //knock out of L2

But... this doesn't work if the area at R4 is a virtual memory page
(it only works with physical addresses or direct mapping).

Though, better is to use an "INVDC" instruction followed by a load:

  INVDC R4
  MOV.Q (R4), R5   //causes line to be evicted

Which signals to the cache hierarchy the intention to flush this
address (and also stops the victim-cache module from trying to cache
it).
Though, a subsequent load might see whatever was in the cache at the
time R5 was loaded; if this is undesirable:

  INVDC R4
  MOV.Q (R4), R5   //causes dirty line to be evicted
  INVDC R4         //L1 cache discards non-dirty line
  //next attempt to access the address at R4 pulls it from the L2 cache

Currently, there is no way to express the desire to invalidate a
specific cache line from the L2 cache (currently L2 invalidation only
applies to the entire L2 cache). At present, L2 invalidation isn't
generally needed apart from the RAM checker and RAM counting (all of
the other hardware in my current "SoC" exists on the inside edge of
the L2 cache).

Thinking about it, did just go and add intrinsics for these:

  void __mem_invdc(void *ptr);
  void __mem_invic(void *ptr);

Partly as a possible way to reduce the temptation to use
cache-knocking.

There is the more naive option of picking a random unrelated chunk of
memory and then doing a series of sequential memory loads across the
entire memory chunk, but this is less efficient (though generally this
makes more sense when the intention is to flush the entire cache).

There is currently also the option of using "no cache" addresses,
which also invoke special behavior in the caches; but these require
being in supervisor mode or similar (and currently only exist for
physical addresses). Or, setting a virtual memory page to no-cache,
in which case every access is non-caching.

There is a possible need for "Volatile Load" and "Volatile Store"
instructions, but these still don't exist.

>> and it's more efficient
>
> That depends on the hardware.
>
> Yes, the Alpha 21164 with its imprecise exceptions was "more
> efficient" than other hardware for a while, then the Pentium Pro
> came along and gave us precise exceptions and more efficiency. And
> eventually the Alpha people learned the trick, too, and the 21264
> provided precise exceptions (although they did not admit this) and
> more efficiency.
>
> Similarly, I expect that hardware that is designed for good TSO or
> sequential-consistency performance will run code written for this
> model faster than code written for weakly consistent hardware will
> run on that hardware. That's because software written for weakly
> consistent hardware often has to insert barriers or atomic
> operations just in case, and these operations are slow on hardware
> optimized for weak consistency.
>

TSO requires significantly more hardware complexity, though.

Seems like it would be harder to debug the hardware, since:
  There is more that has to go on in the hardware for TSO to work;
  Software will have higher expectations that it actually works.

Though, if it did work, one could potentially use "stronger" caching
in some areas, since the caching would not interfere with the ability
of software to maintain memory consistency.

> By contrast, one can design hardware for strong ordering such that
> the slowness occurs only in those cases when actual (not potential)
> communication between the cores happens, i.e., much less frequently.
>
>> and sometimes use cases do not care if they encounter "stale" data.
>
> Great. Unless these "sometimes" cases are more common than the cases
> where you perform some atomic operation or barrier because of
> potential, but not actual, communication between cores, the weak
> model is still slower than a well-implemented strong model.
>

Atomic operations are still a "needs work" area.

Barrier: no proper barrier instruction exists as of yet in BJX2, but
there are some (weaker) cache-invalidation instructions.

There is the RISC-V FENCE.I instruction, but as-is this gets turned
into an exception (which seems to be allowed based on what I read).
The exception handler will then generally proceed to flush the entire
cache (so it is not a high-performance option).

So, in BJX2, there is:

========== REMAINDER OF ARTICLE TRUNCATED ==========