From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Arguments for a sane ISA 6-years later
Date: Fri, 26 Jul 2024 15:46:00 -0500
Organization: A noiseless patient Spider
Message-ID: <v811ub$309dk$1@dont-email.me>
References: <b5d4a172469485e9799de44f5f120c73@www.novabbs.org>
 <v7ubd4$2e8dr$1@dont-email.me> <v7uc71$2ec3f$1@dont-email.me>
 <2024Jul26.190007@mips.complang.tuwien.ac.at>
In-Reply-To: <2024Jul26.190007@mips.complang.tuwien.ac.at>

On 7/26/2024 12:00 PM, Anton Ertl wrote:
> "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com> writes:
>> On 7/25/2024 1:09 PM, BGB wrote:
>>> At least with a weak model, software knows that if it doesn't go
>>> through the rituals, the memory will be stale.
>
> There is no guarantee of staleness, only a lack of stronger ordering
> guarantees.
>

Probably true enough.

The idea here was more about needing to go through the rituals to get
not-stale data, rather than about any ability to ensure that data is
stale (not usually a thing).

Granted, if one had deterministically known-stale memory, it is
probable that software would find some way to become dependent on it.

>> The weak model is ideal for me. I know how to program for it
>
> And the fact that this model is so hard to use that few others know
> how to program for it makes it ideal for you.
It is not "that" bad, but one needs to know when to evict stuff from
the caches.

In my case, mostly for sake of sanity, ended up switching mostly to
modulo indexing for everything, as while hashed indexing is generally
more efficient (due to being less negatively affected by the tendency
to align things), it is less well behaved. And, for example, hashed
indexing in the L1 cache effectively breaks the ability to double-map
pages and have changes to one mapping be visible in the others.

Though, it does lead to a questionable practice (when using
direct-mapped caches) of "knocking" or "sniping" cache lines: one can
calculate another address based on the first address, where accessing
this other address will knock the first out of the caches.

If one uses modulo for both, then adding 512K or 1MB to an address,
and accessing this location, is enough to knock a line out of both the
L1 and L2 caches (with the MMU disabled).

Have added an experimental set-associative cache between the L1 and L2
caches, but initially had to figure out ways to work around its
natural tendency to break the ability to use these access patterns
(essentially adding some more rules to the mix).

Though, one possible option (in software) is simply to try to knock
more addresses in an attempt to defeat the cache. Say, try to knock
something out of cache with the set-associative thing:

  MOV.Q (R4, 0x20000), R18   //knock out of L1
  MOV.Q (R4, 0x40000), R16
  MOV.Q (R4, 0x50000), R17
  MOV.Q (R4, 0x60000), R18
  MOV.Q (R4, 0x70000), R19   //at this point, it is knocked out to L2
  MOV.Q (R4, 0x100000), R16  //knock out of L2

But... this doesn't work if the area at R4 is a virtual memory page
(it only works with physical addresses or direct mapping).

Though, better is to use an "INVDC" instruction followed by a load:

  INVDC R4
  MOV.Q (R4), R5   //causes line to be evicted

Which signals to the cache hierarchy the intention to flush this
address (and also stops the victim-cache module from trying to cache
it).
Though, a subsequent load might see whatever was in the cache at the
time R5 was loaded; if this is undesirable:

  INVDC R4
  MOV.Q (R4), R5   //causes dirty line to be evicted
  INVDC R4         //L1 cache discards non-dirty line
  //next attempt to access the address at R4 pulls it from the L2 cache

Currently, there is no way to express the desire to invalidate a
specific cache line from the L2 cache (currently L2 invalidation only
applies to the entire L2 cache). At present, L2 invalidation isn't
generally needed apart from the RAM checker and RAM counting (all of
the other hardware in my current "SoC" exists on the inside edge of
the L2 cache).

Thinking about it, did just go and add intrinsics for these:

  void __mem_invdc(void *ptr);
  void __mem_invic(void *ptr);

Partly as a possible way to reduce the temptation to use
cache-knocking.

There is the more naive option of picking a random unrelated chunk of
memory and then doing a series of sequential memory loads across the
entire memory chunk, but this is less efficient (though generally this
makes more sense when the intention is to flush the entire cache).

There is currently also the option of using "no cache" addresses,
which also invoke special behavior in the caches; but these require
being in supervisor mode or similar (and currently only exist for
physical addresses). Or, setting a virtual memory page to no-cache,
in which case every access is non-caching.

There is a possible need for "Volatile Load" and "Volatile Store"
instructions, but these still don't exist.

>> and it's more efficient
>
> That depends on the hardware.
>
> Yes, the Alpha 21164 with its imprecise exceptions was "more
> efficient" than other hardware for a while, then the Pentium Pro
> came along and gave us precise exceptions and more efficiency. And
> eventually the Alpha people learned the trick, too, and the 21264
> provided precise exceptions (although they did not admit this) and
> more efficiency.
>
> Similarly, I expect that hardware that is designed for good TSO or
> sequential-consistency performance will run code written for this
> model faster than code written for weakly consistent hardware will
> run on that hardware. That's because software written for weakly
> consistent hardware often has to insert barriers or atomic
> operations just in case, and these operations are slow on hardware
> optimized for weak consistency.
>

TSO requires significantly more hardware complexity, though.

Seems like it would be harder to debug the hardware, since:
  There is more that has to go on in the hardware for TSO to work;
  Software will have higher expectations that it actually works.

Though, if it did work, one could potentially use "stronger" caching
in some areas, since the caching would not interfere with the ability
of software to maintain memory consistency.

> By contrast, one can design hardware for strong ordering such that
> the slowness occurs only in those cases when actual (not potential)
> communication between the cores happens, i.e., much less frequently.
>
>> and sometimes use cases do not care if they encounter "stale" data.
>
> Great. Unless these "sometimes" cases are more common than the cases
> where you perform some atomic operation or barrier because of
> potential, but not actual, communication between cores, the weak
> model is still slower than a well-implemented strong model.
>

Atomic operations are still a "needs work" area.

Barrier: no proper barrier instruction exists as of yet in BJX2, but
there are some (weaker) cache-invalidation instructions.

There is the RISC-V FENCE.I instruction, but as-is this gets turned
into an exception (which seems to be allowed based on what I read).
The exception handler will then generally proceed to flush the entire
cache (so it is not a high-performance option).

So, in BJX2, there is:

========== REMAINDER OF ARTICLE TRUNCATED ==========