Path: ...!weretis.net!feeder9.news.weretis.net!news.quux.org!eternal-september.org!feeder2.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Memory ordering
Date: Fri, 15 Nov 2024 14:53:00 -0600
Organization: A noiseless patient Spider
Lines: 143
Message-ID: <vh8cbo$3j8c5$1@dont-email.me>
References: <vfono1$14l9r$1@dont-email.me> <vgm4vj$3d2as$1@dont-email.me>
<vgm5cb$3d2as$3@dont-email.me> <YfxXO.384093$EEm7.56154@fx16.iad>
<vh4530$2mar5$1@dont-email.me>
<-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>
<vh5t5b$312cl$2@dont-email.me>
<5yqdnU9eL_Y_GKv6nZ2dnZfqn_GdnZ2d@supernews.com>
<2024Nov15.082512@mips.complang.tuwien.ac.at> <vh7rlr$3fu9i$1@dont-email.me>
<2024Nov15.182737@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 15 Nov 2024 21:53:13 +0100 (CET)
Injection-Info: dont-email.me; posting-host="00d436b94db1d6c2525abde220a5befd";
logging-data="3776901"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX180PQxyVNjLVIvHabh50NGy4z0VXzclQxs="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Yn9yrdAFB8fkmY0ZoI61jOgpQ4k=
Content-Language: en-US
In-Reply-To: <2024Nov15.182737@mips.complang.tuwien.ac.at>
Bytes: 7056
On 11/15/2024 11:27 AM, Anton Ertl wrote:
> jseigh <jseigh_es00@xemaps.com> writes:
>> Anybody doing that sort of programming, i.e. lock-free or distributed
>> algorithms, who can't handle weakly consistent memory models, shouldn't
>> be doing that sort of programming in the first place.
>
> Do you have any argument that supports this claim.
>
>> Strongly consistent memory won't help incompetence.
>
> Strong words to hide lack of arguments?
>
In my case, as I see it:
The tradeoff is more about implementation cost, performance, etc.
Weak model:
  Cheaper (and simpler) to implement;
  Performs better when there is no need to synchronize memory;
  Performs worse when there is a need to synchronize memory;
  ...
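
As a minimal C11 sketch (my own, not from the thread) of what "need to
synchronize memory" looks like from software: the release/acquire pair
below is the point where a weak model has to pay for ordering, while
the plain accesses around it stay cheap.

  #include <stdatomic.h>
  #include <stdbool.h>

  static int payload;            /* ordinary, non-atomic data       */
  static atomic_bool ready;      /* publication flag, starts false  */

  void producer(void)
  {
      payload = 42;              /* plain store, no ordering cost   */
      /* release store: where a weak model inserts the barrier      */
      atomic_store_explicit(&ready, true, memory_order_release);
  }

  int consumer(void)
  {
      /* acquire load: pairs with the release store above           */
      while (!atomic_load_explicit(&ready, memory_order_acquire))
          ;                      /* spin until published            */
      return payload;            /* guaranteed to read 42           */
  }

On a strongly ordered machine, roughly this ordering is provided
whether or not the program asks for it.
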
However, local to the CPU core:
Not respecting things like RAW hazards does not seem well advised.
Like, if we store to a location and then immediately read back from it,
one expects to see the most recently written value, not the previous
value. Or, if one stores to two adjacent memory locations, one expects
both stores to write their data correctly.
Granted, it is a tradeoff:
  Not bothering: fast and cheap, but may break expected behavior;
    Could naively have the compiler insert NOPs wherever aliasing is
    possible, but this is bad.
  Add an interlock check and stall the pipeline when a hazard hits:
    Works, but can add a noticeable performance penalty;
    My attempts at 75 and 100 MHz cores had often done this;
    Sadly, memory RAW and WAW hazards are not exactly rare.
  Use internal forwarding, so written data is available on the next
  cycle (a rough C sketch of this follows the footnote below):
    Better performance;
    But has a fairly high cost in the FPGA (*1).
*1: This factor (along with L1 cache sizes) weighs heavily in why I
continue to use 50 MHz. Otherwise, I could use 75 MHz, but this internal
forwarding logic, and L1 caches with 32K of BRAM (excluding metadata)
and 1-cycle access, are not really viable at 75 MHz.
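
As a rough illustration of the forwarding option above (a C model of my
own, not the actual RTL): a load first checks the small set of in-flight
stores and, on an address match, forwards the newest matching data
instead of stalling until the store has landed in the L1 array.

  #include <stdint.h>
  #include <stddef.h>

  #define SB_ENTRIES 4                      /* in-flight stores       */
  #define L1_WORDS   1024                   /* toy L1 data array      */

  typedef struct { uint32_t addr, data; int valid; } sb_entry_t;

  static sb_entry_t store_buf[SB_ENTRIES];  /* index 0 = newest store */
  static uint32_t   l1_data[L1_WORDS];

  uint32_t load_with_forwarding(uint32_t addr)
  {
      /* RAW hazard check: scan newest to oldest for a matching store */
      for (size_t i = 0; i < SB_ENTRIES; i++) {
          if (store_buf[i].valid && store_buf[i].addr == addr)
              return store_buf[i].data;     /* forward, no stall      */
      }
      return l1_data[(addr >> 2) % L1_WORDS]; /* no hazard: read L1   */
  }

In hardware the same comparison has to happen within the cycle, which
is presumably where much of the FPGA cost comes from.
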
For the L2 cache, which is much bigger, one can use a few extra
pad-cycles to access the Block-RAM array. Though, a 5-cycle latency for
Load/Store operations would not be good.
Can note that with Block-RAM, the usual behavior seems to be that if one
tries to read from one port while writing to another port on the same
clock edge, and both are at the same location, the prior contents will
be returned. This may be a general Verilog behavior though, rather than
a Block-RAM thing (it also seems to apply to LUTRAM accessed in the same
pattern; though LUTRAM also allows reading the value via combinatorial
logic rather than at a clock edge, which seems to always return the
value from the most recent clock edge).
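
A rough C model of that read-during-write behavior (my reading of it,
not vendor documentation): on a given clock edge the read port samples
the old array contents before the write for that same edge is applied,
so a same-address read/write pair returns the prior data.

  #include <stdint.h>

  #define DEPTH 256
  static uint32_t bram[DEPTH];

  /* One synchronous clock edge, one read port and one write port. */
  uint32_t bram_clock_edge(uint32_t raddr, uint32_t waddr,
                           uint32_t wdata, int wen)
  {
      uint32_t rdata = bram[raddr % DEPTH]; /* old data, even if raddr == waddr */
      if (wen)
          bram[waddr % DEPTH] = wdata;      /* write applied after the read samples */
      return rdata;
  }
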
As I can note, a 4K or 8K L1 cache with a stall on RAW or WAW, at 75
MHz, tends IME to perform worse than a 32K cache running at 50 MHz with
no RAW/WAW stall.
Also, trying to increase the clock speed by increasing instruction
latency was, in many cases, not ideal for performance either.
Granted, if I were to do things the "DEC Alpha" way, I probably could
run stuff at 75 MHz, but then I would likely need the compiler to insert
a bunch of strategic NOPs so that programs don't break.
For memory ordering, a case could possibly be made (in my case) for an
"order-respecting DRAM cache" via the MMIO interface, say:
  F000_01000000..F000_3FFFFFFF
This range could be defined to alias with the main RAM map, but with
strictly sequential ordering for every memory access across all cores
(at the expense of performance).
Where:
  0000_00000000..7FFF_FFFFFFFF: Virtual Address Space
  8000_00000000..BFFF_FFFFFFFF: Supervisor-Only Virtual Address Space
  C000_00000000..CFFF_FFFFFFFF: Physical Address Space, Default Caching
  D000_00000000..DFFF_FFFFFFFF: Physical Address Space, Volatile/NoCache
  E000_00000000..EFFF_FFFFFFFF: Reserved
  F000_00000000..FFFF_FFFFFFFF: MMIO Space
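
For illustration, a small decoder over that map (the function and enum
names are mine, not from the actual core); the top hex digit of the
48-bit address selects the region.

  #include <stdint.h>

  typedef enum {
      AS_USER_VIRT,     /* 0000..7FFF: Virtual Address Space         */
      AS_SUPER_VIRT,    /* 8000..BFFF: Supervisor-Only Virtual Space */
      AS_PHYS_CACHED,   /* C000..CFFF: Physical, Default Caching     */
      AS_PHYS_NOCACHE,  /* D000..DFFF: Physical, Volatile/NoCache    */
      AS_RESERVED,      /* E000..EFFF: Reserved                      */
      AS_MMIO           /* F000..FFFF: MMIO Space                    */
  } addr_space_t;

  addr_space_t classify_addr(uint64_t addr)
  {
      unsigned top = (unsigned)((addr >> 44) & 0xF); /* top hex digit */
      if (top <= 0x7) return AS_USER_VIRT;
      if (top <= 0xB) return AS_SUPER_VIRT;
      if (top == 0xC) return AS_PHYS_CACHED;
      if (top == 0xD) return AS_PHYS_NOCACHE;
      if (top == 0xE) return AS_RESERVED;
      return AS_MMIO;
  }
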
MMIO space is currently fully independent of RAM space.
However, at present:
  FFFF_F0000000..FFFF_FFFFFFFF: MMIO Space, as used for MMIO devices.
So, in theory, remerging the RAM alias into MMIO space would be possible
(well, except that trying to access HW MMIO address ranges via RAM-space
accesses would likely be disallowed).
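
To sketch how the "order-respecting" alias might look from software
(the helper names are hypothetical; the base constants follow the
ranges above): the same physical RAM word could be reached either
through the normal cached physical window or through the proposed
strictly ordered MMIO alias.

  #include <stdint.h>

  #define PHYS_CACHED_BASE   0xC00000000000ull /* C000_...: default caching        */
  #define ORDERED_ALIAS_BASE 0xF00000000000ull /* F000_...: proposed ordered alias */

  /* Normal cached access to a physical RAM address. */
  static inline volatile uint32_t *ram_cached(uint64_t phys)
  {
      return (volatile uint32_t *)(uintptr_t)(PHYS_CACHED_BASE + phys);
  }

  /* Hypothetical strictly ordered access: every load/store through this
     pointer would be globally sequenced across cores, at some cost.    */
  static inline volatile uint32_t *ram_ordered(uint64_t phys)
  {
      return (volatile uint32_t *)(uintptr_t)(ORDERED_ALIAS_BASE + phys);
  }
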
Can note, with the MMU disabled:
  0000_00000000..0FFF_FFFFFFFF: Same as the C000..CFFF space.
  1000_00000000..7FFF_FFFFFFFF: Invalid
....
Granted, the current scheme does set a limit of 16TB of RAM.
But, the biggest FPGA boards I have only have 256MB, so, ...
And, current VA map within TestKern (from memory):
  0000_00000000..0000_00FFFFFF: NULL Space
  0000_01000000..0000_3FFFFFFF: RAM Range (Identity Mapped)
  0000_40000000..0000_BFFFFFFF: Direct Page Mapping (no swap)
  0001_00000000..3FFF_FFFFFFFF: Mapped to swapfile, Global
  4000_00000000..7FFF_FFFFFFFF: Process Local
Note that, within the RAM range, the RAM will wrap around. The specifics
of the wraparound are used to detect the RAM size (this would set an
effective limit at 512MB, after which no wraparound would be detected).
This would need to change if larger RAM sizes were supported.
Not sure how RAM size is detected with DIMM modules. IIRC, with PCs, it
was more a matter of probing along linearly until one finds an address
that no longer returns valid data (say, if one hits the 1GB mark and
gets back 00000000 or FFFFFFFF or similar, assume the end of RAM is at
1GB).
One does need to make sure the caches (including the L2 cache) are
flushed during all this, as the caches, doing their usual cache thing,
may lead one to incorrectly detect more RAM than actually exists.
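
A rough sketch of that wraparound probe (my own code; flush_caches() is
a placeholder for whatever the platform actually provides): write a
signature at the base of RAM, then look for it mirrored at successive
power-of-two offsets; the first offset where the mirror shows up is the
effective RAM size.

  #include <stdint.h>

  extern void flush_caches(void);  /* placeholder: platform-specific flush */

  uint64_t detect_ram_size(volatile uint32_t *ram_base, uint64_t max_size)
  {
      ram_base[0] = 0x52414D30u;   /* distinctive signature at offset 0    */
      flush_caches();              /* make sure the write reaches DRAM     */

      for (uint64_t size = 1u << 20; size < max_size; size <<= 1) {
          volatile uint32_t *mirror =
              (volatile uint32_t *)((uintptr_t)ram_base + size);
          flush_caches();          /* avoid reading a stale cached line    */
          if (*mirror == 0x52414D30u)
              return size;         /* wrapped around: end of RAM found     */
      }
      return max_size;             /* no wrap detected below the limit     */
  }

In practice one would probably also check a second pattern to rule out
a false match, but the wraparound idea is the same.
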
....
> - anton