Path: ...!weretis.net!feeder9.news.weretis.net!news.quux.org!eternal-september.org!feeder2.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Memory ordering
Date: Fri, 15 Nov 2024 14:53:00 -0600
Organization: A noiseless patient Spider
Lines: 143
Message-ID: <vh8cbo$3j8c5$1@dont-email.me>
References: <vfono1$14l9r$1@dont-email.me> <vgm4vj$3d2as$1@dont-email.me>
<vgm5cb$3d2as$3@dont-email.me> <YfxXO.384093$EEm7.56154@fx16.iad>
<vh4530$2mar5$1@dont-email.me>
<-rKdnTO4LdoWXKj6nZ2dnZfqnPWdnZ2d@supernews.com>
<vh5t5b$312cl$2@dont-email.me>
<5yqdnU9eL_Y_GKv6nZ2dnZfqn_GdnZ2d@supernews.com>
<2024Nov15.082512@mips.complang.tuwien.ac.at> <vh7rlr$3fu9i$1@dont-email.me>
<2024Nov15.182737@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 15 Nov 2024 21:53:13 +0100 (CET)
Injection-Info: dont-email.me; posting-host="00d436b94db1d6c2525abde220a5befd";
logging-data="3776901"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX180PQxyVNjLVIvHabh50NGy4z0VXzclQxs="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Yn9yrdAFB8fkmY0ZoI61jOgpQ4k=
Content-Language: en-US
In-Reply-To: <2024Nov15.182737@mips.complang.tuwien.ac.at>
Bytes: 7056
On 11/15/2024 11:27 AM, Anton Ertl wrote:
> jseigh <jseigh_es00@xemaps.com> writes:
>> Anybody doing that sort of programming, i.e. lock-free or distributed
>> algorithms, who can't handle weakly consistent memory models, shouldn't
>> be doing that sort of programming in the first place.
>
> Do you have any argument that supports this claim.
>
>> Strongly consistent memory won't help incompetence.
>
> Strong words to hide lack of arguments?
>
In my case, as I see it:
The tradeoff is more about implementation cost, performance, etc.
Weak model:
  Cheaper (and simpler) to implement;
  Performs better when there is no need to synchronize memory;
  Performs worse when there is a need to synchronize memory;
  ...
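
As a minimal C11 sketch (my own, not from the thread) of what "need to
synchronize memory" looks like from software: the release/acquire pair
below is the point where a weak model has to pay for ordering, while
the plain accesses around it stay cheap.

  #include <stdatomic.h>
  #include <stdbool.h>

  static int payload;            /* ordinary, non-atomic data       */
  static atomic_bool ready;      /* publication flag, starts false  */

  void producer(void)
  {
      payload = 42;              /* plain store, no ordering cost   */
      /* release store: where a weak model inserts the barrier      */
      atomic_store_explicit(&ready, true, memory_order_release);
  }

  int consumer(void)
  {
      /* acquire load: pairs with the release store above           */
      while (!atomic_load_explicit(&ready, memory_order_acquire))
          ;                      /* spin until published            */
      return payload;            /* guaranteed to read 42           */
  }

On a strongly ordered machine, roughly this ordering is provided
whether or not the program asks for it.
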
However, local to the CPU core:
Not respecting things like RAW hazards does not seem well advised.
Like, if we store to a location and then immediately read back from it,
one expects to see the most recently written value, not the previous
value. Or, if one stores to two adjacent memory locations, one expects
both stores to write their data correctly.
Granted, it is a tradeoff:
  Not bothering: fast and cheap, but may break expected behavior;
    Could naively have the compiler insert NOPs wherever aliasing is
    possible, but this is bad.
  Add an interlock check and stall the pipeline when a hazard hits:
    Works, but can add a noticeable performance penalty;
    My attempts at 75 and 100 MHz cores had often done this;
    Sadly, memory RAW and WAW hazards are not exactly rare.
  Use internal forwarding, so written data is available on the next
  cycle (a rough C sketch of this follows the footnote below):
    Better performance;
    But has a fairly high cost in the FPGA (*1).
*1: This factor (along with L1 cache sizes) weighs heavily in why I
continue to use 50 MHz. Otherwise, I could use 75 MHz, but this internal
forwarding logic, and L1 caches with 32K of BRAM (excluding metadata)
and 1-cycle access, are not really viable at 75 MHz.
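
As a rough illustration of the forwarding option above (a C model of my
own, not the actual RTL): a load first checks the small set of in-flight
stores and, on an address match, forwards the newest matching data
instead of stalling until the store has landed in the L1 array.

  #include <stdint.h>
  #include <stddef.h>

  #define SB_ENTRIES 4                      /* in-flight stores       */
  #define L1_WORDS   1024                   /* toy L1 data array      */

  typedef struct { uint32_t addr, data; int valid; } sb_entry_t;

  static sb_entry_t store_buf[SB_ENTRIES];  /* index 0 = newest store */
  static uint32_t   l1_data[L1_WORDS];

  uint32_t load_with_forwarding(uint32_t addr)
  {
      /* RAW hazard check: scan newest to oldest for a matching store */
      for (size_t i = 0; i < SB_ENTRIES; i++) {
          if (store_buf[i].valid && store_buf[i].addr == addr)
              return store_buf[i].data;     /* forward, no stall      */
      }
      return l1_data[(addr >> 2) % L1_WORDS]; /* no hazard: read L1   */
  }

In hardware the same comparison has to happen within the cycle, which
is presumably where much of the FPGA cost comes from.
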
For the L2 cache, which is much bigger, one can use a few extra
pad-cycles to access the Block-RAM array. Though, a 5-cycle latency for
Load/Store operations would not be good.
Can note that with Block-RAM, the usual behavior seems to be that if one
tries to read from one port while writing to another port on the same
clock edge, and both are at the same location, the prior contents will
be returned. This may be a general Verilog behavior though, rather than
a Block-RAM thing (it also seems to apply to LUTRAM accessed in the same
pattern; though LUTRAM also allows reading the value via combinatorial
logic rather than at a clock edge, which seems to always return the
value from the most recent clock edge).
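
A rough C model of that read-during-write behavior (my reading of it,
not vendor documentation): on a given clock edge the read port samples
the old array contents before the write for that same edge is applied,
so a same-address read/write pair returns the prior data.

  #include <stdint.h>

  #define DEPTH 256
  static uint32_t bram[DEPTH];

  /* One synchronous clock edge, one read port and one write port. */
  uint32_t bram_clock_edge(uint32_t raddr, uint32_t waddr,
                           uint32_t wdata, int wen)
  {
      uint32_t rdata = bram[raddr % DEPTH]; /* old data, even if raddr == waddr */
      if (wen)
          bram[waddr % DEPTH] = wdata;      /* write applied after the read samples */
      return rdata;
  }
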
As I can note, a 4K or 8K L1 cache with a stall on RAW or WAW, at 75
MHz, tends IME to perform worse than a 32K cache running at 50 MHz with
no RAW/WAW stall.
Also, trying to increase the clock speed by increasing instruction
latency was, in many cases, not ideal for performance either.
Granted, if I were to do things the "DEC Alpha" way, I probably could
run stuff at 75 MHz, but then I would likely need the compiler to insert
a bunch of strategic NOPs so that programs don't break.
For memory ordering, a case could possibly be made (in my case) for an
"order-respecting DRAM cache" via the MMIO interface, say:
  F000_01000000..F000_3FFFFFFF
This range could be defined to alias with the main RAM map, but with
strictly sequential ordering for every memory access across all cores
(at the expense of performance).
Where:
  0000_00000000..7FFF_FFFFFFFF: Virtual Address Space
  8000_00000000..BFFF_FFFFFFFF: Supervisor-Only Virtual Address Space
  C000_00000000..CFFF_FFFFFFFF: Physical Address Space, Default Caching
  D000_00000000..DFFF_FFFFFFFF: Physical Address Space, Volatile/NoCache
  E000_00000000..EFFF_FFFFFFFF: Reserved
  F000_00000000..FFFF_FFFFFFFF: MMIO Space
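
For illustration, a small decoder over that map (the function and enum
names are mine, not from the actual core); the top hex digit of the
48-bit address selects the region.

  #include <stdint.h>

  typedef enum {
      AS_USER_VIRT,     /* 0000..7FFF: Virtual Address Space         */
      AS_SUPER_VIRT,    /* 8000..BFFF: Supervisor-Only Virtual Space */
      AS_PHYS_CACHED,   /* C000..CFFF: Physical, Default Caching     */
      AS_PHYS_NOCACHE,  /* D000..DFFF: Physical, Volatile/NoCache    */
      AS_RESERVED,      /* E000..EFFF: Reserved                      */
      AS_MMIO           /* F000..FFFF: MMIO Space                    */
  } addr_space_t;

  addr_space_t classify_addr(uint64_t addr)
  {
      unsigned top = (unsigned)((addr >> 44) & 0xF); /* top hex digit */
      if (top <= 0x7) return AS_USER_VIRT;
      if (top <= 0xB) return AS_SUPER_VIRT;
      if (top == 0xC) return AS_PHYS_CACHED;
      if (top == 0xD) return AS_PHYS_NOCACHE;
      if (top == 0xE) return AS_RESERVED;
      return AS_MMIO;
  }
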
MMIO space is currently fully independent of RAM space.
However, at present:
  FFFF_F0000000..FFFF_FFFFFFFF: MMIO Space, as used for MMIO devices.
So, in theory, remerging the RAM alias into MMIO space would be possible
(well, except that trying to access HW MMIO address ranges via RAM-space
accesses would likely be disallowed).
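
To sketch how the "order-respecting" alias might look from software
(the helper names are hypothetical; the base constants follow the
ranges above): the same physical RAM word could be reached either
through the normal cached physical window or through the proposed
strictly ordered MMIO alias.

  #include <stdint.h>

  #define PHYS_CACHED_BASE   0xC00000000000ull /* C000_...: default caching        */
  #define ORDERED_ALIAS_BASE 0xF00000000000ull /* F000_...: proposed ordered alias */

  /* Normal cached access to a physical RAM address. */
  static inline volatile uint32_t *ram_cached(uint64_t phys)
  {
      return (volatile uint32_t *)(uintptr_t)(PHYS_CACHED_BASE + phys);
  }

  /* Hypothetical strictly ordered access: every load/store through this
     pointer would be globally sequenced across cores, at some cost.    */
  static inline volatile uint32_t *ram_ordered(uint64_t phys)
  {
      return (volatile uint32_t *)(uintptr_t)(ORDERED_ALIAS_BASE + phys);
  }
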
Can note, with the MMU disabled:
  0000_00000000..0FFF_FFFFFFFF: Same as the C000..CFFF space.
  1000_00000000..7FFF_FFFFFFFF: Invalid
....
Granted, the current scheme does set a limit of 16TB of RAM.
But, the biggest FPGA boards I have only have 256MB, so, ...
And, current VA map within TestKern (from memory):
  0000_00000000..0000_00FFFFFF: NULL Space
  0000_01000000..0000_3FFFFFFF: RAM Range (Identity Mapped)
  0000_40000000..0000_BFFFFFFF: Direct Page Mapping (no swap)
  0001_00000000..3FFF_FFFFFFFF: Mapped to swapfile, Global
  4000_00000000..7FFF_FFFFFFFF: Process Local
Note that, within the RAM range, the RAM will wrap around. The specifics
of the wraparound are used to detect the RAM size (this would set an
effective limit at 512MB, after which no wraparound would be detected).
This would need to change if larger RAM sizes were supported.
Not sure how RAM size is detected with DIMM modules. IIRC, with PCs, it
was more a matter of probing along linearly until one finds an address
that no longer returns valid data (say, if one hits the 1GB mark and
gets back 00000000 or FFFFFFFF or similar, assume the end of RAM is at
1GB).
One does need to make sure the caches (including the L2 cache) are
flushed during all this, as the caches, doing their usual cache thing,
may lead one to incorrectly detect more RAM than actually exists.
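
A rough sketch of that wraparound probe (my own code; flush_caches() is
a placeholder for whatever the platform actually provides): write a
signature at the base of RAM, then look for it mirrored at successive
power-of-two offsets; the first offset where the mirror shows up is the
effective RAM size.

  #include <stdint.h>

  extern void flush_caches(void);  /* placeholder: platform-specific flush */

  uint64_t detect_ram_size(volatile uint32_t *ram_base, uint64_t max_size)
  {
      ram_base[0] = 0x52414D30u;   /* distinctive signature at offset 0    */
      flush_caches();              /* make sure the write reaches DRAM     */

      for (uint64_t size = 1u << 20; size < max_size; size <<= 1) {
          volatile uint32_t *mirror =
              (volatile uint32_t *)((uintptr_t)ram_base + size);
          flush_caches();          /* avoid reading a stale cached line    */
          if (*mirror == 0x52414D30u)
              return size;         /* wrapped around: end of RAM found     */
      }
      return max_size;             /* no wrap detected below the limit     */
  }

In practice one would probably also check a second pattern to rule out
a false match, but the wraparound idea is the same.
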
....
> - anton