Path: ...!eternal-september.org!feeder2.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Arm ldaxr / stxr loop question
Date: Tue, 12 Nov 2024 16:55:40 -0600
Organization: A noiseless patient Spider
Lines: 272
Message-ID: <vh0mdo$1qn42$1@dont-email.me>
References: <vfono1$14l9r$1@dont-email.me>
<YROdnVIXfKmwYrn6nZ2dnZfqn_GdnZ2d@supernews.com>
<vg5tf7$3tqmi$2@dont-email.me> <vgm0g1$3c2t2$3@dont-email.me>
<zwwXO.842112$_o_3.379966@fx17.iad> <vgm4vj$3d2as$1@dont-email.me>
<vgm5cb$3d2as$3@dont-email.me> <OnzXO.657386$1m96.281665@fx15.iad>
<TfKXO.658488$1m96.146506@fx15.iad> <T99YO.79275$MoU3.7336@fx36.iad>
<3lGdnVvGQIAq2676nZ2dnZfqnPGdnZ2d@supernews.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 12 Nov 2024 23:55:53 +0100 (CET)
Injection-Info: dont-email.me; posting-host="c8eefc65de78a023932ee25bfee8de3b";
logging-data="1924226"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19KdeRLU/lYsIVyHU4H92NDGP8CC/jCCTA="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:XNhqwdkbrr/xU2nKYcxuHhuWzko=
Content-Language: en-US
In-Reply-To: <3lGdnVvGQIAq2676nZ2dnZfqnPGdnZ2d@supernews.com>
Bytes: 13827
On 11/12/2024 6:14 AM, aph@littlepinkcloud.invalid wrote:
> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>> Any idea what is the advantage for them having all these various
>> LDxxx and STxxx instructions that only seem to combine a LD or ST
>> with a fence instruction? Why have
>> LDAPR Load-Acquire RCpc Register
>> LDAR Load-Acquire Register
>> LDLAR Load LOAcquire Register
>>
>> plus all the variations for byte, half, word, and pair,
>> instead of just the standard LDx and a general data fence instruction?
>
> All this, and much more can be discovered by reading the AMBA
> specifications. However, the main point is that the content of the
> target address does not have to be transferred to the local cache:
> these are remote atomic operations. Quite nice for things like
> fire-and-forget counters, for example.
>
I ended up mostly with a simpler model, IMO:
  Normal / RAM-like: Fetch cache line, write back when evicting;
    Operations: LoadTile, StoreTile, SwapTile,
      LoadPrefetch, StorePrefetch
  Volatile (RAM like): Fetch, operate, write-back;
  MMIO: Remote Load/Store/Swap request;
    Operation is performed on target;
    Currently only supports DWORD and QWORD access;
    Operations are strictly sequential.
In theory, MMIO-style access could be added for RAM, but it is unclear
whether this would be worth the added cost and complexity. It could more
easily enforce strict consistency.
The LoadPrefetch and StorePrefetch operations:
  LoadPrefetch:
    Tries to perform a load from RAM;
    Always responds immediately;
    Signals whether it was an L2 hit or L2 miss.
  StorePrefetch:
    Basically like LoadPrefetch;
    Signals that the intention is to write to memory.
In my cache and bus design, I sometimes refer to cache lines as "tiles"
partly because of how I viewed them as operating, which didn't exactly
match the online descriptions of cache lines.
Say:
  Tile:
    16 bytes in the current implementation;
    Accessed in even and odd rows;
    A memory access may span an even tile and an odd tile;
    The L1 caches need to have a matched pair of tiles for an access.
  Cache Line:
    Usually described as always 32 bytes;
    Descriptions seemed to assume only a single row of lines in caches;
    Generally no mention of allowing for an even/odd scheme.
Seemingly, a cache that operated with cache lines would use a single row
of 32-byte cache lines, with misaligned accesses presumably spanning a
pair of adjacent cache lines. To fit with BRAM access patterns, one would
likely need to split lines in half, and then mirror the relevant tag
bits (to allow detecting hit/miss).
However, online descriptions generally made no mention of how misaligned
accesses were intended to be handled within the limits of a dual-ported
RAM (1R1W).
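
As a rough illustration of the even/odd tile scheme described above, here is a minimal Python sketch. It assumes that even-numbered tiles live in one array and odd-numbered tiles in another, so that any access spanning two adjacent tiles needs exactly one row from each array. The function name and the row-index mapping are my own assumptions for illustration, not taken from the actual implementation.

```python
TILE_BYTES = 16  # tile size given in the post

def tiles_for_access(addr: int, size: int):
    """Return (even_row, odd_row, spans_two_tiles) for a byte access.

    Assumed layout: tile 2k is row k of the even array,
    tile 2k+1 is row k of the odd array.
    """
    first = addr // TILE_BYTES
    last = (addr + size - 1) // TILE_BYTES
    if last - first > 1:
        raise ValueError("access spans more than two tiles")
    # If the first tile is odd, the matching even tile is the next one up.
    even_row = (first + 1) // 2
    odd_row = first // 2
    return even_row, odd_row, last != first
```

For example, an 8-byte access at 0x1C starts in tile 1 (odd row 0) and ends in tile 2 (even row 1), so both arrays are read once and neither needs a second port.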
My L2 cache operates in a way more like that of traditional descriptions
of cache lines, except that they are currently 64 bytes in my L2 cache
(and internally subdivided into four 16-byte parts).
The use of 64 bytes was mostly because this size got the most bandwidth
with my DDR interface (with 16 or 32 byte transfers, more cycles are
spent on overhead; however, latency was lower).
In this case, the L2<->RAM interface:
  512 bit Load Data
  512 bit Store Data
  Load Address
  Store Address
  Request Code (IDLE/LOAD/STORE/SWAP)
  Request Sequence Number
  Response Code (READY/OK/HOLD/FAIL)
  Response Sequence Number
Originally, there were no sequence numbers, and IDLE/READY signaling was
used between each request (needed to return to this state before
starting a new request). The sequence numbers avoided needing to return
to an IDLE/READY state, allowing the bandwidth over this interface to be
nearly doubled.
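
To make the "nearly doubled" reasoning concrete, here is a toy Python model that counts handshake phases under both schemes. It is only a sketch: each "phase" stands in for however many cycles a real transfer takes, the 4-bit sequence-number width is a hypothetical value (the post does not give one), and HOLD/FAIL responses are ignored.

```python
SEQ_BITS = 4  # hypothetical sequence-number width; not stated in the post

def run_idle_ready(requests):
    """Old scheme: the interface returns to IDLE/READY after every request."""
    phases = 0
    completed = []
    for req in requests:
        phases += 1            # request code LOAD/STORE/SWAP, response OK
        completed.append(req)
        phases += 1            # request code IDLE, response READY
    return phases, completed

def run_sequenced(requests):
    """New scheme: a changed sequence number marks a new request."""
    phases = 0
    completed = []
    last_seq = None
    for i, req in enumerate(requests):
        seq = i & ((1 << SEQ_BITS) - 1)
        if seq != last_seq:    # new sequence number, so treat as new request
            phases += 1
            completed.append(req)
            last_seq = seq
    return phases, completed
```

In this idealized model the sequenced scheme needs half as many phases for the same list of requests, which is the upper bound on the gain.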
In a SWAP request, the Load and Store are performed end to end.
General bandwidth for a 16-bit DDR2 chip running at 50MHz (DLL disabled,
effectively a low-power / standby mode) is ~ 90 MB/sec (or 47 MB/s each
direction for SWAP), which is fairly close to the theoretical limit
(internally, the logic for the DDR controller runs at 100MHz, driving IO
as 100MHz SDR, albeit using both posedge and negedge for sampling
responses from the DDR chip, so ~ 200 MHz if seen as SDR).
Theoretically, would be faster to access the chip using the SERDES
interface, but:
  Hadn't gone up the learning curve for this;
  Unclear if I could really effectively utilize the bandwidth with a
    50MHz CPU and my current bus;
  Actual bandwidth gains would be smaller, as then CAS and RAS latency
    would dominate.
Could in theory have used Vivado MIG, but then I would have needed to
deal with AXI, and never crossed the threshold of wanting to deal with AXI.
Between CPU, L2, and various other devices, I am using a ringbus:
  Connections:
    128 bits data;
    48 bits address (96 bits between L1 caches and TLB);
    16 bits: request/response code and flags;
    16 bits: source/dest node and request sequence number;
  Each node has a set of input and output connections;
  Each node may modify a request/response,
    or simply forward from input to output.
  Messages move along at one position per clock cycle.
  Generally also 50 MHz at present (*1).
*1: Pretty much everything (apart from some hardware interfaces) runs on
the same clock. Some devices needed faster clocks. Any slower clocks
were generally faked using accumulator dividers (add a fraction every
clock-cycle and use the MSB of the accumulator as the virtual clock).
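
The accumulator-divider trick from the footnote can be modeled in a few lines of Python. This is a behavioral sketch, not the actual HDL; the function names and the accumulator width are assumptions chosen for illustration. The output frequency works out to f_in * increment / 2^width.

```python
def virtual_clock(increment: int, width: int, n_cycles: int):
    """Simulate an accumulator divider for n_cycles of the base clock.

    Each base-clock cycle adds `increment` to a `width`-bit accumulator;
    the accumulator's MSB is used as the slower virtual clock.
    """
    mask = (1 << width) - 1
    acc = 0
    out = []
    for _ in range(n_cycles):
        acc = (acc + increment) & mask
        out.append(acc >> (width - 1))  # MSB of the accumulator
    return out

def rising_edges(bits):
    """Count 0 -> 1 transitions in a sampled clock waveform."""
    return sum(1 for a, b in zip(bits, bits[1:]) if a == 0 and b == 1)
```

With a 16-bit accumulator and an increment of 3000, 65536 base cycles produce 3000 rising edges, so the ratio need not be an integer divisor. The cost is some jitter, since edges can only land on base-clock boundaries.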
Comparatively, the per-node logic cost isn't too high, nor is the logic
complexity. However, performance of the ring is very sensitive to ring
latency (and there are a fair number of hacks to try to reduce the
overall latency of the ring in common paths).
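
A small Python model shows why ring latency matters so much. It assumes a unidirectional ring in which a response keeps travelling in the same direction as the request until it gets back to the sender; that assumption, the node numbering, and the `service_cycles` parameter are mine, not details from the post.

```python
def hops(src: int, dst: int, n_nodes: int) -> int:
    """Cycles to travel from src to dst at one position per clock."""
    return (dst - src) % n_nodes

def round_trip_cycles(src: int, dst: int, n_nodes: int,
                      service_cycles: int = 0) -> int:
    """Request travels src -> dst, is serviced, response travels dst -> src."""
    return (hops(src, dst, n_nodes) + service_cycles
            + hops(dst, src, n_nodes))
```

Under these assumptions every round trip between two distinct nodes costs a full lap of the ring no matter where the nodes sit, so each node added to the ring adds a cycle to every request.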
At present, the highest resolution video modes that can be managed
semi-effectively are 640x400 and 640x480 256-color (60Hz), or ~ 20 MB/sec.
Can do 800x600 or similar in RGBI or color-cell modes (640x400 or
640x480 CC also being an option). Theoretically, there is a 1024x768
monochrome mode, but this is mostly untested. The 4-color and monochrome
modes had optional Bayer-pattern sub-modes to mimic full color.
Main modes I have ended up using:
80x25 and 80x50 text/color-cell modes;
Text and color cell graphics exist in the same mode.
320x200 hi-color (RGB555);
640x400 indexed 256 color.
Trying to go much higher than this, and the combination of ringbus
latency and L2 misses turns the display into a broken mess (with a DRAM
backed framebuffer). Originally, I had the framebuffer in Block-RAM, but
this in turn set a hard limit on framebuffer size (and putting the
framebuffer in DRAM allowed for a bigger L2 cache).
Theoretically, could allow higher resolution modes by adding a fast path
between the display output and DDR RAM interface (with access then being
multiplexed with the L2 cache). Have not done so.
Or, possible but more radical:
  Bolt the VGA output module directly to the L2 cache;
  Could theoretically do 800x600 high-color;
  Would eat around 2/3 of total RAM bandwidth.
Major concern here is that setting resolutions too high would starve the
CPU of the ability to access memory (vs the current situation where
trying to set higher resolutions mostly results in progressively worse
display glitches).
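
The bandwidth figures for the display modes can be sanity-checked with a short calculation. This sketch assumes a 60 Hz refresh for every mode (the post only states 60 Hz for the 640-wide modes), counts only visible pixels, and uses 1 MB = 10^6 bytes; the function name is hypothetical.

```python
def scanout_mb_per_sec(width: int, height: int, bits_per_pixel: int,
                       refresh_hz: int = 60) -> float:
    """Framebuffer read bandwidth needed to scan out one display mode."""
    bytes_per_frame = width * height * bits_per_pixel / 8
    return bytes_per_frame * refresh_hz / 1e6

modes = {
    "320x200 hi-color": (320, 200, 16),
    "640x400 256-color": (640, 400, 8),
    "640x480 256-color": (640, 480, 8),
    "800x600 hi-color": (800, 600, 16),
}
for name, (w, h, bpp) in modes.items():
    print(f"{name}: {scanout_mb_per_sec(w, h, bpp):.1f} MB/s")
```

Under these assumptions 640x480 at 256 colors needs about 18.4 MB/s, in line with the ~ 20 MB/sec quoted earlier, and 800x600 hi-color needs about 57.6 MB/s, which is roughly 64% of a ~ 90 MB/sec DRAM interface and so consistent with the "around 2/3" estimate.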
Logic would need to be in place so that display can't totally hog the
========== REMAINDER OF ARTICLE TRUNCATED ==========