Article <vh0mdo$1qn42$1@dont-email.me>

Path: ...!eternal-september.org!feeder2.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Arm ldaxr / stxr loop question
Date: Tue, 12 Nov 2024 16:55:40 -0600
Organization: A noiseless patient Spider
Lines: 272
Message-ID: <vh0mdo$1qn42$1@dont-email.me>
References: <vfono1$14l9r$1@dont-email.me>
 <YROdnVIXfKmwYrn6nZ2dnZfqn_GdnZ2d@supernews.com>
 <vg5tf7$3tqmi$2@dont-email.me> <vgm0g1$3c2t2$3@dont-email.me>
 <zwwXO.842112$_o_3.379966@fx17.iad> <vgm4vj$3d2as$1@dont-email.me>
 <vgm5cb$3d2as$3@dont-email.me> <OnzXO.657386$1m96.281665@fx15.iad>
 <TfKXO.658488$1m96.146506@fx15.iad> <T99YO.79275$MoU3.7336@fx36.iad>
 <3lGdnVvGQIAq2676nZ2dnZfqnPGdnZ2d@supernews.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 12 Nov 2024 23:55:53 +0100 (CET)
Injection-Info: dont-email.me; posting-host="c8eefc65de78a023932ee25bfee8de3b";
	logging-data="1924226"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19KdeRLU/lYsIVyHU4H92NDGP8CC/jCCTA="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:XNhqwdkbrr/xU2nKYcxuHhuWzko=
Content-Language: en-US
In-Reply-To: <3lGdnVvGQIAq2676nZ2dnZfqnPGdnZ2d@supernews.com>
Bytes: 13827

On 11/12/2024 6:14 AM, aph@littlepinkcloud.invalid wrote:
> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>> Any idea what is the advantage for them having all these various
>> LDxxx and STxxx instructions that only seem to combine a LD or ST
>> with a fence instruction? Why have
>> LDAPR Load-Acquire RCpc Register
>> LDAR Load-Acquire Register
>> LDLAR Load LOAcquire Register
>>
>> plus all the variations for byte, half, word, and pair,
>> instead of just the standard LDx and a general data fence instruction?
> 
> All this, and much more can be discovered by reading the AMBA
> specifications. However, the main point is that the content of the
> target address does not have to be transferred to the local cache:
> these are remote atomic operations. Quite nice for things like
> fire-and-forget counters, for example.
> 

I ended up mostly with a simpler model, IMO:
   Normal / RAM-like: Fetch cache line, write back when evicting;
     Operations: LoadTile, StoreTile, SwapTile,
       LoadPrefetch, StorePrefetch
   Volatile (RAM-like): Fetch, operate, write-back;
   MMIO: Remote Load/Store/Swap request;
     Operation is performed on target;
     Currently only supports DWORD and QWORD access;
     Operations are strictly sequential.

In theory, MMIO-style access could be added for RAM, but it is unclear 
whether this is worth the added cost and complexity. It would make it 
easier to enforce strict consistency.

The LoadPrefetch and StorePrefetch operations:
   LoadPrefetch, try to perform a load from RAM
     Always responds immediately
     Signals whether it was an L2 hit or L2 Miss.
   StorePrefetch
     Basically like LoadPrefetch
     Signals that the intention is to write to memory.


In my cache and bus design, I sometimes refer to cache lines as "tiles" 
partly because of how I viewed them as operating, which didn't exactly 
match the online descriptions of cache lines.

Say:
   Tile:
     16 bytes in the current implementation.
     Accessed in even and odd rows
       A memory access may span an even tile and an odd tile;
       The L1 caches need to have a matched pair of tiles for an access.
   Cache Line:
     Usually described as always 32 bytes;
     Descriptions seemed to assume only a single row of lines in caches.
       Generally no mention of allowing for an even/odd scheme.
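
As a rough illustration of the even/odd scheme, the sketch below maps a byte address and access size onto the matched pair of tiles an L1 cache would need. The 16-byte tile size is from the description above; the function name, the choice to always pair a tile with its successor, and the returned tuple layout are assumptions made for the example, not the actual hardware logic.

```python
TILE_BYTES = 16  # tile size given in the post


def tiles_for_access(addr: int, size: int):
    """Return (even_tile, odd_tile, spans_both) for one memory access.

    Assumption: the cache always looks up the tile holding `addr` plus
    the next tile, so one of the pair is even-numbered and one is odd.
    """
    if not 1 <= size <= TILE_BYTES:
        raise ValueError("access must fit within two adjacent tiles")
    first = addr // TILE_BYTES
    if first % 2 == 0:
        even, odd = first, first + 1
    else:
        even, odd = first + 1, first
    # True when the access actually crosses the tile boundary.
    spans_both = (addr % TILE_BYTES) + size > TILE_BYTES
    return even, odd, spans_both
```

For example, an 8-byte access at 0x1C starts in odd tile 1 and spills into even tile 2, while the same access at 0x20 stays inside even tile 2.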

Seemingly, a cache that operated with cache lines would use a single row 
of 32-byte cache lines, with misaligned accesses presumably spanning a 
pair of adjacent cache lines. To fit with BRAM access patterns, would 
likely need to split lines in half, and then mirror the relevant tag 
bits (to allow detecting hit/miss).

However, online descriptions generally made no mention of how misaligned 
accesses were intended to be handled within the limits of a dual-ported 
RAM (1R1W).


My L2 cache operates in a way more like that of traditional descriptions 
of cache lines, except that they are currently 64 bytes in my L2 cache 
(and internally subdivided into four 16-byte parts).

The use of 64 bytes was mostly because this size got the most bandwidth 
with my DDR interface (with 16 or 32 byte transfers, more cycles are 
spent on overhead; however, latency was lower).

In this case, the L2<->RAM interface:
   512 bit Load Data
   512 bit Store Data
   Load Address
   Store Address
   Request Code (IDLE/LOAD/STORE/SWAP)
   Request Sequence Number
   Response Code (READY/OK/HOLD/FAIL)
   Response Sequence Number

Originally, there were no sequence numbers, and IDLE/READY signaling was 
used between each request (needed to return to this state before 
starting a new request). The sequence numbers avoided needing to return 
to an IDLE/READY state, allowing the bandwidth over this interface to be 
nearly doubled.
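
A minimal sketch of how sequence numbers can stand in for the IDLE/READY round trip: the requester treats a request as finished once the response echoes its sequence number, then presents the next request directly. The request and response code names are from the interface list above; the numeric encodings, the 4-bit sequence width, and the helper names are assumptions for illustration only.

```python
from dataclasses import dataclass

IDLE, LOAD, STORE, SWAP = range(4)   # request codes named in the post
READY, OK, HOLD, FAIL = range(4)     # response codes named in the post


@dataclass
class Request:
    code: int = IDLE
    seq: int = 0


@dataclass
class Response:
    code: int = READY
    seq: int = 0


def is_complete(req: Request, resp: Response) -> bool:
    """Done once the responder reports OK for this exact sequence number."""
    return resp.code == OK and resp.seq == req.seq


def next_request(prev: Request, code: int, seq_bits: int = 4) -> Request:
    """Issue the next request without passing through IDLE.

    seq_bits is a hypothetical width; the counter simply wraps.
    """
    return Request(code, (prev.seq + 1) % (1 << seq_bits))
```

A stale OK left over from the previous request carries the old sequence number, so it is not mistaken for completion of the new one.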

In a SWAP request, the Load and Store are performed end to end.

General bandwidth for a 16-bit DDR2 chip running at 50MHz (DLL disabled, 
effectively a low-power / standby mode) is ~ 90 MB/sec (or 47 MB/s each 
direction for SWAP), which is fairly close to the theoretical limit 
(internally, the logic for the DDR controller runs at 100MHz, driving IO 
as 100MHz SDR, albeit using both posedge and negedge for sampling 
responses from the DDR chip, so ~ 200 MHz if seen as SDR).

Theoretically, would be faster to access the chip using the SERDES 
interface, but:
   Hadn't gone up the learning curve for this;
   Unclear if I could really effectively utilize the bandwidth with a
     50MHz CPU and my current bus;
   Actual bandwidth gains would be smaller, as then CAS and RAS latency
     would dominate.

Could in theory have used Vivado MIG, but then I would have needed to 
deal with AXI, and never crossed the threshold of wanting to deal with AXI.


Between CPU, L2, and various other devices, I am using a ringbus:
   Connections:
     128 bits data;
     48 bits address (96 bits between L1 caches and TLB);
     16 bits: request/response code and flags;
     16 bits: source/dest node and request sequence number;
   Each node has a set of input and output connections;
     Each node may modify a request/response,
       or simply forward from input to output.
     Messages move along at one position per clock cycle.
       Generally also 50 MHz at present (*1).

*1: Pretty much everything (apart from some hardware interfaces) runs on 
the same clock. Some devices needed faster clocks. Any slower clocks 
were generally faked using accumulator dividers (add a fraction every 
clock-cycle and use the MSB of the accumulator as the virtual clock).
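
The accumulator divider can be modeled in a few lines. This is a software sketch of the general technique (add a fixed fraction each base clock, take the accumulator's top bit as the derived clock), not the actual Verilog; the 16-bit accumulator width and the function names are assumptions.

```python
def increment_for(f_base: int, f_out: int, acc_bits: int = 16) -> int:
    """Per-cycle increment so the MSB toggles at roughly f_out."""
    return round(f_out / f_base * (1 << acc_bits))


def virtual_clock(increment: int, acc_bits: int, cycles: int):
    """Return the accumulator MSB sampled on each base clock cycle."""
    mask = (1 << acc_bits) - 1
    acc = 0
    samples = []
    for _ in range(cycles):
        acc = (acc + increment) & mask          # wraps like a hardware register
        samples.append(acc >> (acc_bits - 1))   # MSB is the virtual clock
    return samples


def rising_edges(samples) -> int:
    """Count 0 -> 1 transitions in the sampled virtual clock."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a == 0 and b == 1)
```

With a 50 MHz base clock and a 12.5 MHz target, the increment is a quarter of the accumulator range, and 400 base cycles produce 100 rising edges. Ratios that are not powers of two still average out to the right frequency, at the cost of some cycle-to-cycle jitter.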


Comparatively, the per-node logic cost isn't too high, nor is the logic 
complexity. However, performance of the ring is very sensitive to ring 
latency (and there are a number of hacks to try to reduce the overall 
latency of the ring in common paths).
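
To see why ring latency matters so much, here is a toy model. It assumes a one-way ring where every node adds exactly one clock of delay and the target responds instantly; those assumptions and the function names are mine, and the real bus (with its shortcut hacks) will differ.

```python
def hops(src: int, dst: int, n_nodes: int) -> int:
    """Clocks for a message to travel src -> dst, one node per clock."""
    return (dst - src) % n_nodes


def round_trip(src: int, dst: int, n_nodes: int) -> int:
    """Request out plus response back, both travelling the same direction."""
    return hops(src, dst, n_nodes) + hops(dst, src, n_nodes)
```

Under these assumptions any request/response pair between two different nodes costs one full lap, so each node added to the ring adds a clock to every access, regardless of where the target sits.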


At present, the highest resolution video modes that can be managed 
semi-effectively are 640x400 and 640x480 256-color (60Hz), or ~ 20 MB/sec.

Can do 800x600 or similar in RGBI or color-cell modes (640x400 or 
640x480 CC also being an option). Theoretically, there is a 1024x768 
monochrome mode, but this is mostly untested. The 4-color and monochrome 
modes had optional Bayer-pattern sub-modes to mimic full color.

Main modes I have ended up using:
   80x25 and 80x50 text/color-cell modes;
     Text and color cell graphics exist in the same mode.
   320x200 hi-color (RGB555);
   640x400 indexed 256 color.


Trying to go much higher than this, and the combination of ringbus 
latency and L2 misses turns the display into a broken mess (with a DRAM 
backed framebuffer). Originally, I had the framebuffer in Block-RAM, but 
this in turn set a hard limit based on framebuffer size (and putting the 
framebuffer in DRAM allowed for a bigger L2 cache).

Theoretically, could allow higher resolution modes by adding a fast path 
between the display output and DDR RAM interface (with access then being 
multiplexed with the L2 cache). Have not done so.

Or, possible but more radical:
   Bolt the VGA output module directly to the L2 cache;
   Could theoretically do 800x600 high-color
     Would eat around 2/3 of total RAM bandwidth.
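
The bandwidth figures can be checked with simple arithmetic. The sketch below computes the average scan-out rate as width x height x bytes per pixel x refresh rate, and compares it against the ~90 MB/sec DRAM figure quoted earlier. It assumes 60 Hz refresh for all modes and 2 bytes per pixel for high-color, and it ignores blanking intervals and burst overhead, so peak demand during active scan-out would be higher.

```python
RAM_BYTES_PER_SEC = 90_000_000  # ~90 MB/sec figure quoted in the post


def scanout_bytes_per_sec(width: int, height: int,
                          bytes_per_pixel: int, refresh_hz: int = 60) -> int:
    """Average framebuffer read rate needed to refresh the display."""
    return width * height * bytes_per_pixel * refresh_hz


def fraction_of_ram(rate: int, ram_rate: int = RAM_BYTES_PER_SEC) -> float:
    """Share of total DRAM bandwidth consumed by scan-out."""
    return rate / ram_rate


vga_256 = scanout_bytes_per_sec(640, 480, 1)    # 18,432,000 bytes/sec
svga_hi = scanout_bytes_per_sec(800, 600, 2)    # 57,600,000 bytes/sec
svga_share = fraction_of_ram(svga_hi)           # about 0.64
```

So 640x480 in 256 colors lands near the ~20 MB/sec mentioned above, and 800x600 high-color works out to roughly 64% of the available bandwidth, in line with the "around 2/3" estimate.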

Major concern here is that setting resolutions too high would starve the 
CPU of the ability to access memory (vs the current situation where 
trying to set higher resolutions mostly results in progressively worse 
display glitches).

Logic would need to be in place so that display can't totally hog the 
========== REMAINDER OF ARTICLE TRUNCATED ==========