From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Arm ldaxr / stxr loop question
Date: Tue, 12 Nov 2024 16:55:40 -0600
Message-ID: <vh0mdo$1qn42$1@dont-email.me>

On 11/12/2024 6:14 AM, aph@littlepinkcloud.invalid wrote:
> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>> Any idea what is the advantage for them having all these various
>> LDxxx and STxxx instructions that only seem to combine a LD or ST
>> with a fence instruction? Why have
>> LDAPR Load-Acquire RCpc Register
>> LDAR Load-Acquire Register
>> LDLAR LoadLOAcquire Register
>>
>> plus all the variations for byte, half, word, and pair,
>> instead of just the standard LDx and a general data fence instruction?
>
> All this, and much more can be discovered by reading the AMBA
> specifications. However, the main point is that the content of the
> target address does not have to be transferred to the local cache:
> these are remote atomic operations. Quite nice for things like
> fire-and-forget counters, for example.
>

I ended up mostly with a simpler model, IMO:
  Normal / RAM-like:
    Fetch cache line, write back when evicting;
    Operations: LoadTile, StoreTile, SwapTile, LoadPrefetch,
      StorePrefetch.
  Volatile (RAM-like):
    Fetch, operate, write back.
  MMIO:
    Remote Load/Store/Swap request;
    Operation is performed on the target;
    Currently only supports DWORD and QWORD access;
    Operations are strictly sequential.

In theory, MMIO-style access could also be applied to RAM, but it is
unclear if this would be worth the added cost and complexity. It
would, however, make it easier to enforce strict consistency.

The LoadPrefetch and StorePrefetch operations:
  LoadPrefetch: try to perform a load from RAM;
    Always responds immediately;
    Signals whether it was an L2 hit or an L2 miss.
  StorePrefetch: basically like LoadPrefetch;
    Signals that the intention is to write to memory.

In my cache and bus design, I sometimes refer to cache lines as
"tiles", partly because of how I viewed them as operating, which
didn't exactly match the online descriptions of cache lines. Say:

  Tile: 16 bytes in the current implementation;
    Accessed in even and odd rows;
    A memory access may span an even tile and an odd tile;
    The L1 caches need a matched pair of tiles for an access.
  Cache line: usually described as always 32 bytes;
    Descriptions seemed to assume only a single row of lines in caches;
    Generally no mention of allowing for an even/odd scheme.
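As a rough illustration of the even/odd pairing, a minimal C sketch
(the sizes, names, and direct-mapped indexing here are made up for
illustration; the real thing is FPGA logic rather than C):

  #include <stdint.h>

  #define TILE_SHIFT  4        /* 16-byte tiles */
  #define L1_SETS     256      /* illustrative depth, not the real size */

  typedef struct {
      uint64_t tag;            /* here simply the tile number */
      uint8_t  data[16];
      uint8_t  valid;
  } L1Tile;

  static L1Tile l1_even[L1_SETS];  /* even-numbered tiles */
  static L1Tile l1_odd [L1_SETS];  /* odd-numbered tiles  */

  /* Check whether an access of 'size' bytes at 'addr' hits in the L1.
     The access may span tiles T and T+1; one of the pair is even and
     the other odd, so both arrays can be probed in the same cycle.   */
  static int l1_hit(uint64_t addr, int size)
  {
      uint64_t t_lo = addr >> TILE_SHIFT;
      uint64_t t_hi = (addr + size - 1) >> TILE_SHIFT;

      uint64_t t_even = (t_lo & 1) ? t_hi : t_lo;
      uint64_t t_odd  = (t_lo & 1) ? t_lo : t_hi;

      L1Tile *e = &l1_even[(t_even >> 1) % L1_SETS];
      L1Tile *o = &l1_odd [(t_odd  >> 1) % L1_SETS];

      int hit_e = e->valid && (e->tag == t_even);
      int hit_o = o->valid && (o->tag == t_odd);

      if (t_lo == t_hi)                    /* fits in a single tile */
          return (t_lo & 1) ? hit_o : hit_e;
      return hit_e && hit_o;               /* needs the matched pair */
  }

The point being that a spanning access never needs two tiles from the
same array, which keeps it workable within 1R1W block-RAM limits.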
Seemingly, a cache that operated with cache lines would use a single
row of 32-byte cache lines, with misaligned accesses presumably
spanning a pair of adjacent cache lines. To fit with BRAM access
patterns, one would likely need to split lines in half and then mirror
the relevant tag bits (to allow detecting hit/miss). However, online
descriptions generally made no mention of how misaligned accesses were
intended to be handled within the limits of a dual-ported RAM (1R1W).

My L2 cache operates more like the traditional descriptions of cache
lines, except that the lines are currently 64 bytes (and internally
subdivided into four 16-byte parts). The use of 64 bytes was mostly
because this size got the most bandwidth out of my DDR interface (with
16- or 32-byte transfers, more cycles are spent on overhead, though
latency was lower).

In this case, the L2<->RAM interface has:
  512-bit Load Data;
  512-bit Store Data;
  Load Address;
  Store Address;
  Request Code (IDLE/LOAD/STORE/SWAP);
  Request Sequence Number;
  Response Code (READY/OK/HOLD/FAIL);
  Response Sequence Number.

Originally, there were no sequence numbers, and IDLE/READY signaling
was used between each request (the interface needed to return to this
state before starting a new request). The sequence numbers avoided
needing to return to an IDLE/READY state, allowing the bandwidth over
this interface to be nearly doubled. In a SWAP request, the Load and
Store are performed end to end.

General bandwidth for a 16-bit DDR2 chip running at 50MHz (DLL
disabled, effectively a low-power / standby mode) is ~ 90 MB/sec (or
~ 47 MB/sec each direction for SWAP), which is fairly close to the
theoretical limit. (Internally, the logic for the DDR controller runs
at 100MHz, driving IO as 100MHz SDR, albeit using both posedge and
negedge for sampling responses from the DDR chip, so ~ 200 MHz if seen
as SDR.)

Theoretically, it would be faster to access the chip using the SERDES
interface, but:
  I hadn't gone up the learning curve for this;
  It is unclear if I could really utilize the bandwidth effectively
    with a 50MHz CPU and my current bus;
  Actual bandwidth gains would be smaller, as CAS and RAS latency
    would then dominate.

Could in theory have used the Vivado MIG, but then I would have needed
to deal with AXI, and I never crossed the threshold of wanting to deal
with AXI.

Between the CPU, L2, and various other devices, I am using a ringbus:
  Connections:
    128 bits: data;
    48 bits: address (96 bits between the L1 caches and the TLB);
    16 bits: request/response code and flags;
    16 bits: source/dest node and request sequence number.
  Each node has a set of input and output connections;
  Each node may modify a request/response, or simply forward it from
    input to output;
  Messages move along at one position per clock cycle;
  Generally also 50 MHz at present (*1).

*1: Pretty much everything (apart from some hardware interfaces) runs
on the same clock. Some devices needed faster clocks. Any slower
clocks were generally faked using accumulator dividers (add a fraction
every clock cycle and use the MSB of the accumulator as the virtual
clock).

Comparably, the per-node logic cost isn't too high, nor is the logic
complexity. However, performance of the ring is very sensitive to ring
latency (and there are a number of hacks to try to reduce the overall
latency of the ring on common paths).

At present, the highest resolution video modes that can be managed
semi-effectively are 640x400 and 640x480 256-color (60Hz), or ~ 20
MB/sec. Can do 800x600 or similar in RGBI or color-cell modes (640x400
or 640x480 CC also being an option).
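As a quick sanity check on the ~ 20 MB/sec figure (my own
back-of-the-envelope arithmetic, assuming 1 byte per pixel for the
256-color modes):

  #include <stdio.h>

  int main(void)
  {
      /* refresh bandwidth = width * height * bytes/pixel * refresh Hz */
      double bw_640x400 = 640.0 * 400 * 1 * 60;   /* ~ 15.4 MB/sec */
      double bw_640x480 = 640.0 * 480 * 1 * 60;   /* ~ 18.4 MB/sec */

      printf("640x400 @ 60Hz, 8bpp: %.1f MB/sec\n", bw_640x400 / 1e6);
      printf("640x480 @ 60Hz, 8bpp: %.1f MB/sec\n", bw_640x480 / 1e6);
      return 0;
  }

Which lines up with the stated figure, at roughly a fifth of the
~ 90 MB/sec available from the DDR interface.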
Theoretically, there is a 1024x768 monochrome mode, but this is mostly
untested. The 4-color and monochrome modes had optional Bayer-pattern
sub-modes to mimic full color.

Main modes I have ended up using:
  80x25 and 80x50 text/color-cell modes;
    Text and color-cell graphics exist in the same mode.
  320x200 hi-color (RGB555);
  640x400 indexed 256-color.

Going much higher than this, the combination of ringbus latency and L2
misses turns the display into a broken mess (with a DRAM-backed
framebuffer). Originally, I had the framebuffer in Block-RAM, but this
in turn set a hard limit based on the framebuffer size (and putting
the framebuffer in DRAM allows for a bigger L2 cache).

Theoretically, I could allow higher-resolution modes by adding a fast
path between the display output and the DDR RAM interface (with access
then being multiplexed with the L2 cache). Have not done so.

Or, possible but more radical:
  Bolt the VGA output module directly to the L2 cache;
  Could theoretically do 800x600 high-color;
  Would eat around 2/3 of total RAM bandwidth (800x600 at 2
    bytes/pixel and 60Hz is ~ 58 MB/sec of the ~ 90 MB/sec total).

The major concern here is that setting resolutions too high would
starve the CPU of the ability to access memory (vs the current
situation, where trying to set higher resolutions mostly results in
progressively worse display glitches). Logic would need to be in place
so that the display can't totally hog the

========== REMAINDER OF ARTICLE TRUNCATED ==========