From: "Paul A. Clayton" <paaronclayton@gmail.com>
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Sat, 20 Apr 2024 19:19:53 -0400
Organization: A noiseless patient Spider
Message-ID: <v038qo$bmtm$3@dont-email.me>
References: <uuk100$inj$1@dont-email.me> <15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org> <lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me> <d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org> <uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me> <uv46rg$e4nb$1@dont-email.me> <a81256dbd4f121a9345b151b1280162f@www.novabbs.org> <uv4ghh$gfsv$1@dont-email.me> <8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org> <uv7h9k$1ek3q$1@dont-email.me> <7uSRN.161295$m4d.65414@fx43.iad> <e4443c417f7145d65b04bec48160c629@www.novabbs.org>
In-Reply-To: <e4443c417f7145d65b04bec48160c629@www.novabbs.org>

On 4/11/24 7:12 PM, MitchAlsup1 wrote:
> Scott Lurndal wrote:
[snip]
>> It seems to me that an offloaded DMA engine would be a far
>> better way to do memmove (over some threshold, perhaps a
>> cache line) without trashing the caches. Likewise memset.
>
> Effectively, that is what HW does; even on the lower-end machines,
> the AGEN unit of the cache access pipeline is repeatedly cycled,
> and data is read and/or written.
> One can execute instructions not needing memory references while
> LDM, STM, ENTER, EXIT, MM, and MS are in progress.
>
> Moving this sequencer farther out would still require it to consume
> all L1 BW in any event (snooping) for memory consistency reasons.
> {Note: cache accesses are performed line-wide, not register-width
> wide.}

If the data was not in the L1 cache, only its absence would need to
be determined by the DMA engine. A snoop filter, tag-inclusive L2/L3
probing, or a similar mechanism could avoid L1 accesses. Even if the
source or destination of a memory copy was in L1, only one L1 access
per cache line might be needed.

I also wonder whether the cache fill and/or spill mechanism might be
decoupled from the load/store path such that, if the cache had enough
banks/subarrays, some loads and stores could be done in parallel with
a cache fill or spill/external-read-without-eviction. Tag checking
would limit the utility of such a scheme, though tags might also be
banked or accesses flexibly scheduled (at the cost of choosing a
victim early for fills). Of course, if the cache has such bandwidth
available, why not make it available to the core as well, even if it
would rarely be useful? (Perhaps higher register bandwidth is more
difficult to provide than higher cache bandwidth for banking-friendly
access patterns?)

Deciding when to bypass cache seems difficult (for both software
developers and hardware). Overwriting cache lines within the same
memory copy is obviously silly. Filling a cache with a memory copy is
also suboptimal, but L1 hardware copy-on-write would probably be too
complicated even with page-aligned copies. A copy from cacheable
memory to uncacheable memory (I/O) might be a strong hint that the
source should not be installed into the L1 or L2 cache; I would guess
that not installing the source would often be the right choice.

I could also imagine a programmer wanting to use memory copy as a
prefetch *directive* for a large chunk of memory (by having source
and destination be the same).
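As a software analogy of that self-copy idiom (not the hardware
mechanism under discussion), a memory-move front end could
special-case it roughly as follows; `mem_move` and the 64-byte line
size are illustrative assumptions, and a real implementation would
compare base registers rather than pointer values:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical memory-move front end: when source and destination
 * are identical, the copy moves no data, so it can be treated purely
 * as a prefetch directive over the range. */
static void mem_move(void *dst, const void *src, size_t n)
{
    if (dst == src) {
        /* Self-copy detected: issue software prefetches instead of
         * copying (64-byte cache line assumed for illustration). */
        for (size_t off = 0; off < n; off += 64)
            __builtin_prefetch((const char *)src + off, 0 /* read */, 3);
        return;
    }
    memmove(dst, src, n);   /* ordinary overlap-safe copy */
}
```

`__builtin_prefetch` is the GCC/Clang prefetch builtin; a prefetch is
only a hint, so the self-copy path is architecturally a no-op, which
matches the "directive" reading above.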
This idiom would be easy to detect (from and to base registers being
the same) but may be too niche to be worth detecting (in most
implementations). (My 66000 might use an idiom with a prefetch
instruction preceding a memory move to indicate the cache level of
the destination, but that only manages [some of] the difficulty of
the hardware choice.)

For memset, compression is also an obvious possibility. A memset
might not write any cache lines but rather cache the address range
and the set value, performing hardware copy-on-access into cache
lines.
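A rough software model of that deferred memset might look like the
following; the structure and names (`deferred_set`, `read_byte`) are
my own for illustration, and real hardware would track this per cache
line in the memory system rather than as one range:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model of a "compressed" memset: record the range and fill
 * value instead of writing every line; materialize the data only
 * when a later demand access touches the range. */
struct deferred_set {
    uint8_t *base;
    size_t   len;
    uint8_t  value;
    int      valid;
};

static void memset_deferred(struct deferred_set *ds,
                            void *p, uint8_t v, size_t n)
{
    ds->base = p; ds->len = n; ds->value = v; ds->valid = 1;
}

/* On a demand access, perform the copy-on-access: if the address
 * falls inside a pending deferred range, write the fill value out
 * (here, the whole range at once) before reading. */
static uint8_t read_byte(struct deferred_set *ds, uint8_t *p)
{
    if (ds->valid && p >= ds->base && p < ds->base + ds->len) {
        memset(ds->base, ds->value, ds->len);  /* materialize */
        ds->valid = 0;
    }
    return *p;
}
```

Until the first access, no data lines are written at all, which is
the bandwidth saving being suggested; the cost moves to the access
path, where every load must be checked against the pending range.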