From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Thu, 11 Apr 2024 23:12:25 +0000
Organization: Rocksolid Light
Message-ID: <e4443c417f7145d65b04bec48160c629@www.novabbs.org>
References: <uuk100$inj$1@dont-email.me> <15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org> <lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me> <d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org> <uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me> <uv46rg$e4nb$1@dont-email.me> <a81256dbd4f121a9345b151b1280162f@www.novabbs.org> <uv4ghh$gfsv$1@dont-email.me> <8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org> <uv7h9k$1ek3q$1@dont-email.me> <7uSRN.161295$m4d.65414@fx43.iad>

Scott Lurndal wrote:

> "Paul A. Clayton" <paaronclayton@gmail.com> writes:
>>On 4/9/24 8:28 PM, MitchAlsup1 wrote:
>>> BGB-Alt wrote:
>>[snip]
>>>> Things like memcpy/memmove/memset/etc, are function calls in 
>>>> cases when not directly transformed into register load/store 
>>>> sequences.
>>> 
>>> My 66000 does not convert them into LD-ST sequences, MM is a 
>>> single instruction.
>>
>>I wonder if it would be useful to have an immediate count form of
>>memory move. Copying fixed-size structures would be able to use an
>>immediate. Aside from not having to load an immediate for such
>>cases, there might be microarchitectural benefits to using a
>>constant. Since fixed-sized copies would likely be limited to
>>smaller regions (with the possible exception of 8 MiB page copies)
>>and the overhead of loading a constant for large sizes would be
>>tiny, only providing a 16-bit immediate form might be reasonable.

> It seems to me that an offloaded DMA engine would be a far
> better way to do memmove (over some threshold, perhaps a
> cache line) without trashing the caches.   Likewise memset.

Effectively, that is what HW does, even on the lower-end machines:
the AGEN unit of the cache access pipeline is repeatedly cycled,
and data is read and/or written. Instructions that need no memory
references can execute while LDM, STM, ENTER, EXIT, MM, and MS
are in progress.

Moving this sequencer farther out would still require it to consume
all L1 BW (for snooping) for memory consistency reasons.
{Note: cache accesses are performed line-wide, not register-width wide.}
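
To put rough numbers on the line-wide point, here is a minimal C
sketch that counts how many aligned units a copy touches. The 64-byte
line and 8-byte register widths are assumptions picked for
illustration, and units_touched() is a hypothetical helper, not
anything from the My 66000 definition.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed widths, for illustration only. */
#define LINE_BYTES 64u  /* one cache line */
#define REG_BYTES   8u  /* one 64-bit register */

/*
 * Number of aligned `unit`-byte blocks touched by the byte range
 * [addr, addr + len). Assumes addr + len does not wrap around.
 */
static size_t units_touched(uintptr_t addr, size_t len, size_t unit)
{
    if (len == 0)
        return 0;
    uintptr_t first = addr / unit;
    uintptr_t last  = (addr + len - 1) / unit;
    return (size_t)(last - first + 1);
}
```

Under these assumptions an aligned 256-byte move touches 4 lines,
where a register-width LD-ST sequence would need 32 loads and 32
stores to cover the same bytes.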

>>
>>>> Did end up with an intermediate "memcpy slide", which can handle 
>>>> medium size memcpy and memset style operations by branching into 
>>>> a slide.
>>> 
>>> MMs and MSs that do not cross page boundaries are ATOMIC. The 
>>> entire system sees only the before or only the after state and 
>>> nothing in between.

> One might wonder how that atomicity is guaranteed in a
> SMP processor...

The entire chunk of data traverses the interconnect as a single
transaction. All interested 3rd parties (neither originator nor
recipient) see either the memory state before the transfer or
the state after it, never anything in between.
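
As a sketch of the "does not cross page boundaries" condition, the C
below checks whether a transfer stays inside one page. The 4096-byte
page size is an assumption, both function names are hypothetical, and
applying the check to source and destination alike is my reading of
the condition for illustration, not a statement of the architecture.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_BYTES 4096u  /* assumed page size, for illustration only */

/*
 * True when the byte range [addr, addr + len) lies within a single
 * page. Assumes addr + len does not wrap around.
 */
static bool within_one_page(uintptr_t addr, size_t len)
{
    if (len == 0)
        return true;
    return (addr / PAGE_BYTES) == ((addr + len - 1) / PAGE_BYTES);
}

/*
 * Hypothetical predicate: a memory move qualifies for the
 * single-transaction case when neither operand crosses a page.
 */
static bool mm_stays_in_page(uintptr_t dst, uintptr_t src, size_t len)
{
    return within_one_page(dst, len) && within_one_page(src, len);
}
```

A move that fails this test would fall outside the atomic case
described above.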