
Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.arch
Subject: Re: memcpy and friend (was: 80286 protected mode)
Date: Tue, 15 Oct 2024 13:20:31 +0200
Organization: A noiseless patient Spider
Lines: 251
Message-ID: <velj60$1lhfe$2@dont-email.me>
References: <2024Oct6.150415@mips.complang.tuwien.ac.at>
 <memo.20241006163428.19028W@jgd.cix.co.uk>
 <2024Oct7.093314@mips.complang.tuwien.ac.at>
 <7c8e5c75ce0f1e7c95ec3ae4bdbc9249@www.novabbs.org>
 <2024Oct8.092821@mips.complang.tuwien.ac.at> <ve5ek3$2jamt$1@dont-email.me>
 <ve6gv4$2o2cj$1@dont-email.me> <ve6olo$2pag3$2@dont-email.me>
 <73e776d6becb377b484c5dcc72b526dc@www.novabbs.org>
 <ve7sco$31tgt$1@dont-email.me>
 <2b31e1343b1f3fadd55ad6b87d879b78@www.novabbs.org>
 <ve99fg$38kta$1@dont-email.me> <veh6j8$q71j$1@dont-email.me>
 <vej5p5$1772o$1@dont-email.me> <vejagr$181vo$1@dont-email.me>
 <vejcqc$1772o$3@dont-email.me> <20241014190856.00003a58@yahoo.com>
 <velaia$1kbdj$1@dont-email.me> <20241015131241.00006023@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 15 Oct 2024 13:20:32 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="47228e08c2736a5aef9f5441cfbf6fae";
	logging-data="1754606"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19qNuexlhoPQ6lN+bmgAeDB2kcDc5MkHzo="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.11.0
Cancel-Lock: sha1:UOgAcFoT3FMrGWCiGZazzvQP0Po=
Content-Language: en-GB
In-Reply-To: <20241015131241.00006023@yahoo.com>
Bytes: 12740

On 15/10/2024 12:12, Michael S wrote:
> On Tue, 15 Oct 2024 10:53:30 +0200
> David Brown <david.brown@hesbynett.no> wrote:
> 
>> On 14/10/2024 18:08, Michael S wrote:
>>> On Mon, 14 Oct 2024 17:19:40 +0200
>>> David Brown <david.brown@hesbynett.no> wrote:
>>>    
>>>> On 14/10/2024 16:40, Terje Mathisen wrote:

(I'm snipping for space - hopefully not too much.)

>>>>   
>>>>> REP MOVSB on x86 does the canonical memcpy() operation, originally
>>>>> by moving single bytes, and this was so slow that we also had REP
>>>>> MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
>>>>> REP MOVSQ on 64-bit cpus.
>>>>>
>>>>> With a suitable chunk of logic, the basic MOVSB operation could in
>>>>> fact handle any kinds of alignments and sizes, while doing the
>>>>> actual transfer at maximum bus speeds, i.e. at least one cache
>>>>> line/cycle for things already in $L1.
>>>>>       
>>>>
>>>> I agree on all of that.
>>>>
>>>> I am quite happy with the argument that suitable hardware can do
>>>> these basic operations faster than a software loop or the x86 "rep"
>>>> instructions.
>>>
>>> No, that's not true. And according to my understanding, that's not
>>> what Terje wrote.
>>> REP MOVSB _is_ an almost ideal instruction for memcpy (modulo minor
>>> details - fixed registers for src, dest, len and Direction flag in
>>> PSW instead of being part of the opcode).
>>
>> My understanding of what Terje wrote is that REP MOVSB /could/ be an
>> efficient solution if it were backed by a hardware block to run well
>> (i.e., transferring as many bytes per cycle as memory bus bandwidth
>> allows).  But REP MOVSB is /not/ efficient - and rather than making
>> it work faster, Intel introduced variants with wider fixed sizes.
>>
> 
> Above a count of ~2000 bytes, REP MOVSB on the last few generations of
> Intel and AMD is very efficient.

OK.  That is news to me, and different from what I had thought.

> One can construct a case where a software implementation is a little
> faster in one selected benchmark or another, but typically at the cost
> of being slower in other situations.
> For smaller counts the story is different.
> 
>> Could REP MOVSB realistically be improved to be as efficient as the
>> instructions in ARMv9, RISC-V, and Mitch's "MM" instruction?  I
>> don't know.  Intel and AMD have had many decades to do so, so I
>> assume it's not an easy improvement.
>>
> 
> You somehow assume that REP MOVSB would have to be improved. 

That is certainly what I have been assuming.  I haven't investigated it 
myself in any way, I've merely inferred it from other posts.  So unless 
someone else provides more information, I'll take your word for it that 
at least for modern x86 devices and large copies, it's already about as 
efficient as it could be.
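For concreteness, the kind of REP MOVSB copy under discussion can be
sketched in GNU C with inline assembly.  This is only a sketch, not a
libc implementation (the function name is made up): real libraries gate
the fast path on the ERMS/FSRM CPUID feature bits and on size
thresholds.

```c
#include <stddef.h>

/* Sketch of a memcpy that uses REP MOVSB on x86-64 and falls back to a
   plain byte loop elsewhere.  Real implementations additionally check
   the ERMS/FSRM feature bits and a size threshold before taking the
   REP MOVSB path. */
static void *copy_rep_movsb(void *dest, const void *src, size_t n)
{
#if defined(__x86_64__) && defined(__GNUC__)
    void *d = dest;
    /* REP MOVSB copies RCX bytes from [RSI] to [RDI]; architecturally
       byte-by-byte, but executed in wide chunks on modern cores. */
    __asm__ volatile("rep movsb"
                     : "+D"(d), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    unsigned char *d = dest;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return dest;
}
```

The fallback byte loop is only there so the sketch compiles and runs on
non-x86 targets as well.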

> That
> remains to be seen.
> It's quite likely that when (or 'if', in the case of My 66000) those
> alternatives you mention hit silicon, we will find out that REP MOVSB
> is already better as it is, at least for memcpy(). For memmove(), esp.
> for short memmove(), REP MOVSB is easier to beat, because it was not
> designed with memmove() in mind.
> 
>>> REP MOVSW/D/Q were introduced because back then processors were
>>> small and stupid. When your processor is big and smart you don't
>>> need them any longer. REP MOVSB is sufficient.
>>> New Arm64 instructions that are hopefully coming next year are akin
>>> to REP MOVSB rather than to MOVSW/D/Q.
>>> Instructions for memmove, also defined by Arm and by Mitch, are the
>>> next logical step. IMHO, the main gain here is not a measurable
>>> improvement in performance, but a saving in code size when inlined.
>>>
>>> Now, is all that a good idea?
>>
>> That's a very important question.
>>
>>> I am not 100% convinced.
>>> One can argue that the streaming alignment hardware necessary for a
>>> 1st-class implementation of these instructions is useful not only
>>> for memory copies.
>>> So, maybe, it makes sense to expose this hardware in more generic
>>> ways.
>>
>> I believe that is the idea of "scalable vector" instructions as an
>> alternative philosophy to wide explicit SIMD registers.  My
>> expectation is that SVE implementations will be more effort in the
>> hardware than SIMD for any specific SIMD-friendly size point (i.e.,
>> power-of-two widths).  That usually corresponds to lower clock rates
>> and/or higher latency and more coordination from extra pipeline
>> stages.
>>
>> But once you have SVE support in place, then memcpy() and memset()
>> are just examples of vector operations that you get almost for free
>> when you have hardware for vector MACs and other operations.
>>
> 
> You don't seem to understand what the 'S' in SVE stands for.
> Read more manuals and fewer marketing slides.
> Or try to write and profile code that utilizes SVE - that would improve
> your understanding more than anything else.
> 

It means "scalable".  The idea is that the same binary code will use 
different stride sizes on different hardware - a bigger implementation 
of the core might have vector units handling wider strides than a 
smaller one.  Am I missing something?
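The idea can be illustrated in plain C rather than real SVE intrinsics
(so the names here are illustrative, not Arm's API; `hw_vector_bytes`
is a made-up stand-in for reading the hardware vector length, which SVE
does with instructions such as CNTB): a VL-agnostic loop strides by a
runtime vector length, so the same binary adapts to wider hardware.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for querying the hardware vector length; 16
   here models a 128-bit implementation, and a 512-bit core would
   return 64 without any change to the loop below. */
static size_t hw_vector_bytes(void)
{
    return 16;
}

/* A width-agnostic copy loop in the SVE style: the stride is a runtime
   value rather than a compile-time constant.  Real SVE would use
   predicated vector loads/stores for the tail instead of a scalar
   clean-up loop. */
static void copy_vl_agnostic(unsigned char *dst, const unsigned char *src,
                             size_t n)
{
    size_t vl = hw_vector_bytes();
    size_t i = 0;
    for (; i + vl <= n; i += vl)
        memcpy(dst + i, src + i, vl);   /* stands in for a vector ld/st */
    for (; i < n; i++)                  /* tail; SVE would predicate it */
        dst[i] = src[i];
}
```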

> Also, you don't seem to understand the issue at hand, which is exposing
> hardware that aligns a *stream* of N+1 aligned loads, turning it into N
> unaligned loads.
> In the absence of a 'load multiple' instruction, 128-bit SVE would help
> you here no more than 128-bit NEON. Moreover, 512-bit SVE wouldn't help
> enough, even ignoring the absence of any prospect of 512-bit SVE in
> mainstream ARM64 cores.
> Maybe, at the ISA level, SME is a better base for what is wanted.
> But
>   - SME would be quite bad for copy of small segments.

I would expect a certain amount of overhead, which will be a cost for 
small copies.

>   - SME does not appear to get much love from Arm vendors other than Apple

If you say so.  My main interest is in microcontrollers, and I don't 
track all the details of larger devices.

>   - SME blocks are expected to be implemented not in close proximity to
>     the rest of the CPU core, which would make them problematic not just
>     for copying small segments, but for medium-length segments (a few
>     KB) as well.
> 

That sounds like a poor design choice to me, but again I don't know the 
details.

>>> Maybe via Load Multiple Register? It was present in Arm's A32/T32,
>>> but didn't make it into ARM64. Or maybe there are even better ways
>>> that I have not thought of.
>>>    
>>>> And I fully agree that these would be useful features
>>>> in general-purpose processors.
>>>>
>>>> My only point of contention is that the existence or lack of such
>>>> instructions does not make any difference to whether or not you can
>>>> write a good implementation of memcpy() or memmove() in portable
>>>> standard C.
>>>
>>> You are moving the goalposts.
>>
>> No, my goalposts have been in the same place all the time.  Some
>> others have been kicking the ball at a completely different set of
>> goalposts, but I have kept the same point all along.
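For concreteness, the kind of implementation in question might look
like the sketch below (the function name is mine).  One hedge worth
noting: ordering two pointers with '<' is only defined by standard C
when both point into the same object, so the overlap test here goes
through uintptr_t, which is exactly the sort of detail this sub-thread
is arguing about.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of memmove in (almost) portable C.  The overlap test converts
   the pointers to uintptr_t because comparing pointers into different
   objects with '<' is undefined in standard C; the integer comparison
   is implementation-defined but works on common platforms. */
static void *my_memmove(void *dest, const void *src, size_t n)
{
    unsigned char *d = dest;
    const unsigned char *s = src;

    if ((uintptr_t)d < (uintptr_t)s) {
        while (n--)             /* no risk of clobbering unread source */
            *d++ = *s++;
    } else if ((uintptr_t)d > (uintptr_t)s) {
        d += n;                 /* copy backwards to handle overlap */
        s += n;
        while (n--)
            *--d = *--s;
    }
    return dest;
}
```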
>>
>>> One does not need "good implementation" in a sense you have in
========== REMAINDER OF ARTICLE TRUNCATED ==========