Article <20240909160847.000062a2@yahoo.com>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Michael S <already5chosen@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Mon, 9 Sep 2024 16:08:47 +0300
Organization: A noiseless patient Spider
Lines: 116
Message-ID: <20240909160847.000062a2@yahoo.com>
References: <2024Aug30.161204@mips.complang.tuwien.ac.at>
	<86v7zep35n.fsf@linuxsc.com>
	<20240902180903.000035ee@yahoo.com>
	<vb7ank$3d0c5$1@dont-email.me>
	<20240903190928.00002f92@yahoo.com>
	<vb7idh$3e2af$1@dont-email.me>
	<86seufo11j.fsf@linuxsc.com>
	<vba6qa$3u4jc$1@dont-email.me>
	<1246395e530759ac79805e45b3830d8f@www.novabbs.org>
	<8634m9lga1.fsf@linuxsc.com>
	<vbmb3h$2bfqh$1@dont-email.me>
	<20240909122219.00007f81@yahoo.com>
	<2024Sep9.123034@mips.complang.tuwien.ac.at>
	<20240909145854.00001e4e@yahoo.com>
	<2024Sep9.142813@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 09 Sep 2024 15:08:25 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="45fff2496b15112b5e4e03cadfa28742";
	logging-data="2041887"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/fS8LQ6jm+GADwVcF81w50WlFQMzTZunc="
Cancel-Lock: sha1:3UOnROMuBLgZyFOLpDzFRSUkX34=
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
Bytes: 5611

On Mon, 09 Sep 2024 12:28:13 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Michael S <already5chosen@yahoo.com> writes:
> >On Mon, 09 Sep 2024 10:30:34 GMT
> >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:  
> >> One would hope so, but here's what happens with gcc-12:
> >> 
> >> #include <string.h>
> >> 
> >> void foo1(char *p, char* q)
> >> {
> >>   memcpy(p,q,32);
> >> }
> >> 
> >> void foo2(char *p, char* q)
> >> {
> >>   memmove(p,q,32);
> >> }
> >> 
> >> gcc -O3 -mavx2 -c -Wall xxx-memmove.c ; objdump -d xxx-memmove.o:
> >> 
> >> 0000000000000000 <foo1>:
> >>    0:   c5 fa 6f 06             vmovdqu (%rsi),%xmm0
> >>    4:   c5 fa 7f 07             vmovdqu %xmm0,(%rdi)
> >>    8:   c5 fa 6f 4e 10          vmovdqu 0x10(%rsi),%xmm1
> >>    d:   c5 fa 7f 4f 10          vmovdqu %xmm1,0x10(%rdi)
> >>   12:   c3                      ret
> >>   13:   66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
> >>   1a:   00 00 00 00 
> >>   1e:   66 90                   xchg   %ax,%ax
> >> 
> >> 0000000000000020 <foo2>:
> >>   20:   ba 20 00 00 00          mov    $0x20,%edx
> >>   25:   e9 00 00 00 00          jmp    2a <foo2+0xa>
> >> 
> >> The jmp in line 25 is probably a tail-call to memmove().
> >> 
> >> My guess is that xmm registers and unrolling are used here rather
> >> than ymm registers because waking up the second 128 bits takes
> >> time.  But even with that, the code uses two different registers,
> >> and if scheduled differently, could be used for implementing
> >> foo2():
> >> 
> >>    0:   c5 fa 6f 06             vmovdqu (%rsi),%xmm0
> >>    8:   c5 fa 6f 4e 10          vmovdqu 0x10(%rsi),%xmm1
> >>    4:   c5 fa 7f 07             vmovdqu %xmm0,(%rdi)
> >>    d:   c5 fa 7f 4f 10          vmovdqu %xmm1,0x10(%rdi)
> >>   12:   c3                      ret
> >> 
> >> - anton  
> >
> >Try -march instead of -mavx2. E.g. -march=haswell
> >Sometimes gcc is beyond logic.  
> 
> For gcc -O3 -march=haswell I got the same result (with gcc-12).  I
> also tried -march=x86-64-v3 with the same result.
> 
> But gcc -O3 -march=x86-64-v4 produced:
> 

My gcc was 14.1 with -O2. With -march=haswell it produced the same code
as yours below (for the case of 32).

> 0000000000000000 <foo1>:
>    0:   c5 fe 6f 06             vmovdqu (%rsi),%ymm0
>    4:   c5 fe 7f 07             vmovdqu %ymm0,(%rdi)
>    8:   c5 f8 77                vzeroupper
>    b:   c3                      ret
>    c:   0f 1f 40 00             nopl   0x0(%rax)
> 
> 0000000000000010 <foo2>:
>   10:   c5 fe 6f 06             vmovdqu (%rsi),%ymm0
>   14:   c5 fe 7f 07             vmovdqu %ymm0,(%rdi)
>   18:   c5 f8 77                vzeroupper
>   1b:   c3                      ret
> 
> And when changing the length to 64:
> 
> 0000000000000000 <foo1>:
>    0:   62 f1 fe 48 6f 06       vmovdqu64 (%rsi),%zmm0
>    6:   62 f1 fe 48 7f 07       vmovdqu64 %zmm0,(%rdi)
>    c:   c5 f8 77                vzeroupper
>    f:   c3                      ret
> 
> 0000000000000010 <foo2>:
>   10:   62 f1 fe 48 6f 06       vmovdqu64 (%rsi),%zmm0
>   16:   62 f1 fe 48 7f 07       vmovdqu64 %zmm0,(%rdi)
>   1c:   c5 f8 77                vzeroupper
>   1f:   c3                      ret
> 

And here I got different code for -march=tigerlake and
-march=znver4, despite both having approximately the same ISA.
It seems that for Tiger Lake gcc is over-concerned about the impact of
unaligned 64-byte accesses.

> But when changing the length to 63:
> 
> 0000000000000000 <foo1>:
>    0:   c5 fe 6f 06             vmovdqu (%rsi),%ymm0
>    4:   c5 fe 7f 07             vmovdqu %ymm0,(%rdi)
>    8:   c5 fe 6f 4e 1f          vmovdqu 0x1f(%rsi),%ymm1
>    d:   c5 fe 7f 4f 1f          vmovdqu %ymm1,0x1f(%rdi)
>   12:   c5 f8 77                vzeroupper
>   15:   c3                      ret
>   16:   66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
>   1d:   00 00 00 
> 
> 0000000000000020 <foo2>:
>   20:   ba 3f 00 00 00          mov    $0x3f,%edx
>   25:   e9 00 00 00 00          jmp    2a <foo2+0xa>
> 
> - anton
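
Anton's reordering point quoted above (issue every load before the first
store) is what makes a fixed-size copy safe for overlapping buffers. A
minimal sketch in portable C follows. The names move32, move63 and
agrees_with_memmove are hypothetical helpers invented for this
illustration, not anything gcc emits or provides, and whether a given
compiler turns them into the vmovdqu sequences shown in the thread is not
guaranteed.

```c
#include <string.h>

/* Hypothetical helper: 32-byte move that tolerates overlap.
   Both 16-byte halves are read into temporaries before anything
   is written, mirroring the load, load, store, store ordering. */
static void move32(char *dst, const char *src)
{
    char lo[16], hi[16];
    memcpy(lo, src, 16);
    memcpy(hi, src + 16, 16);
    memcpy(dst, lo, 16);
    memcpy(dst + 16, hi, 16);
}

/* Hypothetical helper: 63-byte move built from two 32-byte chunks
   at offsets 0 and 31 (0x1f), which overlap by one byte. Again all
   loads precede all stores, so overlapping dst/src is fine. */
static void move63(char *dst, const char *src)
{
    char a[32], b[32];
    memcpy(a, src, 32);
    memcpy(b, src + 31, 32);
    memcpy(dst, a, 32);
    memcpy(dst + 31, b, 32);
}

/* Compare a mover against memmove for every dst-src distance in
   [-64, 64], which covers all overlapping cases for n <= 63.
   Returns 1 if the results always match, 0 otherwise. */
static int agrees_with_memmove(void (*mv)(char *, const char *), size_t n)
{
    enum { PAD = 64, SIZE = 192 };
    for (int shift = -PAD; shift <= PAD; shift++) {
        char got[SIZE], want[SIZE];
        for (int i = 0; i < SIZE; i++)
            got[i] = want[i] = (char)i;
        mv(got + PAD + shift, got + PAD);
        memmove(want + PAD + shift, want + PAD, n);
        if (memcmp(got, want, SIZE) != 0)
            return 0;
    }
    return 1;
}
```

Calling agrees_with_memmove(move32, 32) and agrees_with_memmove(move63, 63)
checks both helpers against the library memmove. The temporaries stand in
for the xmm/ymm registers; the inner memcpy calls never overlap because
each one has a local buffer on one side.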