From: Michael S
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Mon, 9 Sep 2024 16:08:47 +0300
Message-ID: <20240909160847.000062a2@yahoo.com>

On Mon, 09 Sep 2024 12:28:13 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Michael S writes:
> >On Mon, 09 Sep 2024 10:30:34 GMT
> >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> >> One would hope so, but here's what happens with gcc-12:
> >>
> >> #include <string.h>
> >>
> >> void foo1(char *p, char* q)
> >> {
> >>   memcpy(p,q,32);
> >> }
> >>
> >> void foo2(char *p, char* q)
> >> {
> >>   memmove(p,q,32);
> >> }
> >>
> >> gcc -O3 -mavx2 -c -Wall xxx-memmove.c ; objdump -d xxx-memmove.o:
> >>
> >> 0000000000000000 <foo1>:
> >>    0:   c5 fa 6f 06             vmovdqu (%rsi),%xmm0
> >>    4:   c5 fa 7f 07             vmovdqu %xmm0,(%rdi)
> >>    8:   c5 fa 6f 4e 10          vmovdqu 0x10(%rsi),%xmm1
> >>    d:   c5 fa 7f 4f 10          vmovdqu %xmm1,0x10(%rdi)
> >>   12:   c3                      ret
> >>   13:   66 66 2e 0f 1f 84 00    data16 cs nopw 0x0(%rax,%rax,1)
> >>   1a:   00 00 00 00
> >>   1e:   66 90                   xchg   %ax,%ax
> >>
> >> 0000000000000020 <foo2>:
> >>   20:   ba 20 00 00 00          mov    $0x20,%edx
> >>   25:   e9 00 00 00 00          jmp    2a <foo2+0xa>
> >>
> >> The jmp in line 25 is probably a tail-call to memmove().
> >>
> >> My guess is that xmm registers and unrolling are used here rather
> >> than ymm registers because waking up the second 128 bits takes
> >> time. But even with that, the code uses two different registers,
> >> and if scheduled differently, could be used for implementing
> >> foo2():
> >>
> >>    0:   c5 fa 6f 06             vmovdqu (%rsi),%xmm0
> >>    8:   c5 fa 6f 4e 10          vmovdqu 0x10(%rsi),%xmm1
> >>    4:   c5 fa 7f 07             vmovdqu %xmm0,(%rdi)
> >>    d:   c5 fa 7f 4f 10          vmovdqu %xmm1,0x10(%rdi)
> >>   12:   c3                      ret
> >>
> >> - anton
> >
> >Try -march instead of -mavx2. E.g. -march=haswell
> >Sometimes gcc is beyond logic.
>
> For gcc -O3 -march=haswell I got the same result (with gcc-12). I
> also tried -march=x86-64-v3 with the same result.
>
> But gcc -O3 -march=x86-64-v4 produced:
>

My gcc was 14.1 at -O2. With -march=haswell it produced the same code
as yours below (for the case of 32).

> 0000000000000000 <foo1>:
>    0:   c5 fe 6f 06             vmovdqu (%rsi),%ymm0
>    4:   c5 fe 7f 07             vmovdqu %ymm0,(%rdi)
>    8:   c5 f8 77                vzeroupper
>    b:   c3                      ret
>    c:   0f 1f 40 00             nopl   0x0(%rax)
>
> 0000000000000010 <foo2>:
>   10:   c5 fe 6f 06             vmovdqu (%rsi),%ymm0
>   14:   c5 fe 7f 07             vmovdqu %ymm0,(%rdi)
>   18:   c5 f8 77                vzeroupper
>   1b:   c3                      ret
>
> And when changing the length to 64:
>
> 0000000000000000 <foo1>:
>    0:   62 f1 fe 48 6f 06       vmovdqu64 (%rsi),%zmm0
>    6:   62 f1 fe 48 7f 07       vmovdqu64 %zmm0,(%rdi)
>    c:   c5 f8 77                vzeroupper
>    f:   c3                      ret
>
> 0000000000000010 <foo2>:
>   10:   62 f1 fe 48 6f 06       vmovdqu64 (%rsi),%zmm0
>   16:   62 f1 fe 48 7f 07       vmovdqu64 %zmm0,(%rdi)
>   1c:   c5 f8 77                vzeroupper
>   1f:   c3                      ret
>

And here I got different code for -march=tigerlake and -march=znver4,
despite both having approximately the same ISA. It seems that for
Tiger Lake gcc is over-concerned about the impact of unaligned 64-byte
accesses.
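
To make the alignment concern concrete: with 64-byte cache lines (an
assumption here, though it holds for current x86-64 cores), a 64-byte
access that is not 64-byte aligned always touches two lines, while a
32-byte access does so only for some addresses. The helper below is a
hypothetical illustration of that arithmetic, not a description of
gcc's cost model.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define CACHE_LINE 64u

/* Hypothetical helper, not from the thread: returns nonzero if the
   n bytes starting at p touch more than one cache line (n > 0). */
int crosses_cache_line(const void *p, size_t n)
{
    uintptr_t first = (uintptr_t)p / CACHE_LINE;
    uintptr_t last  = ((uintptr_t)p + n - 1) / CACHE_LINE;
    return first != last;
}
```

For n == 64 this returns nonzero for every address that is not a
multiple of 64. For n == 32 it returns nonzero only when the offset
within the line is greater than 32.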
> But when changing the length to 63:
>
> 0000000000000000 <foo1>:
>    0:   c5 fe 6f 06             vmovdqu (%rsi),%ymm0
>    4:   c5 fe 7f 07             vmovdqu %ymm0,(%rdi)
>    8:   c5 fe 6f 4e 1f          vmovdqu 0x1f(%rsi),%ymm1
>    d:   c5 fe 7f 4f 1f          vmovdqu %ymm1,0x1f(%rdi)
>   12:   c5 f8 77                vzeroupper
>   15:   c3                      ret
>   16:   66 2e 0f 1f 84 00 00    cs nopw 0x0(%rax,%rax,1)
>   1d:   00 00 00
>
> 0000000000000020 <foo2>:
>   20:   ba 3f 00 00 00          mov    $0x3f,%edx
>   25:   e9 00 00 00 00          jmp    2a <foo2+0xa>
>
> - anton
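
The 63-byte memcpy() above is two overlapping 32-byte moves, at offsets
0 and 0x1f. As with the rescheduled 32-byte sequence earlier in the
thread, doing both loads before either store would make the same
pattern valid for memmove(). A minimal sketch in portable C follows.
The function name and the temporaries are hypothetical, and whether a
given gcc version turns this into two ymm loads followed by two ymm
stores is not something I have checked.

```c
#include <string.h>

/* Hypothetical helper, not from the thread: move n bytes, 32 <= n <= 64,
   as two possibly overlapping 32-byte chunks. Both chunks are read into
   temporaries before anything is written, so the result matches
   memmove() even when p and q overlap. */
void move_32_to_64(char *p, const char *q, size_t n)
{
    unsigned char lo[32], hi[32];

    memcpy(lo, q, 32);           /* load the first 32 bytes */
    memcpy(hi, q + n - 32, 32);  /* load the last 32 bytes; overlaps lo when n < 64 */
    memcpy(p, lo, 32);           /* store only after both loads are done */
    memcpy(p + n - 32, hi, 32);
}
```

With n == 63 the second chunk starts at offset 31 (0x1f), the same
offset as in the foo1 listing.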