From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: Floating point implementations on AMD64
Date: Sat, 20 Apr 2024 15:58:03 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Apr20.175803@mips.complang.tuwien.ac.at>
References: <2024Apr13.195518@mips.complang.tuwien.ac.at> <2024Apr14.132507@mips.complang.tuwien.ac.at> <661bdb9b$1@news.ausics.net> <2024Apr14.175340@mips.complang.tuwien.ac.at> <661c8bf4@news.ausics.net> <2024Apr15.160928@mips.complang.tuwien.ac.at> <7cdb25647298a495f7c754a2abfa69cd@www.novabbs.com> <696e375851b1434dd8763bea3a3fce77@www.novabbs.com> <2c6c4fc71ca6483ce9c05c0079a371a7@www.novabbs.com>

mhx@iae.nl (mhx) writes:
>minforth wrote:
>> But I think the main advantage lies in the possibility of parallel and/or
>> vectorized execution.
>
>I have not yet seen algorithms where that would bring something.

Matrix multiplication is an easy case.  I have also done a version of
Jon Bentley's greedy TSP program that benefitted from SSE and AVX; I
had to use assembly language to do this, however; see the thread
starting at <2016Nov14.164726@mips.complang.tuwien.ac.at>.
OTOH, yesterday I saw what gcc did for the inner loop of the bubble
benchmark from the Stanford integer benchmarks:

    while ( top>1 ) {
        i=1;
        while ( i<top ) {
            if ( sortlist[i] > sortlist[i+1] ) {
                j = sortlist[i];
                sortlist[i] = sortlist[i+1];
                sortlist[i+1] = j;
            };
            i=i+1;
        };
        top=top-1;
    };

gcc-12.2 -O1 produces straightforward scalar code, gcc-12.2 -O3 wants
to use SIMD instructions:

        gcc -O1                       gcc -O3
    1c: add    $0x4,%rax          c0: movq   (%rax),%xmm0
        cmp    %rsi,%rax              add    $0x1,%edx
        je     35                     pshufd $0xe5,%xmm0,%xmm1
    25: mov    (%rax),%edx            movd   %xmm0,%edi
        mov    0x4(%rax),%ecx         movd   %xmm1,%ecx
        cmp    %ecx,%edx              cmp    %ecx,%edi
        jle    1c                     jle    e1
        mov    %ecx,(%rax)            pshufd $0xe1,%xmm0,%xmm0
        mov    %edx,0x4(%rax)         movq   %xmm0,(%rax)
        jmp    1c                 e1: add    $0x4,%rax
    35:                               cmp    %r8d,%edx
                                      jl     c0

The version produced by gcc -O3 is almost three times slower on a
Skylake than the one by gcc -O1, and is actually slower than several
Forth systems, including gforth-fast.  I think that the reason is
that the movq towards the end stores two items, and the movq at the
start of the next iteration loads one of these items, i.e., there is
partial overlap between the store and the load.  In this case the
hardware takes a slow path, which means that the slowdown is much
bigger than the instruction count suggests.

- anton
--
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: https://forth-standard.org/
   EuroForth 2023: https://euro.theforth.net/2023