Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: Floating point implementations on AMD64
Date: Sat, 20 Apr 2024 15:58:03 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 60
Message-ID: <2024Apr20.175803@mips.complang.tuwien.ac.at>
References: <2024Apr13.195518@mips.complang.tuwien.ac.at> <2024Apr14.132507@mips.complang.tuwien.ac.at> <661bdb9b$1@news.ausics.net> <2024Apr14.175340@mips.complang.tuwien.ac.at> <661c8bf4@news.ausics.net> <2024Apr15.160928@mips.complang.tuwien.ac.at> <7cdb25647298a495f7c754a2abfa69cd@www.novabbs.com> <696e375851b1434dd8763bea3a3fce77@www.novabbs.com> <2c6c4fc71ca6483ce9c05c0079a371a7@www.novabbs.com>
Injection-Date: Sat, 20 Apr 2024 18:22:51 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="3b6c858b3e1f294aa7f65d00eb03e33b";
logging-data="3956107"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+SE9bZbKRjJnef47V6Iu4R"
Cancel-Lock: sha1:4Ae8HwkH5VnmcIbOiFhBqPzWDd4=
X-newsreader: xrn 10.11
Bytes: 3782
mhx@iae.nl (mhx) writes:
>minforth wrote:
>> But I think the main advantage lies in the possibility of parallel and/or
>> vectorized execution.
>
>I have not yet seen algorithms where that would bring something.

Matrix multiplication is an easy case.  I have also done a version of
Jon Bentley's greedy TSP program that benefitted from SSE and AVX; I
had to use assembly language to do this, however; see the thread
starting at <2016Nov14.164726@mips.complang.tuwien.ac.at>.

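To illustrate why matrix multiplication is an easy case, here is a
minimal sketch in C.  The function name matmul, the row-major layout
and the i-k-j loop order are assumptions made for this illustration,
not code from the thread.  The innermost loop walks both b and c with
stride 1, which is the access pattern that maps most directly onto
SIMD loads and stores; whether a particular compiler actually
vectorizes it depends on the compiler version and flags.

```c
#include <assert.h>
#include <stddef.h>

/* c = a * b for n x n row-major matrices of doubles.
   Hypothetical helper for illustration only. */
void matmul(size_t n, const double *a, const double *b, double *c)
{
    assert(c != a && c != b);   /* the result must not alias an input */
    for (size_t i = 0; i < n; i++) {
        /* clear row i of the result */
        for (size_t j = 0; j < n; j++)
            c[i*n + j] = 0.0;
        for (size_t k = 0; k < n; k++) {
            double aik = a[i*n + k];
            /* stride-1 accesses to b and c, independent across j */
            for (size_t j = 0; j < n; j++)
                c[i*n + j] += aik * b[k*n + j];
        }
    }
}
```

Each iteration of the j loop is independent of the others, so several
of them can in principle be computed by one SIMD instruction.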
OTOH, yesterday I saw what gcc did for the inner loop of the bubble
benchmark from the Stanford integer benchmarks:

  while ( i<top ) {
    if ( sortlist[i] > sortlist[i+1] ) {
      j = sortlist[i];
      sortlist[i] = sortlist[i+1];
      sortlist[i+1] = j;
    };
    i=i+1;
  };
  top=top-1;
};

gcc-12.2 -O1 produces straightforward scalar code, gcc-12.2 -O3 wants
to use SIMD instructions:

gcc -O1                     gcc -O3
1c: add  $0x4,%rax          c0: movq   (%rax),%xmm0
    cmp  %rsi,%rax              add    $0x1,%edx
    je   35                     pshufd $0xe5,%xmm0,%xmm1
25: mov  (%rax),%edx            movd   %xmm0,%edi
    mov  0x4(%rax),%ecx         movd   %xmm1,%ecx
    cmp  %ecx,%edx              cmp    %ecx,%edi
    jle  1c                     jle    e1
    mov  %ecx,(%rax)            pshufd $0xe1,%xmm0,%xmm0
    mov  %edx,0x4(%rax)         movq   %xmm0,(%rax)
    jmp  1c                 e1: add    $0x4,%rax
35:                             cmp    %r8d,%edx
                                jl     c0

The version produced by gcc -O3 is almost three times slower on a
Skylake than the one produced by gcc -O1, and is actually slower than
several Forth systems, including gforth-fast.  I think that the reason
is that the movq towards the end stores two items, and the movq at the
start of the next iteration loads one of these items, i.e., there is
partial overlap between the store and the load.  In this case the
hardware takes a slow path, which means that the slowdown is much
bigger than the instruction count suggests.
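
The access pattern can be sketched in portable C as follows.  This is
an illustration under assumptions, not the benchmark source: the
function names are made up, the array is indexed from 0 rather than
from 1, and memcpy stands in for the 8-byte movq load and store.  A
compiler may or may not turn each memcpy into a single 8-byte move, so
the sketch shows which bytes are touched, not the timing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* One bubble pass over a[0..n-1].  Each step loads the pair
   a[i], a[i+1] with one 8-byte copy and, if the pair is out of order,
   writes both items back with one 8-byte copy.
   Hypothetical helper for illustration only. */
void bubble_pass_pairwise(int32_t *a, int n)
{
    for (int i = 0; i + 1 < n; i++) {
        int32_t pair[2];
        memcpy(pair, &a[i], sizeof pair);            /* 8-byte load */
        if (pair[0] > pair[1]) {
            int32_t swapped[2] = { pair[1], pair[0] };
            memcpy(&a[i], swapped, sizeof swapped);  /* 8-byte store */
        }
        /* The next load starts 4 bytes further on: it covers the upper
           half of the store above plus 4 bytes that were not stored. */
    }
}

/* Repeated passes over a shrinking prefix, as in the loop quoted above. */
void bubble_sort_pairwise(int32_t *a, int n)
{
    assert(n >= 0);
    for (int top = n; top > 1; top--)
        bubble_pass_pairwise(a, top);
}
```

The store in iteration i writes bytes 4i to 4i+7, and the load in
iteration i+1 reads bytes 4i+4 to 4i+11, so only half of the loaded
data comes from the preceding store.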
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2023: https://euro.theforth.net/2023