Path: nntp.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: peter <peter.noreply@tin.it>
Newsgroups: comp.lang.forth
Subject: Re: Vector sum (was: Parsing timestamps?)
Date: Sat, 19 Jul 2025 15:24:48 +0200
Organization: A noiseless patient Spider
Lines: 144
Message-ID: <20250719152448.0000757a@tin.it>
References: <1f433fabcb4d053d16cbc098dedc6c370608ac01@i2pn2.org>
<2025Jul14.080413@mips.complang.tuwien.ac.at>
<063d4a116fb394a776b1e9313f9903cf@www.novabbs.com>
<2025Jul14.095004@mips.complang.tuwien.ac.at>
<a449857495e02b4d35627f9f31d37fd8@www.novabbs.com>
<2025Jul16.132504@mips.complang.tuwien.ac.at>
<2025Jul16.173926@mips.complang.tuwien.ac.at>
<20250717101400.000074f9@tin.it>
<2025Jul17.145429@mips.complang.tuwien.ac.at>
<20250717224825.00007b8c@tin.it>
<2025Jul19.121815@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 19 Jul 2025 15:24:49 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="15beb201350ca6d14c7596c65cf3cde1";
logging-data="2900934"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+LxnaE4NCYG4+6k8T4DkngVqqh5JVEoo8="
Cancel-Lock: sha1:uRs6zKOdxVrj60kabpAWHiUNB+k=
X-Newsreader: Claws Mail 4.3.0 (GTK 3.24.42; x86_64-w64-mingw32)
On Sat, 19 Jul 2025 10:18:15 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> peter <peter.noreply@tin.it> writes:
> >I did a test coding the sum128 as a code word with avx-512 instructions
> >and got the following results
> >
> > 285,584,376 cycles:u
> > 941,856,077 instructions:u
> >
> >timing was
> >timer-reset ' recursive-sum bench .elapsed 51 ms elapsed
> >
> >so half the time of the original recursive.
> >with 32 zmm registers I could have done a sum256 also
>
> One could do sum128 with just 8 registers by performing the adds ASAP,
> i.e., for sum32
>
> vmovapd zmm0, [rbx]
> vmovapd zmm1, [rbx+64]
> vaddpd zmm0, zmm0, zmm1
> vmovapd zmm1, [rbx+128]
> vmovapd zmm2, [rbx+192]
> vaddpd zmm1, zmm1, zmm2
> vaddpd zmm0, zmm0, zmm1
> ; and then the Horizontal sum
>
> And you can code this as:
>
> vmovapd zmm0, [rbx]
> vaddpd zmm0, zmm0, [rbx+64]
> vmovapd zmm1, [rbx+128]
> vaddpd zmm1, zmm1, [rbx+192]
> vaddpd zmm0, zmm0, zmm1
> ; and then the Horizontal sum
>
> >; Horizontal sum of zmm0
> >
> >vextractf64x4 ymm1, zmm0, 1
> >vaddpd ymm2, ymm1, ymm0
> >
> >vextractf64x2 xmm3, ymm2, 1
> >vaddpd ymm4, ymm3, ymm2
> >
> >vhaddpd xmm0, xmm4, xmm4
The SIMD instructions can also take a memory operand,
so I can do sum128 as
code asum128b
movsd [r13-0x8], xmm0
lea r13, [r13-0x8]
vmovapd zmm0, [rbx]
vaddpd zmm0, zmm0, [rbx+64]
vaddpd zmm0, zmm0, [rbx+128]
vaddpd zmm0, zmm0, [rbx+192]
vaddpd zmm0, zmm0, [rbx+256]
vaddpd zmm0, zmm0, [rbx+320]
vaddpd zmm0, zmm0, [rbx+384]
vaddpd zmm0, zmm0, [rbx+448]
vaddpd zmm0, zmm0, [rbx+512]
vaddpd zmm0, zmm0, [rbx+576]
vaddpd zmm0, zmm0, [rbx+640]
vaddpd zmm0, zmm0, [rbx+704]
vaddpd zmm0, zmm0, [rbx+768]
vaddpd zmm0, zmm0, [rbx+832]
vaddpd zmm0, zmm0, [rbx+896]
vaddpd zmm0, zmm0, [rbx+960]
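; Horizontal sum of zmm0
vextractf64x4 ymm1, zmm0, 1 ; ymm1 = upper 4 doubles of zmm0
vaddpd ymm2, ymm1, ymm0 ; 8 -> 4 partial sums
vextractf64x2 xmm3, ymm2, 1 ; xmm3 = upper 2 doubles of ymm2
vaddpd ymm4, ymm3, ymm2 ; 4 -> 2 partial sums in the low 128 bits
vpermilpd xmm5, xmm4, 1 ; swap the two doubles of xmm4
vaddsd xmm0, xmm4, xmm5 ; final scalar sum in xmm0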
ret
end-code
This compiles to 154 bytes and 25 instructions.
The original sum128 is 2157 bytes and 513 instructions!
Yes, the horizontal sum should be done just once.
So far I have only replaced sum128 with SIMD code as a test.
Later I will do a complete example.
This asum128b does not change the timing compared with my
first AVX-512 version, but it reduces the number of instructions:
277,333,790 cycles:u
834,846,183 instructions:u # 3.01 insn per cycle
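For readers without an AVX-512 assembler at hand, the lane arithmetic of asum128b can be modelled in portable C. This is only a sketch under stated assumptions: the name sum128_model and the constants LANES and CHUNKS are made up for illustration, the code is scalar (it does not use SIMD instructions), and it mirrors only the order of the additions, not the stack handling or the timing of the code word.

```c
#include <stddef.h>

#define LANES  8   /* doubles per 512-bit zmm register */
#define CHUNKS 16  /* 16 chunks * 8 lanes = 128 doubles */

/* Scalar model of asum128b (hypothetical helper, not part of lxf64):
   sums 128 doubles starting at p in the same order as the code word. */
double sum128_model(const double *p)
{
    double acc[LANES];

    /* models: vmovapd zmm0, [rbx] */
    for (int l = 0; l < LANES; l++)
        acc[l] = p[l];

    /* models: 15 x vaddpd zmm0, zmm0, [rbx + 64*k] */
    for (int k = 1; k < CHUNKS; k++)
        for (int l = 0; l < LANES; l++)
            acc[l] += p[k * LANES + l];

    /* horizontal sum, step 1: upper half onto lower half, 8 -> 4 */
    double q[4];
    for (int l = 0; l < 4; l++)
        q[l] = acc[l + 4] + acc[l];

    /* step 2: 4 -> 2 */
    double d0 = q[2] + q[0];
    double d1 = q[3] + q[1];

    /* step 3: models vpermilpd + vaddsd, 2 -> 1 */
    return d0 + d1;
}
```

Because the additions are grouped by lane, the result can differ in the last bits from a strictly left-to-right scalar sum for general floating-point data; for small integer-valued inputs both give the same value.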
>
> Instead of doing the horizontal sum once for every sum128, it might be
> more efficient (assuming the whole thing is not
> cache-bandwidth-limited) to have the result of sum128 be a full SIMD
> width, and then add them up with vaddpd instead of addsd, and do the
> horizontal sum once in the end.
>
> But if the recursive part is to be programmed in Forth, we would need
> a way to represent a SIMD width of data in Forth, maybe with a SIMD
> stack. I see a few problems there:
>
> * What to do about the mask registers of AVX-512? In the RISC-V
> vector extension masks are stored in regular SIMD registers.
>
> * There is a trend visible in ARM SVE and the RISC-V Vector extension
> to have support for dealing with loops across longer vectors. Do we
> also need to support something like that?
>
> For the RISC-V vector extension, see
> <https://riscv.org/wp-content/uploads/2024/12/15.20-15.55-18.05.06.VEXT-bcn-v1.pdf>
>
> One way to deal with all that would be to have a long-vector stack and
> have something like my vector wordset
> <https://github.com/AntonErtl/vectors>, where the sum of a vector
> would be a word that is implemented in some lower-level way (e.g.,
> assembly language); the sum of a vector is actually a planned, but not
> yet existing feature of this wordset.
>
> An advantage of having a (short) SIMD stack would be that one could
> use SIMD operations for other uses where the long-vector wordset looks
> too heavy-weight (or would need optimizations to get rid of the
> long-vector overhead). The question is if enough such uses exist to
> justify adding such a stack.
>
> - anton
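The suggestion quoted above, to keep a full SIMD width as the result of sum128 and do the horizontal sum only once at the end, can be sketched in portable C as follows. This is a scalar model, not AVX-512 code, and it makes some assumptions: the names add128_lanes, hsum8 and sum_blocks are hypothetical, and a plain loop over 128-element blocks stands in for the recursive part, which would look different in Forth.

```c
#include <stddef.h>

enum { LANES = 8, BLOCK = 128 };

/* Add one block of 128 doubles lane-wise into the 8 accumulators.
   Models sixteen vaddpd instructions into one zmm register. */
void add128_lanes(double acc[LANES], const double *p)
{
    for (size_t i = 0; i < BLOCK; i++)
        acc[i % LANES] += p[i];
}

/* Horizontal sum of the 8 lanes: 8 -> 4 -> 2 -> 1. */
double hsum8(const double acc[LANES])
{
    double q0 = acc[4] + acc[0];
    double q1 = acc[5] + acc[1];
    double q2 = acc[6] + acc[2];
    double q3 = acc[7] + acc[3];
    return (q2 + q0) + (q3 + q1);
}

/* Sum nblocks consecutive blocks; the horizontal sum runs once. */
double sum_blocks(const double *p, size_t nblocks)
{
    double acc[LANES] = {0};

    for (size_t b = 0; b < nblocks; b++)
        add128_lanes(acc, p + b * BLOCK);

    return hsum8(acc);
}
```

In this model the per-block work is only lane-wise additions, so the six-instruction horizontal reduction is paid once per call of sum_blocks instead of once per 128 elements. Whether that helps in practice depends on whether the loop is limited by cache bandwidth, as noted in the quoted text.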
I will take a look at your vector implementation and see if it can be used
in lxf64.
BR
Peter