Path: nntp.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: peter <peter.noreply@tin.it>
Newsgroups: comp.lang.forth
Subject: Re: Vector sum (was: Parsing timestamps?)
Date: Sat, 19 Jul 2025 15:24:48 +0200
Organization: A noiseless patient Spider
Lines: 144
Message-ID: <20250719152448.0000757a@tin.it>
References: <1f433fabcb4d053d16cbc098dedc6c370608ac01@i2pn2.org>
	<2025Jul14.080413@mips.complang.tuwien.ac.at>
	<063d4a116fb394a776b1e9313f9903cf@www.novabbs.com>
	<2025Jul14.095004@mips.complang.tuwien.ac.at>
	<a449857495e02b4d35627f9f31d37fd8@www.novabbs.com>
	<2025Jul16.132504@mips.complang.tuwien.ac.at>
	<2025Jul16.173926@mips.complang.tuwien.ac.at>
	<20250717101400.000074f9@tin.it>
	<2025Jul17.145429@mips.complang.tuwien.ac.at>
	<20250717224825.00007b8c@tin.it>
	<2025Jul19.121815@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Sat, 19 Jul 2025 15:24:49 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="15beb201350ca6d14c7596c65cf3cde1";
	logging-data="2900934"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+LxnaE4NCYG4+6k8T4DkngVqqh5JVEoo8="
Cancel-Lock: sha1:uRs6zKOdxVrj60kabpAWHiUNB+k=
X-Newsreader: Claws Mail 4.3.0 (GTK 3.24.42; x86_64-w64-mingw32)

On Sat, 19 Jul 2025 10:18:15 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> peter <peter.noreply@tin.it> writes:
> >I did a test coding the sum128 as a code word with avx-512 instructions
> >and got the following results
> >
> >       285,584,376      cycles:u
> >       941,856,077      instructions:u 
> >
> >timing was
> >timer-reset ' recursive-sum bench .elapsed 51 ms elapsed
> >
> >so half the time of the original recursive.
> >with 32 zmm registers I could have done a sum256 also
> 
> One could do sum128 with just 8 registers by performing the adds ASAP,
> i.e., for sum32
> 
> vmovapd   zmm0,  [rbx]
> vmovapd   zmm1,  [rbx+64]
> vaddpd  zmm0, zmm0, zmm1
> vmovapd   zmm1,  [rbx+128]
> vmovapd   zmm2,  [rbx+192]
> vaddpd  zmm1, zmm1, zmm2
> vaddpd  zmm0, zmm0, zmm1
> ; and then the Horizontal sum
> 
> And you can code this as:
> 
> vmovapd   zmm0,  [rbx]
> vaddpd  zmm0, zmm0, [rbx+64]
> vmovapd   zmm1,  [rbx+128]
> vaddpd  zmm1, zmm1, [rbx+192]
> vaddpd  zmm0, zmm0, zmm1
> ; and then the Horizontal sum
> 
> >; Horizontal sum of zmm0
> >
> >vextractf64x4 ymm1, zmm0, 1        
> >vaddpd ymm2, ymm1, ymm0            
> >
> >vextractf64x2 xmm3, ymm2, 1
> >vaddpd ymm4, ymm3, ymm2
> >
> >vhaddpd xmm0, xmm4, xmm4

The SIMD instructions can also take a memory operand, so
I can do sum128 as

code asum128b

movsd [r13-0x8], xmm0
lea r13, [r13-0x8]

vmovapd zmm0,  [rbx]
vaddpd  zmm0, zmm0,  [rbx+64]
vaddpd  zmm0, zmm0,  [rbx+128]
vaddpd  zmm0, zmm0,  [rbx+192]
vaddpd  zmm0, zmm0,  [rbx+256]
vaddpd  zmm0, zmm0,  [rbx+320]
vaddpd  zmm0, zmm0,  [rbx+384]
vaddpd  zmm0, zmm0,  [rbx+448]
vaddpd  zmm0, zmm0,  [rbx+512]
vaddpd  zmm0, zmm0,  [rbx+576]
vaddpd  zmm0, zmm0,  [rbx+640]
vaddpd  zmm0, zmm0,  [rbx+704]
vaddpd  zmm0, zmm0,  [rbx+768]
vaddpd  zmm0, zmm0,  [rbx+832]
vaddpd  zmm0, zmm0,  [rbx+896]
vaddpd  zmm0, zmm0,  [rbx+960]


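; Horizontal sum of zmm0

; fold the upper 256 bits onto the lower: 8 lanes -> 4
vextractf64x4 ymm1, zmm0, 1
vaddpd ymm2, ymm1, ymm0

; fold the upper 128 bits onto the lower: 4 lanes -> 2
vextractf64x2 xmm3, ymm2, 1
vaddpd ymm4, ymm3, ymm2

; swap the two remaining lanes and add: 2 lanes -> 1, result in xmm0
vpermilpd  xmm5, xmm4, 1
vaddsd xmm0, xmm4, xmm5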


ret
end-code

This compiles to 154 bytes and 25 instructions.
The original sum128 is 2157 bytes and 513 instructions!
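
For readers who do not read AVX-512 assembly, here is a minimal scalar
sketch in C of what asum128b computes. It only models the lane
arithmetic and is not the lxf64 code: the name asum128_model and the
constants LANES and BLOCK are made up for illustration. Each 512-bit
register is treated as an array of 8 doubles.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 8, BLOCK = 128 };  /* 8 doubles per 512-bit register */

/* Scalar model (hypothetical name) of asum128b: sum 128 doubles at x. */
double asum128_model(const double *x)
{
    double acc[LANES];
    assert(x != NULL);

    /* models: vmovapd zmm0, [rbx] */
    for (size_t i = 0; i < LANES; i++)
        acc[i] = x[i];

    /* models the 15 instructions: vaddpd zmm0, zmm0, [rbx+64*k] */
    for (size_t k = 1; k < BLOCK / LANES; k++)
        for (size_t i = 0; i < LANES; i++)
            acc[i] += x[k * LANES + i];

    /* models the horizontal sum: 8 -> 4 -> 2 -> 1 lanes */
    for (size_t width = LANES / 2; width >= 1; width /= 2)
        for (size_t i = 0; i < width; i++)
            acc[i] += acc[i + width];

    return acc[0];
}
```

Note that the additions happen in a different order than in a plain
left-to-right scalar loop, so for general floating-point input the
result can differ in the last bits.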

Yes, the horizontal sum should be done just once.
I have only replaced sum128 with SIMD as a test.
Later I will do a complete example.
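
As a rough sketch of the "horizontal sum only once" idea, again as a
scalar C model and not as the complete example mentioned above: keep a
full-width accumulator across all blocks and reduce it at the very end.
The function name sum_reduce_once is hypothetical, the loop is flat
rather than recursive, and the length is assumed to be a multiple of 8.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 8 };  /* 8 doubles per 512-bit register */

/* Sum n doubles; this sketch assumes n is a multiple of LANES. */
double sum_reduce_once(const double *x, size_t n)
{
    double acc[LANES] = {0};
    assert(x != NULL && n % LANES == 0);

    /* lane-wise accumulation only (models vaddpd); no per-block addsd */
    for (size_t k = 0; k < n; k += LANES)
        for (size_t i = 0; i < LANES; i++)
            acc[i] += x[k + i];

    /* the single horizontal sum at the end: 8 -> 4 -> 2 -> 1 lanes */
    for (size_t width = LANES / 2; width >= 1; width /= 2)
        for (size_t i = 0; i < width; i++)
            acc[i] += acc[i + width];

    return acc[0];
}
```

Whether this pays off in practice depends on whether the loop is
limited by cache bandwidth, which the sketch says nothing about.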

This asum128b does not change the timing, but it reduces
the number of instructions:

       277,333,790      cycles:u
       834,846,183      instructions:u    #    3.01  insn per cycle


> 
> Instead of doing the horizontal sum once for every sum128, it might be
> more efficient (assuming the whole thing is not
> cache-bandwidth-limited) to have the result of sum128 be a full SIMD
> width, and then add them up with vaddpd instead of addsd, and do the
> horizontal sum once in the end.
> 
> But if the recursive part is to be programmed in Forth, we would need
> a way to represent a SIMD width of data in Forth, maybe with a SIMD
> stack.  I see a few problems there:
> 
> * What to do about the mask registers of AVX-512?  In the RISC-V
>   vector extension masks are stored in regular SIMD registers.
> 
> * There is a trend visible in ARM SVE and the RISC-V Vector extension
>   to have support for dealing with loops across longer vectors.  Do we
>   also need to support something like that?
> 
> For the RISC-V vector extension, see
> <https://riscv.org/wp-content/uploads/2024/12/15.20-15.55-18.05.06.VEXT-bcn-v1.pdf>
> 
> One way to deal with all that would be to have a long-vector stack and
> have something like my vector wordset
> <https://github.com/AntonErtl/vectors>, where the sum of a vector
> would be a word that is implemented in some lower-level way (e.g.,
> assembly language); the sum of a vector is actually a planned, but not
> yet existing feature of this wordset.
> 
> An advantage of having a (short) SIMD stack would be that one could
> use SIMD operations for other uses where the long-vector wordset looks
> too heavy-weight (or would need optimizations to get rid of the
> long-vector overhead).  The question is if enough such uses exist to
> justify adding such a stack.
> 
> - anton

I will take a look at your vector implementation and see if it can be used
in lxf64.

BR
Peter