From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Mon, 15 Apr 2024 20:55:53 +0000
Organization: Rocksolid Light
Message-ID: <983c789e7c6d9f3ca4ffe40fdc3aa709@www.novabbs.org>
References: <uuk100$inj$1@dont-email.me> <2024Apr3.192405@mips.complang.tuwien.ac.at> <86d1dd03deee83e339afa725524ab259@www.novabbs.org> <uvimv7$629s$1@dont-email.me>

Terje Mathisen wrote:

> MitchAlsup1 wrote:
>> 

> In the non-OoO (i.e., Pentium) days, I would have inverted the loop in 
> order to hide the latencies as much as possible, resulting in an inner 
> loop something like this:

>   next:
>    adc eax,ebx
>    mov ebx,[edx+ecx*4]	; First cycle

>    mov [edi+ecx*4],eax
>    mov eax,[esi+ecx*4]	; Second cycle

>    inc ecx
>    jnz next		; Third cycle

> Terje

As opposed to::

     .global mpn_add_n
mpn_add_n:
     MOV   R5,#0     // c
     MOV   R6,#0     // i

     VEC   R7,{}
     LDD   R8,[R2,R6<<3]       // Load 128-to-512 bits
     LDD   R9,[R3,R6<<3]       // Load 128-to-512 bits
     CARRY R5,{{IO}}
     ADD   R10,R8,R9           // Add pair to add octal
     STD   R10,[R1,R6<<3]      // Store 128-to-512 bits
     LOOP  LT,R6,#1,R4         // increment 2-to-8 times
     RET
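
For readers who do not speak this assembler, here is a minimal C sketch of
what the loop above computes. It is a reference for the semantics only, not
the code anyone in this thread posted. The name ref_add_n and the register
mapping (R1 = result pointer, R2 and R3 = source pointers, R4 = limb count,
R5 = carry, R6 = index) are my reading of the listing, so treat them as
assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference routine: rp[i] = up[i] + vp[i] + carry, one
   64-bit limb at a time. Returns the carry out of the top limb (0 or 1). */
uint64_t ref_add_n(uint64_t *rp, const uint64_t *up,
                   const uint64_t *vp, size_t n)
{
    assert(rp != NULL && up != NULL && vp != NULL);

    uint64_t carry = 0;                 /* plays the role of R5 */
    for (size_t i = 0; i < n; i++) {    /* i plays the role of R6 */
        uint64_t sum = up[i] + vp[i];   /* may wrap around */
        uint64_t c1  = sum < up[i];     /* carry out of the limb add */
        uint64_t res = sum + carry;     /* fold in the incoming carry */
        uint64_t c2  = res < sum;       /* carry out of that increment */
        rp[i] = res;
        carry = c1 | c2;                /* at most one of the two is set */
    }
    return carry;
}
```

The two wraparound tests stand in for what CARRY R5,{{IO}} expresses in a
single instruction modifier: carry in and carry out threaded through the ADD.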

--------------------------------------------------------

     LDD   R8,[R2,R6<<3]       // AGEN cycle 1
     LDD   R9,[R3,R6<<3]       // AGEN cycle 2 data cycle 4
     CARRY R5,{{IO}}
     ADD   R10,R8,R9           // cycle 4
     STD   R10,[R1,R6<<3]      // AGEN cycle 3 write cycle 5
     LOOP  LT,R6,#1,R4         // cycle 3

OR

     LDD       LDd
          LDD       LDd 
                    ADD
               ST        STd
               LOOP
                    LDD       LDd
                         LDD       LDd 
                                   ADD
                              ST        STd
                              LOOP

10 instructions (2 iterations) in 4 clocks on a 64-bit, 1-wide VVM machine,
without code-scheduling heroics!!

40 instructions (8 iterations) in 4 clocks on a 512-bit-wide SIMD VVM machine!!
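
To spell out the arithmetic behind those two figures, here is a small C
sketch. It only divides the numbers quoted above (instructions, iterations,
clocks). The helper names are made up for illustration, and counting one
64-bit limb per iteration is my assumption.

```c
/* Hypothetical helpers: plain ratios of the figures stated in the post. */

/* Effective instructions retired per clock. */
double instr_per_clock(unsigned instrs, unsigned clocks)
{
    return (double)instrs / (double)clocks;
}

/* Clocks spent per loop iteration (assumed: one 64-bit limb each). */
double clocks_per_iteration(unsigned clocks, unsigned iterations)
{
    return (double)clocks / (double)iterations;
}
```

With the quoted numbers, instr_per_clock(10, 4) gives 2.5 and
clocks_per_iteration(4, 2) gives 2 for the 64-bit machine, while
instr_per_clock(40, 4) gives 10 and clocks_per_iteration(4, 8) gives 0.5
for the 512-bit case.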