Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Terje Mathisen <terje.mathisen@tmsw.no>
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Mon, 15 Apr 2024 11:16:15 +0200
Organization: A noiseless patient Spider
Lines: 127
Message-ID: <uvir8v$6ua2$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
 <2024Apr3.192405@mips.complang.tuwien.ac.at>
 <86d1dd03deee83e339afa725524ab259@www.novabbs.org>
 <uvimv7$629s$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Injection-Date: Mon, 15 Apr 2024 11:16:15 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="9d6aa58f39643660529f6affbcde0704";
	logging-data="227650"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+XOWurrERRSJPBlkLLdKolSOd5ddwLJK6xGdRMMxwg7Q=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
 Firefox/91.0 SeaMonkey/2.53.18.2
Cancel-Lock: sha1:N481Kg8dt1yuHmQ5US66IxCdhlQ=
In-Reply-To: <uvimv7$629s$1@dont-email.me>
Bytes: 5761

Terje Mathisen wrote:
> MitchAlsup1 wrote:
>> Anton Ertl wrote:
>>
>>> I have a similar problem for the carry and overflow bits in
>>> < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
>>> let those bits not survive across calls; if there was a cheap solution
>>> for the problem, it would eliminate this drawback of my idea.
>>
>> My 66000 ISA can encode the mpn_add_n() inner loop in 5 instructions
>> whereas RISC-V encodes the inner loop in 11 instructions.
>>
>> Source code:
>>
>> void mpn_add_n( uint64_t *sum, uint64_t *a, uint64_t *b, int n )
>> {
>>      uint64_t c = 0;
>>      for( int i = 0; i < n; i++ )
>>      {
>>           {c, sum[i]} = a[i] + b[i] + c;
>>      }
>>      return;
>> }
>>
>> Assembly code:
>>
>>      .global mpn_add_n
>> mpn_add_n:
>>      MOV   R5,#0     // c
>>      MOV   R6,#0     // i
>>
>>      VEC   R7,{}
>>      LDD   R8,[R2,Ri<<3]
>>      LDD   R9,[R3,Ri<<3]
>>      CARRY R5,{{IO}}
>>      ADD   R10,R8,R9
>>      STD   R10,[R1,Ri<<3]
>>      LOOP  LT,R6,#1,R4
>>      RET
>>
>> So, adding a few "bells and whistles" to RISC-V does give you a
>> performance gain (1.38×); using a well designed ISA gives you a
>> performance gain of 2.00× !! {{moral: don't stop too early}}
>>
>> Note that all the register bookkeeping has disappeared !! because
>> of the indexed memory reference form.
>>
>> As I count executing instructions, VEC does not execute, nor does
>> CARRY--CARRY causes the subsequent ADD to take C input as carry and
>> the carry produced by ADD goes back in C. LOOP performs the
>> ADD-CMP-BC sequence in a single instruction and in a single clock.
>
>    ; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
>    xor rax,rax ;; Clear carry
> next:
>    mov rax,[rsi+rcx*8]
>    adc rax,[rdx+rcx*8]
>    mov [rdi+rcx*8],rax
>    inc rcx
>     jnz next
>
> The code above is 5 instructions, or 6 if we avoid the load-op, doing
> two loads and one store, so it should only be limited by the latency of
> the ADC, i.e. one or two cycles.
>
> In the non-OoO (i.e. Pentium) days, I would have inverted the loop in
> order to hide the latencies as much as possible, resulting in an inner
> loop something like this:
>
>   next:
>    adc eax,ebx
>    mov ebx,[edx+ecx*4]    ; First cycle
>
>    mov [edi+ecx*4],eax
>    mov eax,[esi+ecx*4]    ; Second cycle
>
>    inc ecx
>     jnz next        ; Third cycle
>

In the same bad old days, the standard way to speed it up would have
been unrolling, but until we got more registers, that would have run
out of steam very quickly. With AVX2 we could use four 64-bit slots in
a 32-byte register, but then we would have needed to handle the carry
propagation manually, and that would take longer than a series of
ADC/ADX instructions.

next4:
   mov eax,[esi]
   adc eax,[esi+edx]
   mov [esi+edi],eax
   mov eax,[esi+4]
   adc eax,[esi+edx+4]
   mov [esi+edi+4],eax
   mov eax,[esi+8]
   adc eax,[esi+edx+8]
   mov [esi+edi+8],eax
   mov eax,[esi+12]
   adc eax,[esi+edx+12]
   mov [esi+edi+12],eax
   lea esi,[esi+16]
   dec ecx
    jnz next4
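
To make the point about manual carry handling concrete, here is a
rough sketch in plain C. It is only a model: there are no AVX2
intrinsics in it, the function names add_n_ref and add_n_lanes4 are
made up for illustration, and the four-limb group stands in for one
32-byte register. The first function is the single carry chain that
ADC gives you for free; the second does four independent lane adds and
then has to patch the carries in afterwards.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: one carry chain, the job ADC does in hardware.
   Returns the carry out of the top limb. */
uint64_t add_n_ref(uint64_t *sum, const uint64_t *a,
                   const uint64_t *b, size_t n)
{
    uint64_t c = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t t  = a[i] + c;   /* wraps only if a[i] is all ones and c == 1 */
        uint64_t c1 = t < c;      /* carry out of a[i] + c */
        uint64_t s  = t + b[i];
        uint64_t c2 = s < t;      /* carry out of t + b[i] */
        sum[i] = s;
        c = c1 | c2;              /* the two carries can never both be set */
    }
    return c;
}

/* Hypothetical model of a 4 x 64-bit vector add (assumption: n is a
   multiple of 4). The lane sums ignore carries, so each group needs a
   fix-up pass that is still a serial chain across the lanes. */
uint64_t add_n_lanes4(uint64_t *sum, const uint64_t *a,
                      const uint64_t *b, size_t n)
{
    uint64_t c = 0;
    for (size_t i = 0; i < n; i += 4) {
        uint64_t s[4];
        for (int k = 0; k < 4; k++)        /* the "vector add": no carries */
            s[k] = a[i + k] + b[i + k];
        for (int k = 0; k < 4; k++) {      /* manual carry fix-up */
            uint64_t g = s[k] < a[i + k];  /* this lane generated a carry */
            uint64_t r = s[k] + c;         /* add the incoming carry */
            uint64_t p = r < c;            /* all-ones lane passes the carry on */
            sum[i + k] = r;
            c = g | p;
        }
    }
    return c;
}
```

Both functions produce the same limbs and the same final carry; the
difference is that the second one needs a compare, an add and another
compare per lane on top of the vector add, which is the extra work
that a plain ADC chain avoids.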

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"