Article <uvimv7$629s$1@dont-email.me>

Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Terje Mathisen <terje.mathisen@tmsw.no>
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Mon, 15 Apr 2024 10:02:46 +0200
Organization: A noiseless patient Spider
Lines: 89
Message-ID: <uvimv7$629s$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
 <2024Apr3.192405@mips.complang.tuwien.ac.at>
 <86d1dd03deee83e339afa725524ab259@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: quoted-printable
Injection-Date: Mon, 15 Apr 2024 10:02:47 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="9d6aa58f39643660529f6affbcde0704";
	logging-data="198972"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18Q8obOmFV0bmQWPLwD+xHwmwY2B6ywk06mpuem5bpdhg=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
 Firefox/91.0 SeaMonkey/2.53.18.2
Cancel-Lock: sha1:iik6fZTZy6fBNvgzpmfOpcp6JM4=
In-Reply-To: <86d1dd03deee83e339afa725524ab259@www.novabbs.org>
Bytes: 4132

MitchAlsup1 wrote:
> Anton Ertl wrote:
>
>> I have a similar problem for the carry and overflow bits in
>> < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose to
>> let those bits not survive across calls; if there was a cheap solution
>> for the problem, it would eliminate this drawback of my idea.
>
> My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions
> whereas RISC-V encodes the inner loop in 11 instructions.
>
> Source code:
>
> void mpn_add_n( uint64_t *sum, uint64_t *a, uint64_t *b, int n )
> {
>      uint64_t c = 0;
>      for( int i = 0; i < n; i++ )
>      {
>          {c, sum[i]} = a[i] + b[i] + c;
>      }
>      return;
> }
>
> Assembly code::
>
>      .global mpn_add_n
> mpn_add_n:
>      MOV   R5,#0     // c
>      MOV   R6,#0     // i
>
>      VEC   R7,{}
>      LDD   R8,[R2,Ri<<3]
>      LDD   R9,[R3,Ri<<3]
>      CARRY R5,{{IO}}
>      ADD   R10,R8,R9
>      STD   R10,[R1,Ri<<3]
>      LOOP  LT,R6,#1,R4
>      RET
>
> So, adding a few "bells and whistles" to RISC-V does give you a
> performance gain (1.38×); using a well designed ISA gives you a
> performance gain of 2.00× !! {{moral: don't stop too early}}
>
> Note that all the register bookkeeping has disappeared !! because
> of the indexed memory reference form.
>=20
> As I count executing instructions, VEC does not execute, nor does
> CARRY--CARRY causes the subsequent ADD to take C input as carry and
> the carry produced by ADD goes back in C. Loop performs the ADD-CMP-
> BC sequence in a single instruction and in a single clock.

   ; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=-n
   xor rax,rax ;; Clear carry
next:
   mov rax,[rsi+rcx*8]
   adc rax,[rdx+rcx*8]
   mov [rdi+rcx*8],rax
   inc rcx
    jnz next

The code above is 5 instructions, or 6 if we avoid the load-op, doing
two loads and one store, so it should only be limited by the latency of
the ADC, i.e. one or two cycles.
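
For comparison, here is a minimal portable C sketch of the same limb-by-limb
add. It is an illustration, not code from either post: plain C has no carry
flag, so the carry is recomputed with unsigned compares, which is roughly the
bookkeeping that ADC hides (the x86 loop also relies on INC leaving the carry
flag untouched). Returning the final carry and taking n as a size_t are
assumptions made for the example; the quoted source discards the carry.

```c
#include <stddef.h>
#include <stdint.h>

/* Adds the n-limb numbers a and b into sum, least significant limb first.
   Returns the carry out of the top limb (an assumption for this sketch). */
uint64_t mpn_add_n(uint64_t *sum, const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t c = 0;                 /* carry in, always 0 or 1 */
    for (size_t i = 0; i < n; i++) {
        uint64_t t  = a[i] + c;     /* can wrap only when c == 1 */
        uint64_t c1 = t < c;        /* carry out of a[i] + c */
        uint64_t s  = t + b[i];
        uint64_t c2 = s < t;        /* carry out of t + b[i] */
        sum[i] = s;
        c = c1 | c2;                /* at most one of the two is set */
    }
    return c;
}
```

Whether a compiler turns this pattern back into an ADC chain depends on the
compiler and target, so treat it as a reference for the semantics rather than
a performance claim.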

In the non-OoO (i.e. Pentium) days, I would have inverted the loop in
order to hide the latencies as much as possible, resulting in an inner
loop something like this:

  next:
   adc eax,ebx
   mov ebx,[edx+ecx*4]	; First cycle

   mov [edi+ecx*4],eax
   mov eax,[esi+ecx*4]	; Second cycle

   inc ecx
    jnz next		; Third cycle
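
The same loop rotation can be sketched in C, assuming 32-bit limbs as in the
Pentium code. This is only an illustration of the idea (loads for limb i+1
are issued while limb i is being added and stored); the function name, the
prologue/epilogue handling and the returned carry are assumptions of the
sketch, not taken from the assembly above.

```c
#include <stddef.h>
#include <stdint.h>

/* Rotated (software-pipelined) multi-limb add with 32-bit limbs.
   Returns the carry out of the top limb. */
uint32_t mpn_add_n_rotated(uint32_t *sum, const uint32_t *a,
                           const uint32_t *b, size_t n)
{
    if (n == 0)
        return 0;

    uint32_t x = a[0];              /* prologue: operands for limb 0 */
    uint32_t y = b[0];
    uint32_t c = 0;                 /* carry, always 0 or 1 */

    for (size_t i = 0; i + 1 < n; i++) {
        uint64_t t = (uint64_t)x + y + c;   /* add the limbs loaded earlier */
        y = b[i + 1];                       /* start the next loads early */
        sum[i] = (uint32_t)t;
        x = a[i + 1];
        c = (uint32_t)(t >> 32);
    }

    uint64_t t = (uint64_t)x + y + c;       /* epilogue: last limb */
    sum[n - 1] = (uint32_t)t;
    return (uint32_t)(t >> 32);
}
```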

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"