Deutsch English Français Italiano |
<uvir8v$6ua2$1@dont-email.me> View for Bookmarking (what is this?) Look up another Usenet article |
Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail From: Terje Mathisen <terje.mathisen@tmsw.no> Newsgroups: comp.arch Subject: Re: "Mini" tags to reduce the number of op codes Date: Mon, 15 Apr 2024 11:16:15 +0200 Organization: A noiseless patient Spider Lines: 127 Message-ID: <uvir8v$6ua2$1@dont-email.me> References: <uuk100$inj$1@dont-email.me> <2024Apr3.192405@mips.complang.tuwien.ac.at> <86d1dd03deee83e339afa725524ab259@www.novabbs.org> <uvimv7$629s$1@dont-email.me> MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable Injection-Date: Mon, 15 Apr 2024 11:16:15 +0200 (CEST) Injection-Info: dont-email.me; posting-host="9d6aa58f39643660529f6affbcde0704"; logging-data="227650"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+XOWurrERRSJPBlkLLdKolSOd5ddwLJK6xGdRMMxwg7Q==" User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0 SeaMonkey/2.53.18.2 Cancel-Lock: sha1:N481Kg8dt1yuHmQ5US66IxCdhlQ= In-Reply-To: <uvimv7$629s$1@dont-email.me> Bytes: 5761 Terje Mathisen wrote: > MitchAlsup1 wrote: >> Anton Ertl wrote: >> >>> I have a similar problem for the carry and overflow bits in >>> < http://www.complang.tuwien.ac.at/anton/tmp/carry.pdf >, and chose t= o >>> let those bits not survive across calls; if there was a cheap solutio= n >>> for the problem, it would eliminate this drawback of my idea. >> >> My 66000 ISA can encode the mpn_add_n() inner loop in 5-instructions >> whereas RISC-V encodes the inner loop in 11 instructions. >> >> Source code: >> >> void mpn_add_n( uint64_t sum, uint64_t a, unit64_t b, int n ) >> { >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 uint64_t c =3D 0; >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 for( int i =3D 0; i < n; i+= + ) >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 { >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82= =C2=A0=C3=82=C2=A0=C3=82=C2=A0 {c, sum[i]} =3D a[i] + b[i] + c; >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 } >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 return >> } >> >> Assembly code:: >> >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 .global mpn_add_n >> mpn_add_n: >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 MOV=C3=82=C2=A0=C3=82=C2=A0= R5,#0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 // c >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 MOV=C3=82=C2=A0=C3=82=C2=A0= R6,#0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 // i >> >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 VEC=C3=82=C2=A0=C3=82=C2=A0= R7,{} >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 LDD=C3=82=C2=A0=C3=82=C2=A0= R8,[R2,Ri<<3] >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 LDD=C3=82=C2=A0=C3=82=C2=A0= R9,[R3,Ri<<3] >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 CARRY R5,{{IO}} >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 ADD=C3=82=C2=A0=C3=82=C2=A0= R10,R8,R9 >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 STD=C3=82=C2=A0=C3=82=C2=A0= R10,[R1,Ri<<3] >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 LOOP=C3=82=C2=A0 LT,R6,#1,R= 4 >> =C2=A0=C3=82=C2=A0=C3=82=C2=A0=C3=82=C2=A0 RET >> >> So, adding a few "bells and whistles" to RISC-V does give you a >> performance gain (1.38=C3=83=C6=92=C3=A2=E2=82=AC=E2=80=9D); using a w= ell designed ISA gives you a >> performance gain of 2.00=C3=83=C6=92=C3=A2=E2=82=AC=E2=80=9D !! {{mora= l: don't stop too early}} >> >> Note that all the register bookkeeping has disappeared !! because >> of the indexed memory reference form. >> >> As I count executing instructions, VEC does not execute, nor does >> CARRY--CARRY causes the subsequent ADD to take C input as carry and >> the carry produced by ADD goes back in C. Loop performs the ADD-CMP- >> BC sequence in a single instruction and in a single clock. >=20 > =C2=A0 ; RSI->a[n], RDX->b[n], RDI->sum[n], RCX=3D-n > =C2=A0 xor rax,rax ;; Clear carry > next: > =C2=A0 mov rax,[rsi+rcx*8] > =C2=A0 adc rax,[rdx+rcx*8] > =C2=A0 mov [rdi+rcx*8],rax > =C2=A0 inc rcx > =C2=A0=C2=A0 jnz next >=20 > The code above is 5 instructions, or 6 if we avoid the load-op, doing=20 > two loads and one store, so it should only be limited by the latency of= =20 > the ADC, i.e. one or two cycles. >=20 > In the non-OoO (i.e Pentium) days, I would have inverted the loop in=20 > order to hide the latencies as much as possible, resulting in an inner = > loop something like this: >=20 > =C2=A0next: > =C2=A0 adc eax,ebx > =C2=A0 mov ebx,[edx+ecx*4]=C2=A0=C2=A0=C2=A0 ; First cycle >=20 > =C2=A0 mov [edi+ecx*4],eax > =C2=A0 mov eax,[esi+ecx*4]=C2=A0=C2=A0=C2=A0 ; Second cycle >=20 > =C2=A0 inc ecx > =C2=A0=C2=A0 jnz next=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ; Thir= d cycle >=20 In the same bad old days, the standard way to speed it up would have=20 used unrolling, but until we got more registers, it would have stopped=20 itself very quickly. With AVX2 we could use 4 64-bit slots in a 32-byte=20 register, but then we would have needed to handle the carry propagation=20 manually, and that would take longer than a series of ADC/ADX instruction= s. next4: mov eax,[esi] adc eax,[esi+edx] mov [esi+edi],eax mov eax,[esi+4] adc eax,[esi+edx+4] mov [esi+edi+4],eax mov eax,[esi+8] adc eax,[esi+edx+8] mov [esi+edi+8],eax mov eax,[esi+12] adc eax,[esi+edx+12] mov [esi+edi+12],eax lea esi,[esi+16] dec ecx jnz next4 Terje --=20 - <Terje.Mathisen at tmsw.no> "almost all programming can be viewed as an exercise in caching"