From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Mon, 15 Apr 2024 20:55:53 +0000
Organization: Rocksolid Light
Message-ID: <983c789e7c6d9f3ca4ffe40fdc3aa709@www.novabbs.org>
References: <2024Apr3.192405@mips.complang.tuwien.ac.at> <86d1dd03deee83e339afa725524ab259@www.novabbs.org>

Terje Mathisen wrote:

> MitchAlsup1 wrote:
>>
> In the non-OoO (i.e Pentium) days, I would have inverted the loop in
> order to hide the latencies as much as possible, resulting in an inner
> loop something like this:
>
>   next:
>    adc eax,ebx
>    mov ebx,[edx+ecx*4]    ; First cycle
>
>    mov [edi+ecx*4],eax
>    mov eax,[esi+ecx*4]    ; Second cycle
>
>    inc ecx
>    jnz next               ; Third cycle
>
> Terje

As opposed to::

     .global mpn_add_n
mpn_add_n:
     MOV   R5,#0               // c
     MOV   R6,#0               // i

     VEC   R7,{}
     LDD   R8,[R2,Ri<<3]       // Load 128-to-512 bits
     LDD   R9,[R3,Ri<<3]       // Load 128-to-512 bits
     CARRY R5,{{IO}}
     ADD   R10,R8,R9           // Add pair to add octal
     STD   R10,[R1,Ri<<3]      // Store 128-to-512 bits
     LOOP  LT,R6,#1,R4         // increment 2-to-8 times
     RET

--------------------------------------------------------

     LDD   R8,[R2,Ri<<3]       // AGEN cycle 1
     LDD   R9,[R3,Ri<<3]       // AGEN cycle 2 data cycle 4
     CARRY R5,{{IO}}
     ADD   R10,R8,R9           // cycle 4
     STD   R10,[R1,Ri<<3]      // AGEN cycle 3 write cycle 5
     LOOP  LT,R6,#1,R4         // cycle 3

OR

     LDD LDd LDD LDd ADD ST STd LOOP
     LDD LDd LDD LDd ADD ST STd LOOP

10 instructions (2 iterations) in 4 clocks on a 64-bit 1-wide VVM machine !!
without code scheduling heroics.

40 instructions (8 iterations) in 4 clocks on a 512 wide SIMD VVM machine !!
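
For anyone who wants the loop semantics spelled out, here is a minimal C sketch of what both listings compute: an n-limb add with the carry chained from limb to limb. The name ref_add_n is hypothetical, and the argument order (result, first source, second source, count) is only inferred from the register use in the listing above (R1, R2, R3, R4); it is not taken from GMP's source or from any My 66000 documentation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference for the loop body: rp[i] = ap[i] + bp[i] + carry.
 * Assumed mapping to the listing: rp ~ R1, ap ~ R2, bp ~ R3, n ~ R4,
 * carry ~ R5 (c), i ~ R6. Returns the carry out of the top limb. */
uint64_t ref_add_n(uint64_t *rp, const uint64_t *ap,
                   const uint64_t *bp, size_t n)
{
    uint64_t carry = 0;                 /* MOV R5,#0 */
    for (size_t i = 0; i < n; i++) {    /* MOV R6,#0 ... LOOP LT,R6,#1,R4 */
        uint64_t a = ap[i];             /* first load */
        uint64_t s = a + bp[i];         /* second load, limb add */
        uint64_t c1 = (s < a);          /* carry out of a + b */
        uint64_t t = s + carry;         /* fold in the incoming carry */
        uint64_t c2 = (t < s);          /* carry out of that step */
        rp[i] = t;                      /* store */
        carry = c1 | c2;                /* at most one of c1, c2 is set */
    }
    return carry;
}
```

In C the carry has to be rebuilt with two compares per limb; in the listings it travels in the flags (adc) or in R5 via CARRY, which is why the loop body stays at five instructions. Taking the figures above at face value, that is 10/4 = 2.5 instructions per clock in the 64-bit case and 40/4 = 10 per clock in the 512-bit case.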