From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Wed, 19 Jun 2024 16:11:20 +0000
Organization: Rocksolid Light
Message-ID: <96280554541a8a9b1a29a5cbd5b7c07b@www.novabbs.org>
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com>
 <v02eij$6d5b$1@dont-email.me>
 <152f8504112a37d8434c663e99cb36c5@www.novabbs.org>
 <v04tpb$pqus$1@dont-email.me> <v4f5de$2bfca$1@dont-email.me>
 <jwvzfrobxll.fsf-monnier+comp.arch@gnu.org>
 <v4f97o$2bu2l$1@dont-email.me>
 <613b9cb1a19b6439266f520e94e2046b@www.novabbs.org>
 <v4hsjk$2vk6n$1@dont-email.me>
 <6b5691e5e41d28d6cb48ff6257555cd4@www.novabbs.org>
 <v4tfu3$1ostn$1@dont-email.me>

BGB wrote:

> On 6/18/2024 4:09 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 6/13/2024 3:40 PM, MitchAlsup1 wrote:
>>> In this case, scheduling as if it were an in-order core was leading
>>> to better performance than a more naive ordering (such as directly
>>> using the results of previous instructions or memory loads, versus
>>> shuffling other instructions in between them).
>>
>>> Either way, it seemed to be different behavior than seen on either
>>> the Ryzen or on Intel Core based CPUs (where, seemingly, the CPU
>>> does not care about the relative order).
>> Because it had no requirement of code scheduling, unlike 1st
>> generation RISCs, the cores were designed to put up good performance
>> scores without any code scheduling.
>
> Yeah, but why was Bulldozer/Piledriver seemingly much more sensitive
> to instruction scheduling issues than either its predecessors (such
> as the Phenom II) or its successors (Ryzen)?...

They "blew" the microarchitecture. It was a 12-gate machine (down from
16 gates on Athlon). This puts a "lot more stuff" on critical paths,
and some forwarding was not done, particularly on a change in size
between the produced result and the consumed operand.

> Though, apparently "low IPC" was a noted issue with this processor
> family (apparently trying to gain higher clock speeds at the expense
> of IPC; using a 20-stage pipeline, ...).
>
> Though, it is less obvious how having a longer pipeline than either
> its predecessors or successors would affect instruction scheduling.

>> One of the things we found in the Mc 88120 was that the compiler
>> should NEVER be allowed to put unnecessary instructions in
>> decode-execute slots that were unused -- and that the best code for
>> the GBOoO machine was almost invariably the one with the fewest
>> instructions; if several sequences had equally few instructions, it
>> basically did not matter which was chosen.
>>
>> For example::
>>
>>     for( i = 0; i < max; i++ )
>>         a[i] = b[i];
>>
>> was invariably faster than::
>>
>>     for( ap = &a[0], bp = &b[0], i = 0; i < max; i++ )
>>         *ap++ = *bp++;
>>
>> because the latter has 3 ADDs in the loop while the former has but 1.
>> Because of this, I altered my programming style and almost never end
>> up using ++ or -- anymore.
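The two loop shapes being compared can be written out as compilable C
(a minimal sketch; the function names and element type are mine, not
from the thread):

```c
#include <stddef.h>

/* Index form: one ADD per iteration (the i++); a[i] and b[i] become
   scaled addressing modes on most ISAs, so no extra adds are needed. */
void copy_indexed(long *a, const long *b, size_t max)
{
    for (size_t i = 0; i < max; i++)
        a[i] = b[i];
}

/* Pointer form: three ADDs per iteration (ap++, bp++, and i++),
   which the post reports was invariably slower on the Mc 88120. */
void copy_pointers(long *a, const long *b, size_t max)
{
    long *ap = &a[0];
    const long *bp = &b[0];
    for (size_t i = 0; i < max; i++)
        *ap++ = *bp++;
}
```

Both produce identical results; the point of the 88120 finding is that
the version with fewer instructions in the loop body wins, not that
either is semantically better.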
> In this case, it would often be something more like:
>
>     maxn4 = max & (~3);
>     for(i = 0; i < maxn4; i += 4)
>     {
>         ap = a + i;  bp = b + i;
>         t0 = bp[0];  t1 = bp[1];
>         t2 = bp[2];  t3 = bp[3];
>         ap[0] = t0;  ap[1] = t1;
>         ap[2] = t2;  ap[3] = t3;
>     }
>     if(max != maxn4)
>     {
>         for(; i < max; i++)
>             a[i] = b[i];
>     }

That is what VVM does, without you having to lift a finger.

> If things are partially or fully unrolled, they often go faster.

And it ALWAYS eats more code space.

> Using a large number of local variables seems to be effective (even
> in cases where the number of local variables exceeds the number of
> CPU registers).
>
> Generally also using as few branches as possible.
>
> Etc...
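BGB's 4-way unroll with a scalar cleanup loop, assembled into one
self-contained function (a sketch; the name `copy4` and the `long`
element type are my assumptions):

```c
#include <stddef.h>

/* 4-way unrolled copy of b[] into a[], with a scalar remainder loop.
   Grouping the four loads before the four stores gives the pipeline
   independent operations to overlap, which is the scheduling point
   made in the thread. */
void copy4(long *a, const long *b, size_t max)
{
    size_t maxn4 = max & ~(size_t)3;   /* round down to multiple of 4 */
    size_t i;
    for (i = 0; i < maxn4; i += 4) {
        const long *bp = b + i;
        long *ap = a + i;
        long t0 = bp[0], t1 = bp[1];   /* all loads first */
        long t2 = bp[2], t3 = bp[3];
        ap[0] = t0; ap[1] = t1;        /* then all stores */
        ap[2] = t2; ap[3] = t3;
    }
    for (; i < max; i++)               /* tail when max % 4 != 0 */
        a[i] = b[i];
}
```

The tail loop handles any `max` not divisible by 4, so the function is
correct for all lengths, at the cost of the extra code space the reply
points out.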