Path: ...!weretis.net!feeder9.news.weretis.net!news.nk.ca!rocksolid2!i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Wed, 19 Jun 2024 16:11:20 +0000
Organization: Rocksolid Light
Message-ID: <96280554541a8a9b1a29a5cbd5b7c07b@www.novabbs.org>
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com> <v02eij$6d5b$1@dont-email.me> <152f8504112a37d8434c663e99cb36c5@www.novabbs.org> <v04tpb$pqus$1@dont-email.me> <v4f5de$2bfca$1@dont-email.me> <jwvzfrobxll.fsf-monnier+comp.arch@gnu.org> <v4f97o$2bu2l$1@dont-email.me> <613b9cb1a19b6439266f520e94e2046b@www.novabbs.org> <v4hsjk$2vk6n$1@dont-email.me> <6b5691e5e41d28d6cb48ff6257555cd4@www.novabbs.org> <v4tfu3$1ostn$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Info: i2pn2.org;
logging-data="510610"; mail-complaints-to="usenet@i2pn2.org";
posting-account="7opjq6o0gOhusEORo6KGlWDqrGdcQlz3IQ8pYKMWkuY";
User-Agent: Rocksolid Light
X-Rslight-Posting-User: ac58ceb75ea22753186dae54d967fed894c3dce8
X-Rslight-Site: $2y$10$Z7KFZN3eDp.aIXEKfhzbsuVIskjG1ydm823PTXkDTV4OOv.qTqYJC
X-Spam-Checker-Version: SpamAssassin 4.0.0
Bytes: 4558
Lines: 99
BGB wrote:
> On 6/18/2024 4:09 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 6/13/2024 3:40 PM, MitchAlsup1 wrote:
>>> In this case, scheduling as-if it were an in-order core was leading to
>>> better performance than a more naive ordering (such as directly using
>>> the results of previous instructions or memory loads, vs shuffling
>>> other instructions in between them).
>>
>>> Either way, seemed to be different behavior than seen on either the
>>> Ryzen or on Intel Core based CPUs (where, seemingly, the CPU does not
>>> care about the relative order).
>>
>> Because it had no requirement of code scheduling, unlike 1st-generation
>> RISCs, so the cores were designed to put up good performance scores
>> without any code scheduling.
>>
> Yeah, but why was Bulldozer/Piledriver seemingly much more sensitive to
> instruction scheduling issues than either its predecessors (such as the
> Phenom II) or its successors (Ryzen)?...
They "blew" the microarchitecture.
It was a 12-gate machine (down from 16 gates in Athlon). This puts
a "lot more stuff" on critical paths, and some forwarding was not done,
particularly when the size changed between the produced result and the
consumed operand.
> Though, apparently "low IPC" was a noted issue with this processor
> family (apparently trying to gain higher clock speeds at the expense of
> IPC; using a 20-stage pipeline, ...).
> Though, it is less obvious how having a longer pipeline than either its
> predecessors or successors would affect instruction scheduling.
>
>>
>> One of the things we found in Mc 88120 was that the compiler should
>> NEVER be allowed to put unnecessary instructions in decode-execute
>> slots that were unused--and that, almost invariably, the best code for
>> the GBOoO machine was the one with the fewest instructions; and if
>> several sequences had equally few instructions, it basically did not
>> matter which was chosen.
>>
>> For example::
>>
>> for( i = 0; i < max; i++ )
>>     a[i] = b[i];
>>
>> was invariably faster than::
>>
>> for( ap = &a[0], bp = &b[0], i = 0; i < max; i++ )
>>     *ap++ = *bp++;
>>
>> because the latter has 3 ADDs in the loop while the former has but 1.
>> Because of this, I altered my programming style and almost never end up
>> using ++ or -- anymore.
> In this case, it would often be something more like:
>   maxn4=max&(~3);
>   for(i=0; i<maxn4; i+=4)
>   {
>     ap=a+i; bp=b+i;
>     t0=bp[0]; t1=bp[1];
>     t2=bp[2]; t3=bp[3];
>     ap[0]=t0; ap[1]=t1;
>     ap[2]=t2; ap[3]=t3;
>   }
>   if(max!=maxn4)
>   {
>     for(; i < max; i++ )
>       a[i] = b[i];
>   }
That is what VVM does, without you having to lift a finger.
> If things are partially or fully unrolled, they often go faster.
And ALWAYS eat more code space.
> Using a
> large number of local variables seems to be effective (even in cases
> where the number of local variables exceeds the number of CPU
> registers).
> Generally also using as few branches as possible.
> Etc...