Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Wed, 19 Jun 2024 12:16:11 -0500
Organization: A noiseless patient Spider
Lines: 192
Message-ID: <v4v3ot$22rd9$1@dont-email.me>
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com>
 <v02eij$6d5b$1@dont-email.me>
 <152f8504112a37d8434c663e99cb36c5@www.novabbs.org>
 <v04tpb$pqus$1@dont-email.me> <v4f5de$2bfca$1@dont-email.me>
 <jwvzfrobxll.fsf-monnier+comp.arch@gnu.org> <v4f97o$2bu2l$1@dont-email.me>
 <613b9cb1a19b6439266f520e94e2046b@www.novabbs.org>
 <v4hsjk$2vk6n$1@dont-email.me>
 <6b5691e5e41d28d6cb48ff6257555cd4@www.novabbs.org>
 <v4tfu3$1ostn$1@dont-email.me>
 <96280554541a8a9b1a29a5cbd5b7c07b@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 19 Jun 2024 19:16:14 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="1b230587210f6f877000eb5e9d42f72f";
	logging-data="2190761"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX186Ahsa/hqNwYIEB6YFGdXovoUDzwCDI98="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:F2nW59nAqCf6ZqCjpuwjmEwnW7I=
Content-Language: en-US
In-Reply-To: <96280554541a8a9b1a29a5cbd5b7c07b@www.novabbs.org>
Bytes: 8580

On 6/19/2024 11:11 AM, MitchAlsup1 wrote:
> BGB wrote:
> 
>> On 6/18/2024 4:09 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 6/13/2024 3:40 PM, MitchAlsup1 wrote:
> 
>>>> In this case, scheduling as-if it were an in-order core was leading
>>>> to better performance than a more naive ordering (such as directly
>>>> using the results of previous instructions or memory loads, vs
>>>> shuffling other instructions in between them).
>>>
>>>> Either way, seemed to be different behavior than seen on either the 
>>>> Ryzen or on Intel Core based CPUs (where, seemingly, the CPU does 
>>>> not care about the relative order).
>>>
>>> Because it had no requirement of code scheduling, unlike 1st
>>> generation RISCs, so the cores were designed to put up good
>>> performance scores without any code scheduling.
>>>
> 
>> Yeah, but why was Bulldozer/Piledriver seemingly much more sensitive
>> to instruction scheduling issues than either its predecessors (such
>> as the Phenom II) or successors (Ryzen)?...
> 
> They "blew" the microarchitecture.
> 
> It was a 12-gate machine (down from 16 gates in the Athlon). This puts
> a "lot more stuff" on critical paths, and some forwarding was not done,
> particularly for a change in size between the produced result and the
> consumed operand.
> 

OK.

The stuff I could find didn't mention any of this...


Mostly just a lot of mentions of "low IPC", without much other
clarification or context given.


>> Though, apparently "low IPC" was a noted issue with this processor
>> family (apparently trying to gain higher clock speeds at the expense
>> of IPC; using a 20-stage pipeline, ...).
> 
>> Though, it is less obvious how having a longer pipeline than either
>> its predecessors or successors would affect instruction scheduling.
> 
>>
>>>
>>> One of the things we found in Mc 88120 was that the compiler should
>>> NEVER be allowed to put unnecessary instructions in decode-execute
>>> slots that were unused--and that, almost invariably, the best code
>>> for the GBOoO machine was the one with the fewest instructions; and
>>> if several sequences had equally few instructions, it basically did
>>> not matter.
>>>
>>> For example::
>>>
>>>      for( i = 0; i < max; i++ )
>>>           a[i] = b[i];
>>>
>>> was invariably faster than::
>>>
>>>      for( ap = &a[0], bp = &b[0], i = 0; i < max; i++ )
>>>           *ap++ = *bp++;
>>>
>>> because the latter has 3 ADDs in the loop while the former has but 1.
>>> Because of this, I altered my programming style and almost never end
>>> up using ++ or -- anymore.
> 
> 
> 
>> In this case, it would often be something more like:
>>    maxn4=max&(~3);
>>    for(i=0; i<maxn4; i+=4)
>>    {
>>      ap=a+i;    bp=b+i;
>>      t0=bp[0];  t1=bp[1];
>>      t2=bp[2];  t3=bp[3];
>>      ap[0]=t0;  ap[1]=t1;
>>      ap[2]=t2;  ap[3]=t3;
>>    }
>>    if(max!=maxn4)
>>    {
>>      for(; i < max; i++ )
>>        a[i] = b[i];
>>    }
> 
> That is what VVM does, without you having to lift a finger.
> 
>> If things are partially or fully unrolled, they often go faster.
> 
> And ALWAYS eat more code space.
> 

Granted, but it is faster in this case, though mostly due to being able
to sidestep some of the interlock penalties and to reduce the number of
cycles spent on the loop overhead itself.

Say, since branching isn't free, more so if one does an increment 
directly before checking the condition and branching, as is typically 
the case in a "for()" loop.

Then, there is also the issue of needing to check the condition on entry 
to the loop body, which is often not necessary on the first iteration, 
but (except in very narrow cases) is not easily optimized away.
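The transformation in question, as a minimal C sketch (names are
illustrative): rotating the loop into a guarded do/while pays the entry
test once, leaving a single backward branch per iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal sketch, assuming a plain element-copy loop: a "for" loop
 * tests the condition once per iteration *and* once on entry; rotated
 * into a guarded do/while, the entry test runs only once. */
static void copy_rotated(int *a, const int *b, size_t max)
{
    size_t i = 0;
    if (i < max) {              /* entry guard, executed once */
        do {
            a[i] = b[i];
            i++;
        } while (i < max);      /* single backward branch per iteration */
    }
}
```

A compiler doing "loop inversion" performs essentially this rewrite, but
(as noted above) it can only drop the entry guard in narrow cases where
the trip count is provably nonzero.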


>> Using a large number of local variables seems to be effective (even
>> in cases where the number of local variables exceeds the number of
>> CPU registers).
> 
>> Generally also using as few branches as possible.
>> Etc...

Though, can note that BJX2 does have an advantage over x86-64 here in that:
   if(cond)
     { simple-one-liner }
Will typically be turned into predicated instructions.
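As a rough C sketch of what that predication amounts to (hand-written
select here; the function name is illustrative, and a predicated ISA or
x86 CMOV would get the same effect from the plain "if" form):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal sketch: the branchy one-liner "if (x < lo) x = lo;" written
 * as a branch-free select, the same operation a predicated instruction
 * or conditional move performs. */
static int32_t clamp_lo(int32_t x, int32_t lo)
{
    int32_t mask = -(int32_t)(x < lo);   /* all-ones if true, else 0 */
    return (lo & mask) | (x & ~mask);    /* select without a branch */
}
```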

On Piledriver (and to a lesser extent on K10/Phenom), this was a big 
pain case, but on Ryzen this seems to have become less of an issue as well.

It is almost a mystery whether they are now using the "implicitly turn
short forward branches into predication" trick.


I suspect this was a big factor in LZ4 decoding performance, since the
handling of match and literal lengths requires multiple back-to-back
branch instructions, and the number of bytes consumed also depends on
this.
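For reference, a minimal sketch of LZ4-style length decoding (simplified,
not a full decoder): a 4-bit token field saturates at 15, after which
each extension byte adds 0..255, so both the length and the number of
bytes consumed depend on data-dependent branches.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal sketch of LZ4-style length decoding. A token field of 15
 * means the length continues in extension bytes, each adding 0..255,
 * stopping at the first byte below 255 -- one branch per extension
 * byte, which is the pain point described above. */
static size_t read_length(const uint8_t **pp, unsigned field)
{
    const uint8_t *p = *pp;
    size_t len = field;
    if (field == 15) {          /* length continues past the token */
        uint8_t b;
        do {
            b = *p++;
            len += b;
        } while (b == 255);     /* branch per extension byte */
    }
    *pp = p;                    /* consumed bytes depend on the data */
    return len;
}
```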

Though, an LZ4 variant with lengths limited to a single byte (and an
alteration to allow for encoding no-match and EOB cases) could gain a
little speed here by allowing the length handling to be turned into
predication. But, I mostly ended up not using it.


Contrast with my RP2 scheme which moves more of this handling to being 
up-front (and needs fewer memory accesses to decode a match).

Though, it does lead to some wonk:
RP2 is a little faster and also often gets slightly better compression 
on BJX2;
LZ4 is faster than RP2 on my Ryzen (but was slower on Piledriver).

Though, can note that compression depends on what is being compressed:
   Text or other similarly structured data:
     RP2 tends to win;
   Executable code and sparse data structures, LZ4 tends to win.

In terms of decode speed, though, both run circles around Deflate or
other entropy-coded designs.


Though, for things like structure and RAM compression, a format based on 
aligned DWORDs or QWORDs can work acceptably (but is pretty much 
entirely ineffective against text or other byte-serialized data 
formats). Though, this is mostly because RAM and structure-based data 
========== REMAINDER OF ARTICLE TRUNCATED ==========