Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Terje Mathisen <terje.mathisen@tmsw.no>
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Fri, 14 Jun 2024 11:22:02 +0200
Organization: A noiseless patient Spider
Lines: 136
Message-ID: <v4h23r$2qt1u$1@dont-email.me>
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com>
 <v02eij$6d5b$1@dont-email.me>
 <152f8504112a37d8434c663e99cb36c5@www.novabbs.org>
 <v04tpb$pqus$1@dont-email.me> <v4f5de$2bfca$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 14 Jun 2024 11:22:03 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="fdf5fff8c2d1debad183bcd0e3d496af";
	logging-data="2978878"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/ZTuGfxkTbw7W3iafyWtDHwFBq8qebwQgbHQA8/dzIWg=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101
 Firefox/91.0 SeaMonkey/2.53.18.2
Cancel-Lock: sha1:YI26rp2FZMdvqbZ144Cs93hE7zk=
In-Reply-To: <v4f5de$2bfca$1@dont-email.me>
Bytes: 5012

Kent Dickey wrote:
> In article <v04tpb$pqus$1@dont-email.me>,
> Terje Mathisen  <terje.mathisen@tmsw.no> wrote:
>> MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
>>>> Like, in-order superscalar isn't going to do crap if nearly every
>>>> instruction depends on every preceding instruction. Even pipelining
>>>> can't help much with this.
>>>
>>> Pipelining CREATED this (back to back dependencies). No amount of
>>> pipelining can eradicate RAW data dependencies.
>>>
>>>> The compiler can shuffle the instructions into an order to limit the
>>>> number of register dependencies and better fit the pipeline. But,
>>>> then, most of the "hard parts" are already done (so it doesn't take
>>>> much more for the compiler to flag which instructions can run in
>>>> parallel).
>>>
>>> Compiler scheduling works for exactly 1 pipeline implementation and
>>> is suboptimal for all others.
>>
>> Well, yeah.
>>
>> OTOH, if your (definitely not my!) compiler can schedule a 4-wide static
>> ordering of operations, then it will be very nearly optimal on 2-wide
>> and 3-wide as well. (The difference is typically in a bit more loop
>> setup and cleanup code than needed.)
>>
>> Hand-optimizing Pentium asm code did teach me to "think like a cpu",
>> which is probably the only part of the experience which is still kind of
>> relevant. :-)
>>
>> Terje
>>
>> -- 
>> - <Terje.Mathisen at tmsw.no>
>> "almost all programming can be viewed as an exercise in caching"
> 
> 
> This is a late reply, but optimal static ordering for N-wide may be
> very non-optimal for N-1 (or N-2, etc.).  As an example, assume a perfectly
> scheduled 4-wide sequence of instructions with the instructions labeled
> with the group number, and letter A-D for the position in the group.
> There is a dependency from A to A, B to B, etc., and a dependency from D
> to A.  Here's what the instruction groupings look like on a 4-way machine:
> 
> INST0_A
> INST0_B
> INST0_C
> INST0_D
> -------
> INST1_A
> INST1_B
> INST1_C
> INST1_D
> -------
> INST2_A
> 
> There will obviously be other dependencies (say, INST2_A depends on INST0_B)
> but they don't affect how this will be executed.
> The ----- lines indicate group boundaries.  All instructions in a group
> execute in the same cycle.  So the first 8 instructions take just 2 clocks
> on a 4-wide.
> 
> If you run this sequence on a 3-wide, then the groupings will become:
> 
> INST0_A
> INST0_B
> INST0_C
> -------
> INST0_D
> -------
> INST1_A
> INST1_B
> INST1_C
> -------
> INST1_D
> -------

OK, you did state that INST1_A depends on INST0_D, but you showed a bit 
later that neither A nor D depends on C, so C can be used as a filler:

> INST0_A
> INST0_B
> INST0_D
> -------
> INST1_A
> INST0_C
> INST1_B
> -------
> INST1_C
> INST1_D

Obviously you cannot follow this up with INST2_A; you would need INST2_B 
here, and then A/C/D in the next cycle:

  INST2_B
  -------
  INST2_A
  INST2_C
  INST2_D

at which point the pattern could repeat itself.

Running this slightly modified ordering on a 4-wide would again fail 
(INST1_A directly follows INST0_D, so the first group stalls after three 
instructions), but if I instead write it like this:

  INST0_A
  INST0_B
  INST0_D
  -------
  INST0_C
  INST1_A
  INST1_B
  -------
  INST1_D
  INST1_C
  INST2_B
  -------
  INST2_A
  INST2_C
  INST2_D

then a regrouping for the 4-wide would still have one instruction from 
each of the A/B/C/D chains in each cycle, and A would never stall waiting 
for a D issued in the same cycle.
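
The groupings above can be checked with a small Python sketch of a 
strictly in-order issue model. This is only an illustration under stated 
assumptions: every result is available one cycle after issue, a cycle 
ends at the first instruction that has to stall, the original sequence 
is extended through all of group 2, and the names parse/deps/cycles are 
made up for this example.

```python
# Toy in-order issue model (a sketch under assumptions, not a real pipeline).
# "A1" means INST1_A. Results are assumed usable one cycle after issue.

def parse(text):
    """Turn "A0 B0 ..." into a list of (group, lane) tuples."""
    return [(int(tok[1:]), tok[0]) for tok in text.split()]

def deps(instr):
    """Instructions that `instr` must wait for: same lane in the previous
    group (A->A, B->B, ...), plus D->A from the previous group."""
    group, lane = instr
    if group == 0:
        return set()
    needed = {(group - 1, lane)}
    if lane == "A":
        needed.add((group - 1, "D"))
    return needed

def cycles(order, width):
    """Cycles needed to issue `order` in program order, `width` per cycle."""
    done = set()            # instructions issued in earlier cycles
    pos = 0
    count = 0
    while pos < len(order):
        issued = []
        while pos < len(order) and len(issued) < width:
            if not deps(order[pos]) <= done:
                break       # operand not ready yet: end this cycle
            issued.append(order[pos])
            pos += 1
        if not issued:
            raise ValueError("ordering violates a dependency")
        done.update(issued)
        count += 1
    return count

# The three orderings discussed in this post (group 2 of the original
# 4-wide schedule is an assumed continuation of the quoted pattern).
original = parse("A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2")
first_try = parse("A0 B0 D0 A1 C0 B1 C1 D1 B2 A2 C2 D2")
final = parse("A0 B0 D0 C0 A1 B1 D1 C1 B2 A2 C2 D2")

for name, order in [("original", original),
                    ("first_try", first_try),
                    ("final", final)]:
    print(name, {w: cycles(order, w) for w in (3, 4)})
```

Under these assumptions the last ordering needs 4 cycles on the 3-wide 
and 3 cycles on the 4-wide, which is the minimum for 12 instructions on 
both, while the first reordering needs 4 cycles on either width and the 
original 4-wide schedule needs 6 cycles on the 3-wide.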

This is probably close to the pattern a 3- or 4-wide OoO machine would 
settle into after a bunch of iterations.

Terje
-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"