Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: kegs@provalid.com (Kent Dickey)
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Thu, 13 Jun 2024 16:06:07 -0000 (UTC)
Organization: provalid.com
Lines: 160
Message-ID: <v4f5de$2bfca$1@dont-email.me>
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com> <v02eij$6d5b$1@dont-email.me> <152f8504112a37d8434c663e99cb36c5@www.novabbs.org> <v04tpb$pqus$1@dont-email.me>
Injection-Date: Thu, 13 Jun 2024 18:06:07 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="d50e80dd393e33730100b3af1858d653";
	logging-data="2473354"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/9Kl8LUCyYkY94vXrdLQvH"
Cancel-Lock: sha1:WzkB+bnFHkwI9Dk8QlBkgDVFOp4=
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
Originator: kegs@provalid.com (Kent Dickey)
Bytes: 5156

In article <v04tpb$pqus$1@dont-email.me>,
Terje Mathisen  <terje.mathisen@tmsw.no> wrote:
>MitchAlsup1 wrote:
>> BGB wrote:
>> 
>>> On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
>>> Like, in-order superscalar isn't going to do crap if nearly every 
>>> instruction depends on every preceding instruction. Even pipelining 
>>> can't help much with this.
>> 
>> Pipelining CREATED this (back to back dependencies). No amount of
>> pipelining can eradicate RAW data dependencies.
>> 
>>> The compiler can shuffle the instructions into an order to limit the 
>>> number of register dependencies and better fit the pipeline. But, 
>>> then, most of the "hard parts" are already done (so it doesn't take 
>>> much more for the compiler to flag which instructions can run in 
>>> parallel).
>> 
>> Compiler scheduling works for exactly 1 pipeline implementation and
>> is suboptimal for all others.
>
>Well, yeah.
>
>OTOH, if your (definitely not my!) compiler can schedule a 4-wide static 
>ordering of operations, then it will be very nearly optimal on 2-wide 
>and 3-wide as well. (The difference is typically in a bit more loop 
>setup and cleanup code than needed.)
>
>Hand-optimizing Pentium asm code did teach me to "think like a cpu", 
>which is probably the only part of the experience which is still kind of 
>relevant. :-)
>
>Terje
>
>-- 
>- <Terje.Mathisen at tmsw.no>
>"almost all programming can be viewed as an exercise in caching"


This is a late reply, but an optimal static ordering for N-wide may be
very non-optimal for N-1 (or N-2, etc.).  As an example, assume a perfectly
scheduled 4-wide sequence of instructions, each labeled with its group
number and a letter A-D for its position in the group.  Each instruction
depends on the same-letter instruction in the previous group (A on A, B on
B, etc.), and each A also depends on the previous group's D.  Here's what
the instruction groupings look like on a 4-way machine:

INST0_A
INST0_B
INST0_C
INST0_D
-------
INST1_A
INST1_B
INST1_C
INST1_D
-------
INST2_A

There will obviously be other dependencies (say, INST2_A depends on INST0_B)
but they don't affect how this will be executed.
The ----- lines indicate group boundaries.  All instructions in a group
execute in the same cycle.  So the first 8 instructions take just 2 clocks
on a 4-wide.

If you run this sequence on a 3-wide, then the groupings will become:

INST0_A
INST0_B
INST0_C
-------
INST0_D
-------
INST1_A
INST1_B
INST1_C
-------
INST1_D
-------

What took 2 clocks on the 4-wide now takes 4 clocks on the 3-wide: INST0_D
spills into its own group, and INST1_A can't join it because it depends on
INST0_D.  A different arrangement would take just 3 clocks:

INST0_A
INST0_B
INST0_D
-------
INST1_A
INST0_C
INST1_B
-------
INST1_C
INST1_D
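
A minimal sketch of the grouping rule used in these listings, assuming
in-order issue, single-cycle latency, and no resource limit other than
width.  The function name issue_groups and the short labels ("0A" for
INST0_A) are made up for illustration:

```python
def issue_groups(order, deps, width):
    """Split a static instruction order into in-order issue groups.

    order: instruction labels in program order
    deps:  dict mapping a label to the set of labels it depends on
    width: maximum number of instructions issued per clock
    A new group starts when the current one is full or when the next
    instruction depends on something already in the current group.
    """
    groups = []
    for inst in order:
        current = groups[-1] if groups else None
        if (current is None or len(current) == width
                or any(p in current for p in deps.get(inst, ()))):
            groups.append([inst])
        else:
            current.append(inst)
    return groups


# Dependencies from the first example: same letter in the previous group,
# plus each A on the previous group's D.
deps = {"1A": {"0A", "0D"}, "1B": {"0B"}, "1C": {"0C"}, "1D": {"0D"}}
four_wide_order = ["0A", "0B", "0C", "0D", "1A", "1B", "1C", "1D"]
three_wide_order = ["0A", "0B", "0D", "1A", "0C", "1B", "1C", "1D"]

print(len(issue_groups(four_wide_order, deps, 4)))   # 2
print(len(issue_groups(four_wide_order, deps, 3)))   # 4
print(len(issue_groups(three_wide_order, deps, 3)))  # 3
```

The number of groups is the clock count, so the three calls reproduce the
2, 4 and 3 clocks from the listings above.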

-------------------------------

A similar problem occurs when the 4-wide is optimally scheduled, but doesn't
issue 4 instructions per cycle due to dependencies.  These dependencies can
hit at bad times for a 2-wide, making the schedule non-optimal there.  Here's
a new 4-wide sequence where INST1_A depends on INST0_C and INST0_A, and
INST2_* all depend on INST1_A, with this pattern repeating in even/odd groups.

INST0_A
INST0_B
INST0_C
-------
INST1_A
-------
INST2_A
INST2_B
INST2_C
-------
INST3_A
-------

This sequence takes 4 clocks on a 4-wide machine.

When run on a 2-wide machine, these are the groupings:

INST0_A
INST0_B
-------
INST0_C
-------
INST1_A
-------
INST2_A
INST2_B
-------
INST2_C
-------
INST3_A
-------

This takes 6 clocks.  But by moving INSTx_B, it could be faster:

INST0_A
INST0_C
-------
INST0_B
INST1_A
-------
INST2_A
INST2_C
-------
INST2_B
INST3_A
-------

Now it takes just 4 clocks.  So an optimal 4-wide schedule can be shown to
be very non-optimal on 3-wide or 2-wide systems.  And this isn't taking
into account other delays and resource limits (like number of loads and
stores supported per cycle).
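
As a hedged illustration of re-deriving the order for the narrower machine,
here is a sketch of greedy critical-path list scheduling.  That is a standard
textbook heuristic swapped in for illustration, not something the examples
above rely on.  It assumes single-cycle latency and no resource limit other
than width; list_schedule and the short labels are hypothetical names:

```python
def list_schedule(insts, deps, width):
    """Greedy list scheduling: each clock, issue the ready instructions
    that head the longest remaining dependency chains.

    insts: labels in original program order (producers before consumers)
    deps:  dict mapping a label to the set of labels it depends on
    width: maximum number of instructions issued per clock
    """
    # users[p] lists the instructions that consume p's result.
    users = {i: [] for i in insts}
    for inst, producers in deps.items():
        for p in producers:
            users[p].append(inst)

    # height[i] = length of the longest dependency chain starting at i.
    height = {}
    for inst in reversed(insts):
        height[inst] = 1 + max((height[u] for u in users[inst]), default=0)

    done, groups, remaining = set(), [], list(insts)
    while remaining:
        # Ready means every producer issued in an earlier clock.
        ready = [i for i in remaining if deps.get(i, set()) <= done]
        ready.sort(key=lambda i: -height[i])  # stable: ties keep program order
        group = ready[:width]
        groups.append(group)
        done.update(group)
        remaining = [i for i in remaining if i not in group]
    return groups


# Second example: 1A needs 0A and 0C, 2x need 1A, 3A needs 2A and 2C.
insts = ["0A", "0B", "0C", "1A", "2A", "2B", "2C", "3A"]
deps = {"1A": {"0A", "0C"}, "2A": {"1A"}, "2B": {"1A"}, "2C": {"1A"},
        "3A": {"2A", "2C"}}
for group in list_schedule(insts, deps, 2):
    print(group)
```

For width 2 this yields four groups and, like the hand-made arrangement,
pushes each INSTx_B one group later.  Four clocks is also the floor here:
eight instructions on a 2-wide need at least 8/2 = 4 clocks, and the chain
INST0_A, INST1_A, INST2_A, INST3_A is four long.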

It's an interesting problem as to how bad it can get.  With resource
limits, I suspect it can be an integer multiple slower, but just using
register dependencies, I'm not sure how bad it can get.  I just showed 50%
slower (6 clocks vs. 4), but I'm not sure if 100% slower is possible.

Kent