Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Terje Mathisen <terje.mathisen@tmsw.no>
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Fri, 14 Jun 2024 11:22:02 +0200
Organization: A noiseless patient Spider
Lines: 136
Message-ID: <v4h23r$2qt1u$1@dont-email.me>
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com> <v02eij$6d5b$1@dont-email.me> <152f8504112a37d8434c663e99cb36c5@www.novabbs.org> <v04tpb$pqus$1@dont-email.me> <v4f5de$2bfca$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 14 Jun 2024 11:22:03 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="fdf5fff8c2d1debad183bcd0e3d496af"; logging-data="2978878"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/ZTuGfxkTbw7W3iafyWtDHwFBq8qebwQgbHQA8/dzIWg=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:91.0) Gecko/20100101 Firefox/91.0 SeaMonkey/2.53.18.2
Cancel-Lock: sha1:YI26rp2FZMdvqbZ144Cs93hE7zk=
In-Reply-To: <v4f5de$2bfca$1@dont-email.me>
Bytes: 5012

Kent Dickey wrote:
> In article <v04tpb$pqus$1@dont-email.me>,
> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>> MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
>>>> Like, in-order superscalar isn't going to do crap if nearly every
>>>> instruction depends on every preceding instruction. Even pipelining
>>>> can't help much with this.
>>>
>>> Pipelining CREATED this (back to back dependencies). No amount of
>>> pipelining can eradicate RAW data dependencies.
>>>
>>>> The compiler can shuffle the instructions into an order to limit the
>>>> number of register dependencies and better fit the pipeline. But,
>>>> then, most of the "hard parts" are already done (so it doesn't take
>>>> much more for the compiler to flag which instructions can run in
>>>> parallel).
>>>
>>> Compiler scheduling works for exactly 1 pipeline implementation and
>>> is suboptimal for all others.
>>
>> Well, yeah.
>>
>> OTOH, if your (definitely not my!) compiler can schedule a 4-wide static
>> ordering of operations, then it will be very nearly optimal on 2-wide
>> and 3-wide as well. (The difference is typically in a bit more loop
>> setup and cleanup code than needed.)
>>
>> Hand-optimizing Pentium asm code did teach me to "think like a cpu",
>> which is probably the only part of the experience which is still kind of
>> relevant. :-)
>>
>> Terje
>>
>> --
>> - <Terje.Mathisen at tmsw.no>
>> "almost all programming can be viewed as an exercise in caching"
>
>
> This is a late reply, but optimal static ordering for N-wide may be
> very non-optimal for N-1 (or N-2, etc.). As an example, assume a perfectly
> scheduled 4-wide sequence of instructions with the instructions labeled
> with the group number, and letter A-D for the position in the group.
> There is a dependency from A to A, B to B, etc., and a dependency from D
> to A. Here's what the instruction groupings look like on a 4-way machine:
>
> INST0_A
> INST0_B
> INST0_C
> INST0_D
> -------
> INST1_A
> INST1_B
> INST1_C
> INST1_D
> -------
> INST2_A
>
> There will obviously be other dependencies (say, INST2_A depends on INST0_B)
> but they don't affect how this will be executed.
> The ----- lines indicate group boundaries. All instructions in a group
> execute in the same cycle. So the first 8 instruction take just 2 clocks
> on a 4-wide.
>
> If you run this sequence on a 3-wide, then the groupings will become:
>
> INST0_A
> INST0_B
> INST0_C
> -------
> INST0_D
> -------
> INST1_A
> INST1_B
> INST1_C
> -------
> INST1_D
> -------

OK, you did state that A1 depends on D0, but then showed a bit later
that neither A nor D depended on C, so you could use that as a filler.

> INST0_A
> INST0_B
> INST0_D
> -------
> INST1_A
> INST0_C
> INST1_B
> -------
> INST1_C
> INST1_D

Obviously you cannot follow this up with INST2_A; you would need INST2_B
here and then A/C/D on the next cycle:

INST2_B
-------
INST2_A
INST2_C
INST2_D

at which point the pattern could repeat itself.

Running this slightly modified ordering on a 4-wide would again fail,
but if I instead write it like this:

INST0_A
INST0_B
INST0_D
--------
INST0_C
INST1_A
INST1_B
--------
INST1_D
INST1_C
INST2_B
--------
INST2_A
INST2_C
INST2_D

then a re-grouping for the 4-wide would still have one instruction from
each ABCD group in each cycle and A would never stall waiting for a
previous D in the same cycle.

This is probably close to the patterns an OoO 3 or 4-wide would settle
down on after a bunch of iterations.

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
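
The orderings above can be checked with a small model. What follows is only
a rough sketch under assumptions that are not stated in the thread: strict
in-order issue, every instruction has 1-cycle latency, and a group is closed
as soon as it is full or the next instruction's producer has not completed.
The names deps, cycles, parse, NAIVE, FIRST_3WIDE and TERJE are made up for
this illustration.

```python
# Hypothetical in-order issue model (an assumption, not from the thread):
# up to `width` instructions issue per cycle, in program order, and every
# result is available one cycle after its producer issues.

def deps(inst):
    """Producers of (iteration, letter): the same letter in the previous
    iteration, plus D of the previous iteration when the letter is A."""
    n, letter = inst
    if n == 0:
        return []
    producers = [(n - 1, letter)]
    if letter == "A":
        producers.append((n - 1, "D"))
    return producers

def cycles(order, width):
    """Number of cycles needed to issue `order` on a `width`-wide machine."""
    issued_at = {}        # instruction -> cycle in which it issued
    cycle, slots = 1, 0   # current group and how many slots it has used
    for inst in order:
        # Earliest cycle permitted by the data dependencies.
        ready = max((issued_at[p] + 1 for p in deps(inst)), default=1)
        if slots == width or ready > cycle:
            # Group is full, or this instruction must wait: open a new group.
            cycle = max(cycle + 1, ready)
            slots = 0
        issued_at[inst] = cycle
        slots += 1
    return cycle

def parse(text):
    """Turn '0A 0B 1D' into [(0, 'A'), (0, 'B'), (1, 'D')]."""
    return [(int(tok[:-1]), tok[-1]) for tok in text.split()]

# The original 4-wide schedule, in program order.
NAIVE = parse("0A 0B 0C 0D 1A 1B 1C 1D 2A 2B 2C 2D")
# The first 3-wide reordering (C used as filler).
FIRST_3WIDE = parse("0A 0B 0D 1A 0C 1B 1C 1D 2B 2A 2C 2D")
# The second reordering, meant to work on both 3-wide and 4-wide.
TERJE = parse("0A 0B 0D 0C 1A 1B 1D 1C 2B 2A 2C 2D")

for name, order in [("naive", NAIVE), ("first", FIRST_3WIDE), ("terje", TERJE)]:
    print(name, "3-wide:", cycles(order, 3), "4-wide:", cycles(order, 4))
```

Under this model, three iterations (12 instructions) take 6 cycles on the
3-wide and 3 on the 4-wide for the original order, 4 and 4 for the first
reordering, and 4 and 3 for the second, which agrees with the groupings
written out by hand. An out-of-order machine is not modelled here.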