Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: kegs@provalid.com (Kent Dickey)
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Thu, 13 Jun 2024 16:06:07 -0000 (UTC)
Organization: provalid.com
Lines: 160
Message-ID: <v4f5de$2bfca$1@dont-email.me>
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com> <v02eij$6d5b$1@dont-email.me> <152f8504112a37d8434c663e99cb36c5@www.novabbs.org> <v04tpb$pqus$1@dont-email.me>
Injection-Date: Thu, 13 Jun 2024 18:06:07 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="d50e80dd393e33730100b3af1858d653"; logging-data="2473354"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/9Kl8LUCyYkY94vXrdLQvH"
Cancel-Lock: sha1:WzkB+bnFHkwI9Dk8QlBkgDVFOp4=
X-Newsreader: trn 4.0-test76 (Apr 2, 2001)
Originator: kegs@provalid.com (Kent Dickey)
Bytes: 5156

In article <v04tpb$pqus$1@dont-email.me>,
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
>>> Like, in-order superscalar isn't going to do crap if nearly every
>>> instruction depends on every preceding instruction. Even pipelining
>>> can't help much with this.
>>
>> Pipelining CREATED this (back to back dependencies). No amount of
>> pipelining can eradicate RAW data dependencies.
>>
>>> The compiler can shuffle the instructions into an order to limit the
>>> number of register dependencies and better fit the pipeline. But,
>>> then, most of the "hard parts" are already done (so it doesn't take
>>> much more for the compiler to flag which instructions can run in
>>> parallel).
>>
>> Compiler scheduling works for exactly 1 pipeline implementation and
>> is suboptimal for all others.
>
>Well, yeah.
>
>OTOH, if your (definitely not my!) compiler can schedule a 4-wide static
>ordering of operations, then it will be very nearly optimal on 2-wide
>and 3-wide as well. (The difference is typically in a bit more loop
>setup and cleanup code than needed.)
>
>Hand-optimizing Pentium asm code did teach me to "think like a cpu",
>which is probably the only part of the experience which is still kind of
>relevant. :-)
>
>Terje
>
>--
>- <Terje.Mathisen at tmsw.no>
>"almost all programming can be viewed as an exercise in caching"

This is a late reply, but optimal static ordering for N-wide may be
very non-optimal for N-1 (or N-2, etc.). As an example, assume a perfectly
scheduled 4-wide sequence of instructions with the instructions labeled
with the group number, and letter A-D for the position in the group.
Each instruction depends on the same-lettered instruction in the previous
group (A to A, B to B, etc.), and each A also depends on the previous
group's D. Here's what the instruction groupings look like on a 4-wide
machine:

INST0_A
INST0_B
INST0_C
INST0_D
-------
INST1_A
INST1_B
INST1_C
INST1_D
-------
INST2_A

There will obviously be other dependencies (say, INST2_A depends on
INST0_B) but they don't affect how this will be executed.
The ----- lines indicate group boundaries. All instructions in a group
execute in the same cycle. So the first 8 instructions take just 2 clocks
on a 4-wide.

If you run this sequence on a 3-wide, then the groupings will become:

INST0_A
INST0_B
INST0_C
-------
INST0_D
-------
INST1_A
INST1_B
INST1_C
-------
INST1_D
-------

What took 2 clocks on the 4-wide now takes 4 clocks on the 3-wide. And
a different arrangement would take just 3 clocks:

INST0_A
INST0_B
INST0_D
-------
INST1_A
INST0_C
INST1_B
-------
INST1_C
INST1_D
-------

------------------------

A similar problem occurs when the 4-wide is optimally scheduled, but
doesn't issue 4 instructions due to dependencies. These dependencies can
hit at bad times for 2-wide, causing it to not be optimal. Here's a new
4-wide sequence where INST1_A depends on INST0_C and INST0_A, and
INST2_* all depend on INST1_A, with this pattern repeating in even/odd
groups.

INST0_A
INST0_B
INST0_C
-------
INST1_A
-------
INST2_A
INST2_B
INST2_C
-------
INST3_A
-------

This sequence takes 4 clocks on a 4-wide machine. When run on a 2-wide
machine, these are the groupings:

INST0_A
INST0_B
-------
INST0_C
-------
INST1_A
-------
INST2_A
INST2_B
-------
INST2_C
-------
INST3_A
-------

This takes 6 clocks. But by moving INSTx_B, it could be faster:

INST0_A
INST0_C
-------
INST0_B
INST1_A
-------
INST2_A
INST2_C
-------
INST2_B
INST3_A
-------

Now it takes just 4 clocks.

So an optimal 4-wide schedule can be shown to be very non-optimal
on 3-wide or 2-wide systems. And this isn't taking into account other
delays and resource limits (like number of loads and stores supported per
cycle).

It's an interesting problem as to how bad it can get. With resource limits,
I suspect it can be an integer multiple bad, but just using register
dependencies, I'm not sure how bad it can get. I just showed 50% slower
(6 clocks versus 4), but I'm not sure if 100% slower is possible.

Kent
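
The clock counts above can be checked with a small simulation. This is only a sketch and is not part of the original post. It assumes strict in-order issue, single-cycle latency, and no resource limit other than issue width. The function name `issue_cycles` and the dependency tables `deps1` and `deps2` are hypothetical names made up for illustration.

```python
def issue_cycles(order, deps, width):
    """Count clocks for in-order issue of `order` on a `width`-wide machine.

    Assumptions (not from the post): every instruction has 1-cycle latency,
    and a group closes when it is full or when the next instruction depends
    on an instruction already in the current group.
    """
    cycles = 0
    group = set()
    for inst in order:
        blocked = any(d in group for d in deps.get(inst, ()))
        if len(group) == width or blocked:
            cycles += 1          # close the current group
            group = set()
        group.add(inst)
    return cycles + (1 if group else 0)


# First example: A->A, B->B, C->C, D->D, plus D->A.
deps1 = {
    "INST1_A": ["INST0_A", "INST0_D"],
    "INST1_B": ["INST0_B"],
    "INST1_C": ["INST0_C"],
    "INST1_D": ["INST0_D"],
}
sched4 = ["INST0_A", "INST0_B", "INST0_C", "INST0_D",
          "INST1_A", "INST1_B", "INST1_C", "INST1_D"]
sched3 = ["INST0_A", "INST0_B", "INST0_D", "INST1_A",
          "INST0_C", "INST1_B", "INST1_C", "INST1_D"]

# Second example: INST1_A needs INST0_A and INST0_C; INST2_* need INST1_A;
# the pattern repeats, so INST3_A needs INST2_A and INST2_C.
deps2 = {
    "INST1_A": ["INST0_A", "INST0_C"],
    "INST2_A": ["INST1_A"],
    "INST2_B": ["INST1_A"],
    "INST2_C": ["INST1_A"],
    "INST3_A": ["INST2_A", "INST2_C"],
}
wide4 = ["INST0_A", "INST0_B", "INST0_C", "INST1_A",
         "INST2_A", "INST2_B", "INST2_C", "INST3_A"]
wide2 = ["INST0_A", "INST0_C", "INST0_B", "INST1_A",
         "INST2_A", "INST2_C", "INST2_B", "INST3_A"]

if __name__ == "__main__":
    print(issue_cycles(sched4, deps1, 4), issue_cycles(sched4, deps1, 3),
          issue_cycles(sched3, deps1, 3))
    print(issue_cycles(wide4, deps2, 4), issue_cycles(wide4, deps2, 2),
          issue_cycles(wide2, deps2, 2))
```

Under those assumptions the 4-wide orderings cost 2 and 4 clocks on the 4-wide, 4 and 6 clocks on the narrower machines, and the rearranged orderings cost 3 and 4 clocks, matching the groupings drawn above. Adding per-cycle load/store limits would need an extra check in the group-closing test.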