Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Mon, 22 Apr 2024 02:44:09 -0500
Organization: A noiseless patient Spider
Lines: 69
Message-ID: <v054gb$r679$1@dont-email.me>
References: <lge02j554ucc6h81n5q2ej0ue2icnnp7i5@4ax.com>
 <e2097beb24bf27eed0a92f14596bd59e@www.novabbs.org>
 <in312jlca131khq3vj0i24n6pb0hah2ur5@4ax.com>
 <71acfecad198c4e9a9b14ffab7fc1cb5@www.novabbs.org>
 <1s042jdli35gdo092v6uaupmrcmvo0i5vp@4ax.com>
 <oj742jdvpl21il2s5a1ndsp3oidsnfjmr6@4ax.com>
 <dd1866c4efb369b7b6cc499d718dc938@www.novabbs.org>
 <acq62j98dhmguil5ebce6lq4m9kkgt1fs2@4ax.com>
 <kkq62jppr53is4r70n151jl17bjd5kd6lv@4ax.com>
 <9d1fadaada2ec0683fc54688cce7cf27@www.novabbs.org>
 <v017mg$3rcg9$1@dont-email.me>
 <da6dc5fe28bb31b4c73d78ef1aac2ac5@www.novabbs.org>
 <v02eij$6d5b$1@dont-email.me>
 <152f8504112a37d8434c663e99cb36c5@www.novabbs.org>
 <v04tpb$pqus$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 22 Apr 2024 09:44:11 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="7b1e3ac212388cea6886df46e04c8fee";
 logging-data="891113"; mail-complaints-to="abuse@eternal-september.org";
 posting-account="U2FsdGVkX18qqVe6hmnslVneJ+RZBW9acL+8yWPd1mA="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:3gTeugqaUUVegIUUX07vhnfkiAA=
In-Reply-To: <v04tpb$pqus$1@dont-email.me>
Content-Language: en-US
Bytes: 4480

On 4/22/2024 12:49 AM, Terje Mathisen wrote:
> MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
>>> Like, in-order superscalar isn't going to do crap if nearly every
>>> instruction depends on every preceding instruction. Even pipelining
>>> can't help much with this.
>>
>> Pipelining CREATED this (back to back dependencies).
>> No amount of pipelining can eradicate RAW data dependencies.
>>
>>> The compiler can shuffle the instructions into an order to limit the
>>> number of register dependencies and better fit the pipeline. But,
>>> then, most of the "hard parts" are already done (so it doesn't take
>>> much more for the compiler to flag which instructions can run in
>>> parallel).
>>
>> Compiler scheduling works for exactly 1 pipeline implementation and
>> is suboptimal for all others.
>
> Well, yeah.
>
> OTOH, if your (definitely not my!) compiler can schedule a 4-wide static
> ordering of operations, then it will be very nearly optimal on 2-wide
> and 3-wide as well. (The difference is typically in a bit more loop
> setup and cleanup code than needed.)
>
> Hand-optimizing Pentium asm code did teach me to "think like a cpu",
> which is probably the only part of the experience which is still kind of
> relevant. :-)
>

Mine is hard-pressed to even make effective use of the current pipeline,
so going wider does not make sense at present.

As I had noted before, the main merit of 3-wide in my case is that it
makes it easier to justify a 6R register file, which, unlike the 4R
register file, doesn't choke when trying to run other instructions in
parallel with a memory store and similar. This is actually a fairly
serious restriction, given how much memory operations tend to clog up
Lane 1 (opportunities for "ALU|ST" being more common than "ALU|ALU").

Granted, one could argue that (Reg, Disp) memory addressing could be
supported entirely within a 2R1W pattern, which, while true in
principle, does not match my implementation (which always uses indexed
addressing internally, treating the Disp as a virtual register, and so
a store eats 3 register ports).

Well, and for the 4R2W configuration, the main priority is minimizing
LUT cost (which favors leaving it as-is, with the current restrictions).

Granted, some similar issues apply to 128-bit MOV.X and SIMD ops, which
as-is can only exist as scalar ops.
These could potentially also be hacked around (say, to allow ALU|SIMD or
ALU|MOV.X), but the "fix" would cost a lot of LUTs, mostly because
variability in input routing does not come cheap.

Though, that said, the 3rd lane still gets used for a share of basic ALU
instructions, so it isn't entirely going to waste either.

> Terje
>
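
As a rough illustration of the read-port argument above (why "ALU|ST" does not fit a 4R register file but does fit a 6R one), here is a minimal sketch in Python. The per-op port counts are assumptions taken from the description in the post (a store using base, index, and value; an ALU op using two sources), and the names READ_PORTS and fits are hypothetical, not anything from the actual core or compiler.

```python
# Read ports needed per op class. These counts are illustrative
# assumptions based on the post, not the real core's tables.
READ_PORTS = {
    "ALU": 2,  # two source registers
    "ST": 3,   # base + index (Disp treated as a virtual register) + store value
}

def fits(bundle, read_ports_available):
    """Return True if all ops in `bundle` can issue together
    without exceeding the register file's read-port budget."""
    return sum(READ_PORTS[op] for op in bundle) <= read_ports_available

if __name__ == "__main__":
    for bundle in (("ALU", "ALU"), ("ALU", "ST")):
        for ports in (4, 6):
            verdict = "fits" if fits(bundle, ports) else "does not fit"
            print("|".join(bundle), f"with {ports}R:", verdict)
```

Under these assumed counts, ALU|ALU needs 4 read ports and ALU|ST needs 5, which is the case that only the 6R file can serve.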