From: BGB
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Sun, 21 Apr 2024 22:59:12 -0500

On 4/21/2024 8:16 PM, John Savard wrote:
> On Sun, 21 Apr 2024 18:57:27 +0000, mitchalsup@aol.com (MitchAlsup1)
> wrote:
>> BGB wrote:
>
>>> Like, in-order superscalar isn't going to do crap if nearly every
>>> instruction depends on every preceding instruction. Even pipelining
>>> can't help much with this.
>
>> Pipelining CREATED this (back to back dependencies). No amount of
>> pipelining can eradicate RAW data dependencies.
>
> This is quite true. However, in case an unsophisticated individual
> might read this thread, I think that I shall clarify.
>
> Without pipelining, it is not a problem if each instruction depends on
> the one immediately previous, and so people got used to writing
> programs that way, as it was simple to write the code to do one thing
> before starting to write the code to begin doing another thing.
>

Yeah. This is also typical of naive compiler output, say:
  y=m*x+b;

Turns into RPN as, say:
  LD(m) LD(x) MUL LD(b) ADD ST(y)

Which, in a naive compiler (though one with register allocation), may
become, say:
  MULS R8, R9, R12
  ADD  R12, R10, R13
  MOV  R13, R11    //result first goes into a temporary

But, if MUL is 3c and ADD is 2c, this ends up needing 6 cycles.

The situation would be significantly worse in a compiler lacking
register allocation (it would add 8 memory operations to this; similar
to what one gets with "gcc -O0").

For the most part, as can be noted, I was comparing against "gcc -O3"
on the RV64 side, with "-ffunction-sections" and "-Wl,--gc-sections"
and similar, as otherwise GCC's output is significantly larger.
Though, never mind the seemingly fairly bulky ELF metadata (PE/COFF is
seemingly a bit more compact here). Can note that "-O3" vs "-Os" also
doesn't seem to make that big of a difference for RV64.

If one has another expression, one can shuffle the operations from the
two expressions together, and the latency is lower than had no
shuffling occurred; and if one can reduce dependencies enough,
operations can be run in parallel for further gain (see the worked
sketch below).
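As a worked sketch of this (my own toy example, not actual BGBCC
output; the registers are arbitrary, and it just applies the 3-cycle
MUL / 2-cycle ADD figures above to a single-issue, in-order pipeline
with interlocks), take two independent "y=m*x+b" statements:

  // Naive order, one statement at a time:
  MULS R8,  R9,  R12   // issues c1, result at end of c3
  ADD  R12, R10, R13   // stalls on R12; issues c4, result c5
  MULS R16, R17, R20   // issues c5 (in order, behind the ADD), result c7
  ADD  R20, R18, R21   // stalls on R20; issues c8, result c9

  // Shuffled, same work, dependencies spread apart:
  MULS R8,  R9,  R12   // issues c1, result c3
  MULS R16, R17, R20   // independent, issues c2, result c4
  ADD  R12, R10, R13   // issues c4 (R12 now ready), result c5
  ADD  R20, R18, R21   // issues c5 (R20 now ready), result c6

Same four instructions, roughly 9 cycles down to 6 in this toy model;
and a 2-wide bundle could pair up the two MULs for a further cycle.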
But, all this depends on first being able to shuffle things to break
up the register-register dependencies between instructions. In BGBCC,
this part was done via the WEXifier, which imposes a lot of annoying
restrictions (partly because it starts working after code generation
has already taken place). In size-optimized code, this shuffling
doesn't happen, which results in a performance hit.

This is partly since the WEXifier can only work with 32-bit
instructions, can't cross labels or relocs, and requires the register
allocator to essentially round-robin the registers to minimize
dependencies, ...

But, preferentially always allocating a new register and avoiding
reusing registers within a basic block, while it reduces dependencies,
also eats a lot more registers (with the indirect cost of increasing
the number that need to be saved/restored, though the size impact of
this is reduced somewhat via prolog/epilog compression).

Though, one can shuffle stuff at the 3AC level (which exists in my
case between the RPN and final code generation), but this is more
hit-or-miss. Better would have been to go from 3AC to a "virtual
assembler", which could then allow reordering before emitting the
actual machine code (and thus wouldn't be as restricted). This was
originally considered, but I ended up not going this way as it seemed
like more work (in terms of internal restructuring) than shoving the
logic in after the machine code was generated.

But, the current compiler architecture was the result of always doing
the most quick/dirty option at the time, which doesn't necessarily
result in an optimal design. Granted, OTOH, the "waterfall method"
doesn't really have the best track record either (vs the "hack
something together, hack on it some more, ..." method).

> This remained true when the simplest original form of pipelining was
> brought in - where fetching one instruction from memory was overlapped
> with decoding the previous instruction, and executing the instruction
> before that.
>
> It's only when what was originally called "superpipelining" came
> along, where the execute stages of multiple successive instructions
> could be overlapped, that it was necessary to do something about
> dependencies in order to take advantage of the speedup that could
> provide.
>

Yeah. Pipeline:

  PF: PC arrives at I$.
    PC selected from:
      If branch: Branch-PC;
      Else, if branch-predicted: branch-predictor result;
      Else: LastPC+PCStep.
  IF: Fetches 96 bits at PC.
    Figures out how much to advance PC;
    Figures out if we can do superscalar here:
      Check for register clashes;
      Check for valid prefix and suffix;
      If both checks pass, go for it.
  ID: Unpack instruction words;
    Pipeline now splits into 3 lanes;
    Branch predictor does its thing.
  ID2/RF: Results come in from the registers;
    Figure out if the current bundle can enter the EX stages;
    Figure out if each predicated instruction should execute.
  EX1 (EX1C|EX1B|EX1A): Do stuff: ALU, initiate memory access, ...
  EX2 (EX2C|EX2B|EX2A): Do stuff | results arrive.
  EX3 (EX3C|EX3B|EX3A): Results arrive; produce any final results.
  WB: Results are written into the register file.

By EX1, it is known whether or not the branch will actually be taken,
so (if needed) it may override the earlier guess of the branch
predictor. By EX2, the branch initiation takes effect, and by EX3, the
new PC reaches the I$ (overriding whatever else would have normally
arrived). A rough timeline of this redirect is sketched below.

In a few cases (such as a jump between ISA modes), extra cycles may be
needed to make sure everything is caught up (so, the same PC address
is held on the I$ input for around 3 cycles in this case). This may
happen if, say:
  Jumping between Baseline, XG2, or RISC-V;
  WEXMD changing whether WEX decoding is enabled or disabled;
    If disabled, it behaves as if the WEX'ed instructions were scalar;
    Jumbo prefixes ignore this (always behaving as if it were enabled);
  ...

This is mostly to make sure that IF and ID decode the instructions
correctly for the mode in question.
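As a rough timeline of that redirect (this is just my reading of the
stage list above, assuming a mispredicted taken branch and ignoring
the ISA-mode-change stalls; the actual design may shift things by a
cycle either way):

  cycle:    1    2    3    4    5    6    7
  BRcc:     PF   IF   ID   RF   EX1  EX2  EX3  // outcome known in EX1
  wrong+1:       PF   IF   ID   RF   --        // squashed
  wrong+2:            PF   IF   ID   --        // squashed
  ...
  target:                             PF       // new PC at the I$ by EX3

So, in this sketch, a mispredicted taken branch costs on the order of
5 lost fetch slots, which is why the branch predictor's earlier
redirect (back at the PF stage) matters so much.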
> John Savard