From: BGB
Newsgroups: comp.arch
Subject: Re: Stealing a Great Idea from the 6600
Date: Sun, 21 Apr 2024 22:59:12 -0500

On 4/21/2024 8:16 PM, John Savard wrote:
> On Sun, 21 Apr 2024 18:57:27 +0000, mitchalsup@aol.com (MitchAlsup1)
> wrote:
>> BGB wrote:
>
>>> Like, in-order superscalar isn't going to do crap if nearly every
>>> instruction depends on every preceding instruction. Even pipelining
>>> can't help much with this.
>
>> Pipelining CREATED this (back to back dependencies). No amount of
>> pipelining can eradicate RAW data dependencies.
>
> This is quite true. However, in case an unsophisticated individual
> might read this thread, I think that I shall clarify.
>
> Without pipelining, it is not a problem if each instruction depends on
> the one immediately previous, and so people got used to writing
> programs that way, as it was simple to write the code to do one thing
> before starting to write the code to begin doing another thing.
>

Yeah. This is also typical of naive compiler output, say:
  y=m*x+b;

Turns into RPN as, say:
  LD(m) LD(x) MUL LD(b) ADD ST(y)

Which, in a naive compiler (though one with register allocation), may
become, say:
  MULS R8, R9, R12
  ADD  R12, R10, R13
  MOV  R13, R11    //result first goes into a temporary

But, if MUL is 3c and ADD is 2c, this ends up needing 6 cycles.

The situation would be significantly worse in a compiler lacking
register allocation (it would add 8 memory operations to this; similar
to what one gets with "gcc -O0").

For the most part, as can be noted, I was comparing against "gcc -O3"
on the RV64 side, with "-ffunction-sections" and "-Wl,--gc-sections"
and similar, as otherwise GCC's output is significantly larger.
Though, never mind the seemingly fairly bulky ELF metadata (PE/COFF is
seemingly a bit more compact here). Can note that "-O3" vs "-Os" also
doesn't seem to make that big of a difference for RV64.

If one has another expression, one can shuffle the operations from the
two expressions together, and the latency is lower than had no
shuffling occurred; and if one can reduce dependencies enough,
operations can be run in parallel for further gain (see the worked
sketch below).
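As a worked sketch of this (my own toy example, not actual BGBCC
output; the registers are arbitrary, and it just applies the 3-cycle
MUL / 2-cycle ADD figures above to a single-issue, in-order pipeline
with interlocks), take two independent "y=m*x+b" statements:

  // Naive order, one statement at a time:
  MULS R8,  R9,  R12   // issues c1, result at end of c3
  ADD  R12, R10, R13   // stalls on R12; issues c4, result c5
  MULS R16, R17, R20   // issues c5 (in order, behind the ADD), result c7
  ADD  R20, R18, R21   // stalls on R20; issues c8, result c9

  // Shuffled, same work, dependencies spread apart:
  MULS R8,  R9,  R12   // issues c1, result c3
  MULS R16, R17, R20   // independent, issues c2, result c4
  ADD  R12, R10, R13   // issues c4 (R12 now ready), result c5
  ADD  R20, R18, R21   // issues c5 (R20 now ready), result c6

Same four instructions, roughly 9 cycles down to 6 in this toy model;
and a 2-wide bundle could pair up the two MULs for a further cycle.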
But, all this depends on first being able to shuffle things to break
up the register-register dependencies between instructions. In BGBCC,
this part was done via the WEXifier, which imposes a lot of annoying
restrictions (partly because it starts working after code generation
has already taken place). In size-optimized code, this shuffling
doesn't happen, which results in a performance hit.

This is partly since the WEXifier can only work with 32-bit
instructions, can't cross labels or relocs, and requires the register
allocator to essentially round-robin the registers to minimize
dependencies, ...

But, preferentially always allocating a new register and avoiding
reusing registers within a basic block, while it reduces dependencies,
also eats a lot more registers (with the indirect cost of increasing
the number that need to be saved/restored, though the size impact of
this is reduced somewhat via prolog/epilog compression).

Though, one can shuffle stuff at the 3AC level (which exists in my
case between the RPN and final code generation), but this is more
hit-or-miss. Better would have been to go from 3AC to a "virtual
assembler", which could then allow reordering before emitting the
actual machine code (and thus wouldn't be as restricted). This was
originally considered, but I ended up not going this way as it seemed
like more work (in terms of internal restructuring) than shoving the
logic in after the machine code was generated.

But, the current compiler architecture was the result of always doing
the most quick/dirty option at the time, which doesn't necessarily
result in an optimal design. Granted, OTOH, the "waterfall method"
doesn't really have the best track record either (vs the "hack
something together, hack on it some more, ..." method).

> This remained true when the simplest original form of pipelining was
> brought in - where fetching one instruction from memory was overlapped
> with decoding the previous instruction, and executing the instruction
> before that.
>
> It's only when what was originally called "superpipelining" came
> along, where the execute stages of multiple successive instructions
> could be overlapped, that it was necessary to do something about
> dependencies in order to take advantage of the speedup that could
> provide.
>

Yeah. Pipeline:

  PF: PC arrives at I$.
    PC selected from:
      If branch: Branch-PC;
      Else, if branch-predicted: branch-predictor result;
      Else: LastPC+PCStep.
  IF: Fetches 96 bits at PC.
    Figures out how much to advance PC;
    Figures out if we can do superscalar here:
      Check for register clashes;
      Check for valid prefix and suffix;
      If both checks pass, go for it.
  ID: Unpack instruction words;
    Pipeline now splits into 3 lanes;
    Branch predictor does its thing.
  ID2/RF: Results come in from the registers;
    Figure out if the current bundle can enter the EX stages;
    Figure out if each predicated instruction should execute.
  EX1 (EX1C|EX1B|EX1A): Do stuff: ALU, initiate memory access, ...
  EX2 (EX2C|EX2B|EX2A): Do stuff | results arrive.
  EX3 (EX3C|EX3B|EX3A): Results arrive; produce any final results.
  WB: Results are written into the register file.

By EX1, it is known whether or not the branch will actually be taken,
so (if needed) it may override the earlier guess of the branch
predictor. By EX2, the branch initiation takes effect, and by EX3, the
new PC reaches the I$ (overriding whatever else would have normally
arrived). A rough timeline of this redirect is sketched below.

In a few cases (such as a jump between ISA modes), extra cycles may be
needed to make sure everything is caught up (so, the same PC address
is held on the I$ input for around 3 cycles in this case). This may
happen if, say:
  Jumping between Baseline, XG2, or RISC-V;
  WEXMD changing whether WEX decoding is enabled or disabled;
    If disabled, it behaves as if the WEX'ed instructions were scalar;
    Jumbo prefixes ignore this (always behaving as if it were enabled);
  ...

This is mostly to make sure that IF and ID decode the instructions
correctly for the mode in question.
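As a rough timeline of that redirect (this is just my reading of the
stage list above, assuming a mispredicted taken branch and ignoring
the ISA-mode-change stalls; the actual design may shift things by a
cycle either way):

  cycle:    1    2    3    4    5    6    7
  BRcc:     PF   IF   ID   RF   EX1  EX2  EX3  // outcome known in EX1
  wrong+1:       PF   IF   ID   RF   --        // squashed
  wrong+2:            PF   IF   ID   --        // squashed
  ...
  target:                             PF       // new PC at the I$ by EX3

So, in this sketch, a mispredicted taken branch costs on the order of
5 lost fetch slots, which is why the branch predictor's earlier
redirect (back at the PF stage) matters so much.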
> John Savard