Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: "Paul A. Clayton" <paaronclayton@gmail.com>
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Sun, 22 Sep 2024 16:43:38 -0400
Organization: A noiseless patient Spider
Lines: 155
Message-ID: <vcpvhs$2bgj0$1@dont-email.me>
References: <vaqgtl$3526$1@dont-email.me> <p1cvdjpqjg65e6e3rtt4ua6hgm79cdfm2n@4ax.com> <2024Sep10.101932@mips.complang.tuwien.ac.at> <ygn8qvztf16.fsf@y.z> <2024Sep11.123824@mips.complang.tuwien.ac.at> <vbsoro$3ol1a$1@dont-email.me> <867cbhgozo.fsf@linuxsc.com> <20240912142948.00002757@yahoo.com> <vbuu5n$9tue$1@dont-email.me> <20240915001153.000029bf@yahoo.com> <vc6jbk$5v9f$1@paganini.bofh.team> <20240915154038.0000016e@yahoo.com> <vc70sl$285g2$4@dont-email.me> <vc73bl$28v0v$1@dont-email.me> <OvEFO.70694$EEm7.38286@fx16.iad> <32a15246310ea544570564a6ea100cab@www.novabbs.org> <vc7a6h$2afrl$2@dont-email.me> <50cd3ba7c0cbb587a55dd67ae46fc9ce@www.novabbs.org> <vc8qic$2od19$1@dont-email.me> <fCXFO.4617$9Rk4.4393@fx37.iad> <vcb730$3ci7o$1@dont-email.me> <7cBGO.169512$_o_3.43954@fx17.iad> <vcffub$77jk$1@dont-email.me> <n7XGO.89096$15a6.87061@fx12.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 22 Sep 2024 22:43:40 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="5645eae8204df80f973f25404ee2db0b"; logging-data="2474592"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+wPEccKAsFPL85V25OMfdZOepeI7Ir3ag="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.0
Cancel-Lock: sha1:1FiRSbUalGszDGmGT7BAu3Atu2c=
In-Reply-To: <n7XGO.89096$15a6.87061@fx12.iad>
Bytes: 8508

On 9/19/24 11:07 AM, EricP wrote:
[snip]
> If the multiplier is pipelined with a latency of 5 and throughput of 1,
> then MULL takes 5 cycles and MULL,MULH takes 6.
>
> But those two multiplies still are tossing away 50% of their work.

I do not remember how multipliers are actually implemented (and am
not motivated to refresh my memory at the moment), but I thought a
multiply low would not need to generate the upper bits, so I do not
understand where your "50% of their work" is coming from. The high
result needs the low result carry-out but not the rest of the result.

(An approximate multiply high for multiply by reciprocal might be
useful, avoiding the low result work. There might also be ways that
a multiplier could be configured to also provide bit mixing similar
to middle result for generating a hash?)

I seem to recall a PowerPC implementation did semi-pipelined 32-bit
multiplication 16 bits at a time. This presumably saved area and
power while also facilitating early out for small multiplicands, at
the cost of some latency and substantial throughput compared to a
fully pipelined multiplication. If I remember correctly, this
produced a result for 16-bit by 32-bit multiplication, which is
different from generating a low or high result.

> And if it does fuse them then the internal uArch cost is the same as if
> you had designed it optimally from the start, except now you have
> to pay for a fuser.
>
> <sound of soap box being dragged out>
> This idea that macro-op fusion is some magic solution is bullshit.
> 1) It's not free.

Neither is increasing the number of opcodes or providing extender
prefixes. If one wants binary compatibility, non-fusing
implementations would work. (I tend to favor providing a translation
layer between software distribution format and instruction cache
format, which reduces the binary compatibility constraint.)

> 2) It only works where Decode can see *all* the required lookahead
>    instructions, which means you have to pay for an N-lane decoder
>    but only get 1 lane.

Most fusion is for two adjacent instructions, which significantly
limits the complexity.
The fusable patterns are also a subset of all pairs of two
instructions, so complete two-way decoding may not be needed.

There may also be optimization opportunities from looking ahead.
Mitch Alsup proposed such for branch handling in a scalar
implementation. Apart from fusion, there might be advantages for
avoiding bank conflicts in a banked register file. I.e., the cost of
lookahead might be shared by multiple techniques/optimizations.

I tend to agree that fusion tends to be a workaround for sub-optimal
instruction encoding, but it seems that encoding involves a lot of
tradeoffs.

> 3) It's probabilistic as it depends on how the fetch buffers get loaded.
>    Eg if the fetch buffer contains a valid instruction but does not have
>    a next instruction, do you stall Decode to see if a fuser might arrive
>    or dispatch it anyway.

This is also somewhat true for variable length encodings that cross
fetch boundaries. In general a boundary-crossing instruction would
probably stall even if such was not strictly necessary (e.g., if the
missing information is opcode refinement that is not related to
instruction routing, or an immediate, or even a register source
identifier specifying a value that can have delayed use, such as the
value of a store or the addend of an FMADD).

This does seem a weakness, but fusion does not have only negative
factors.

> 4) It gets exponentially expensive if you start doing multiple instruction
>    lanes because decode has to deal with all the permutations of
>    fusion possibilities.

This is also a factor in mere superscalar decode/execute. Detecting
that an instruction is dependent on another would normally stall the
execution of that instruction. (I feel that encoding some of the
dependency information could be useful to avoid some of this work.
In theory, common dependency detection could also be more broadly
useful; e.g., operand availability detection and execution/operand
routing.)
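
To make the adjacent-pair case concrete, here is a toy sketch of a
decoder that only looks one instruction ahead and only recognizes one
pattern, the MULL,MULH pair from earlier in the thread. Everything in
it is hypothetical and for illustration only: the Insn tuple, the
register names, the fusable_mul_pair rule, and the group_for_issue
function do not describe any real decoder.

```python
from typing import List, NamedTuple, Tuple


class Insn(NamedTuple):
    """Hypothetical decoded instruction: opcode, destination, two sources."""
    op: str
    dest: str
    src1: str
    src2: str


def fusable_mul_pair(first: Insn, second: Insn) -> bool:
    """Assumed fusion rule: MULL followed by MULH on the same sources.

    The pair is rejected if MULL overwrites one of the shared sources,
    because MULH would then read a different value than MULL did.
    """
    same_sources = (first.src1, first.src2) == (second.src1, second.src2)
    clobbers_source = first.dest in (second.src1, second.src2)
    return (first.op == "mull" and second.op == "mulh"
            and same_sources and not clobbers_source
            and first.dest != second.dest)


def group_for_issue(window: List[Insn]) -> List[Tuple[Insn, ...]]:
    """Scan a fetch window, pairing only adjacent fusable instructions."""
    groups: List[Tuple[Insn, ...]] = []
    i = 0
    while i < len(window):
        if i + 1 < len(window) and fusable_mul_pair(window[i], window[i + 1]):
            groups.append((window[i], window[i + 1]))  # one fused operation
            i += 2
        else:
            groups.append((window[i],))  # issued unfused
            i += 1
    return groups


window = [
    Insn("mull", "r3", "r1", "r2"),
    Insn("mulh", "r4", "r1", "r2"),
    Insn("add", "r5", "r3", "r4"),
]
print([len(g) for g in group_for_issue(window)])  # prints [2, 1]
```

Note that a window ending in a lone MULL is simply issued unfused
here, which is the "dispatch it anyway" choice from point 3. A design
that stalls to wait for a possible MULH would need the extra state
that point 3 complains about.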

> 5) Any fused instructions leave (multiple) bubbles that should be
>    compacted out or there wasn't much point to doing the fusion.

Even with reduced operations per cycle, fusion could still provide a
net energy benefit.

> In my opinion it is better to have an ISA that is optimal by design
> rather than being patched up by fusion later.

Fusion is mostly presented for "patching up", but there are also
considerations of diverse microarchitectures. With pre-fused
instructions, an implementation might need to crack some of those
instructions. Software optimized for such an implementation might
also prefer more flexible compile-time scheduling of pre-cracked
operations. A load-op instruction is perhaps particularly difficult
because one needs frequent stalls, a skewed (or second chance)
pipeline to hide the load latency, out-of-order execution, or some
other stall avoidance mechanism.

There are also constraints in encoding granularity.

> Some of this inefficiency is caused by clinging to now 40 year old
> risc design *guidelines* (ie not even rules) that:
> - instructions have at most 1 dest and 2 source registers

FMADD seems to have mostly killed the 2-source limit. AArch64's
paired load breaks the single-destination limit. (Paired
destinations were common for early double precision
implementations.)

> - register specifier fields are either source or dest, never both

This seems mostly a code density consideration. I think using a
single name for both a source and a destination is not so horrible,
but I am not a hardware guy.

> - instructions should take at most 1 clock (they never did)

That was clearly overconstraining.

> These self imposed design restrictions cause ISA designers to miss
> some possible more optimal solutions. The result is things like
> RISC-V's memory reference linkage structures taking 6 instructions
> to build a 64-bit PC-relative address. And I'm pretty sure we won't
> see any 6 instruction fusers for quite some time.

I very much doubt a compiler would generate such outside of some
real-time application where the time constancy might justify the
code bloat.

> <sound of soap box being dragged back to cupboard>

I do not mean my response to be heckling. Your points are very true.
However, I think fusion is a technique, like cracking, that is a
natural part of an architect's toolbox.
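
P.S. Returning to the multiply low/high question near the top: the
arithmetic (not the hardware, which I am not claiming to describe)
can be sketched by splitting a 32-bit by 32-bit unsigned multiply
into four 16-bit by 16-bit partial products. The function names and
the 16-bit split are my own choices for illustration. In this
decomposition the low half never touches the high-by-high product,
and the high half uses the lower columns only for their carry-out.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def partial_products(a: int, b: int):
    """Return the four 16x16 products of two 32-bit unsigned values."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    return a_lo * b_lo, a_lo * b_hi, a_hi * b_lo, a_hi * b_hi


def mul_low(a: int, b: int) -> int:
    """Low 32 bits of a*b; the hi*hi product has weight 2**32 and is unused."""
    ll, lh, hl, _hh = partial_products(a, b)
    return (ll + ((lh + hl) << 16)) & MASK32


def mul_high(a: int, b: int) -> int:
    """High 32 bits of a*b; the low columns contribute only a carry."""
    ll, lh, hl, hh = partial_products(a, b)
    # Column with weight 2**16: upper half of ll plus lower halves of lh, hl.
    mid = (ll >> 16) + (lh & MASK16) + (hl & MASK16)
    carry = mid >> 16  # all that the high half needs from the low half
    return (hh + (lh >> 16) + (hl >> 16) + carry) & MASK32


a, b = 0xFFFFFFFF, 0xFFFFFFFF
print(hex(mul_low(a, b)), hex(mul_high(a, b)))  # prints 0x1 0xfffffffe
```

Whether a real multiplier array can skip a comparable share of work
for a low-only result depends on how the partial products are
generated and reduced, which this sketch does not model.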