From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Thu, 19 Sep 2024 16:01:48 +0000
Organization: Rocksolid Light
Message-ID: <35b8ff2e6baa54c7aa22ec4edf45c3f9@www.novabbs.org>

On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:

> Brett wrote:
>> EricP wrote:
>>
>> They claim 5 cycles, should be six, five for the multiply and one more
>> for the second result, unless the next instruction does not need a write
>> port, and does not use the result. You can get a throughput of 5 cycles
>> with smart coding, but that rarely happens without effort.
>
> That article is ignoring multiplier pipelining.
> If the multiplier is pipelined with a latency of 5 and throughput of 1,
> then MULL takes 5 cycles and MULL,MULH takes 6.
>
> But those two multiplies still are tossing away 50% of their work.
> And if it does fuse them then the internal uArch cost is the same as if
> you had designed it optimally from the start, except now you have
> to pay for a fuser.

You failed to recognize the critical part of my comment on this::

When the IMUL function unit sees MULL and MULH back to back AND when
both operands are the same for both instructions, it KNOWS that the
second multiply has the same result as the first and thereby that the
second multiply can be suppressed and the first multiply used twice.
{{In pure CMOS, if you drop the same operands twice into the multiplier
tree, the multiplier tree burns no power in any event, just the operand
delivery power.}}

You may call this fusion, but it is the very lowest level of it and was
not called such when first used.

>
> This idea that macro-op fusion is some magic solution is bullshit.

Agreed.

> 1) It's not free.

Far from it.

> 2) It only works where Decode can see *all* the required lookahead
>    instructions, which means you have to pay for an N-lane decoder
>    but only get 1 lane.

I think it is but a crutch for a misdesigned ISA.

> 3) It's probabilistic as it depends on how the fetch buffers get loaded.
>    Eg if the fetch buffer contains a valid instruction but does not have
>    a next instruction, do you stall Decode to see if a fuser might arrive
>    or dispatch it anyway.

It can be worse than that.

> 4) It gets exponentially expensive if you start doing multiple instruction
>    lanes because decode has to deal with all the permutations of
>    fusion possibilities.

All the more reason to have a better ISA.

> 5) Any fused instructions leave (multiple) bubbles that should be
>    compacted out or there wasn't much point to doing the fusion.

One of the interesting things I have noticed with my ISA is that when
one has a properly designed higher level ISA, one gets rid of so many
of the "easy to schedule" instructions that one ends up with 30 FMAC
instructions in a row, with no other instruction to occupy any of the
other function units.

> In my opinion it is better to have an ISA that is optimal by design
> rather than being patched up by fusion later.

Indeed.

> Some of this inefficiency is caused by clinging to now 40 year old
> risc design *guidelines* (ie not even rules) that:
> - instructions have at most 1 dest and 2 source registers

Makes FMAC hard.

> - register specifier fields are either source or dest, never both

I happen to be wishy-washy on this.

> - instructions should take at most 1 clock (they never did)

This never worked for floating point anyway...and many consider
branches and memory references as not fitting that tenet either.

What is required is that each instruction can be decoded in a single
cycle and delivered to whichever function unit in one cycle.

> These self imposed design restrictions cause ISA designers to miss
> some possible more optimal solutions. The result is things like
> RISC-V's memory reference linkage structures taking 6 instructions
> to build a 64-bit PC-relative address. And I'm pretty sure we won't
> see any 6 instruction fusers for quite some time.

And it is just "so unnecessary".

I suspect that RISC-V will end up choosing AUIPC-LD-JMP instead, losing
the PIC nature of flow control.

Doing it right the first time is so much easier for everyone now and
down the line.