From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Fri, 27 Sep 2024 18:01:40 +0000
Organization: Rocksolid Light
Message-ID:
References: <867cbhgozo.fsf@linuxsc.com> <20240912142948.00002757@yahoo.com> <20240915001153.000029bf@yahoo.com> <20240915154038.0000016e@yahoo.com> <32a15246310ea544570564a6ea100cab@www.novabbs.org> <50cd3ba7c0cbb587a55dd67ae46fc9ce@www.novabbs.org> <7cBGO.169512$_o_3.43954@fx17.iad>

On Wed, 25 Sep 2024 2:49:07 +0000, Paul A. Clayton wrote:

> On 9/22/24 6:19 PM, MitchAlsup1 wrote:
>> On Sun, 22 Sep 2024 20:43:38 +0000, Paul A. Clayton wrote:
>>
>>> On 9/19/24 11:07 AM, EricP wrote:
>>> [snip]
>>>> If the multiplier is pipelined with a latency of 5 and a
>>>> throughput of 1, then MULL takes 5 cycles and MULL,MULH takes 6.
>>>>
>>>> But those two multiplies are still tossing away 50% of their work.
>>>
>>> I do not remember how multipliers are actually implemented — and
>>> am not motivated to refresh my memory at the moment — but I
>>> thought a multiply low would not need to generate the upper bits,
>>> so I do not understand where your "50% of their work" is coming
>>> from.
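One back-of-the-envelope way to see where EricP's "50%" comes from (a
sketch only, not a model of anyone's actual hardware): a partial-product
bit a_i AND b_j lands at weight i + j, and only the bits with i + j < n
can influence the low n-bit result — roughly half of the n^2 partial
products.

```python
# Counting partial-product bits for an n x n multiplier array.
# A bit a_i AND b_j has weight i + j; only weights below n can
# affect the low half of the product. Hypothetical helper name;
# this is a counting argument, not a hardware description.
def partial_product_counts(n):
    total = n * n
    low_half = sum(1 for i in range(n) for j in range(n) if i + j < n)
    return total, low_half

total, low_half = partial_product_counts(64)
# n = 64: 4096 partial-product bits in all; 2080 (about half,
# n*(n+1)/2) suffice to form the low 64 bits of the product.
```

So a dedicated multiply-low could in principle omit about half the
partial-product array — which is exactly the tension in the exchange
above, since a combined MULL/MULH unit computes all of it anyway.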
>>
>>     +-----------+   +------------+
>>     \  mplier  /     \   mcand  /        Big input mux
>>      +--------+       +--------+
>>           |                |
>>           |      +--------------+
>>           |     /               /
>>           |    /               /
>>           +-- /               /
>>              /     Tree      /
>>             /               /--+
>>            /               /   |
>>           /               /    |
>>          +---------------+-----------+
>>                hi             low        Products
>>
>> Two n-bit operands are multiplied into a 2×n-bit result.
>> {{All the rest is HOW, not WHAT}}
>
> So are you saying the high bits come for free? This seems
> contrary to the conception of sums of partial products, where
> some of the partial products are only needed for the upper bits
> and so could (it seems to me) be left uncalculated if one only
> wanted the lower bits.

The high-order bits are free WRT gates of delay, but consume as
much area as the low-order bits. I was answering the question of
"I do not remember how multipliers are actually implemented".

>>> The high result needs the low result carry-out but not the rest
>>> of the result. (An approximate multiply high for multiply by
>>> reciprocal might be useful, avoiding the low result work. There
>>> might also be ways that a multiplier could be configured to also
>>> provide bit mixing similar to a middle result for generating a
>>> hash?)
>>>
>>> I seem to recall a PowerPC implementation did semi-pipelined
>>> 32-bit multiplication 16 bits at a time. This presumably saved
>>> area and power.
>>
>> You save 1/2 of the tree area, but ultimately consume more power.
>
> The power consumption would seem to depend on how frequently both
> multiplier and multiplicand are larger than 16 bits. (However, I
> seem to recall that the mentioned implementation only checked one
> operand.) I suspect that for a lot of code, small values are
> common.
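The 16-bits-at-a-time scheme discussed above can be sketched as
follows (a hedged illustration — the function name and structure are
mine, not the PowerPC implementation): a 32x32 multiply decomposes
into 16x16 pieces, and when both high halves are zero most of the
pieces contribute nothing, which is where the data-dependent power
saving would come from.

```python
# 32x32 -> 64 multiply assembled from 16x16 -> 32 pieces, in the
# spirit of the semi-pipelined scheme mentioned above. If both
# operands fit in 16 bits, every term except a_lo * b_lo is zero
# and the corresponding passes through the tree could be skipped.
MASK16 = 0xFFFF

def mul32_by_halves(a, b):
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    p0 = a_lo * b_lo                 # weight 2^0
    p1 = a_lo * b_hi + a_hi * b_lo   # weight 2^16
    p2 = a_hi * b_hi                 # weight 2^32
    return p0 + (p1 << 16) + (p2 << 32)
```

Note the trade both posters describe: the tree is half the size, but
large operands now take multiple passes, so total switched energy per
full-width multiply can go up.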
It is 100% of the time in FP codes, and generally unknowable in
integer codes.

> My 66000's CARRY and PRED are "extender prefixes", admittedly
> included in the original architecture to compensate for encoding
> constraints (e.g., not having 36-bit instruction parcels) rather
> than microarchitectural or architectural variation.

Since they cast extra bits over a number of instructions, and since
they precede the instructions they modify, they are not classical
prefixes--so I use the term instruction-modifier instead.

> [snip]
>>> (I feel that encoding some of the dependency information could
>>> be useful to avoid some of this work. In theory, common
>>> dependency detection could also be more broadly useful; e.g.,
>>> operand availability detection and execution/operand routing.)
>>
>> So useful that it is encoded directly in My 66000 ISA.
>
> How so? My 66000 does not provide any explicit declaration of what
> operation will be using a result (or where an operand is being
> sourced from). Register names express the dependencies so the
> dataflow graph is implicit.

I was talking about how operand routing is explicitly described in
the ISA--which is mainly about how constants override register file
reads by the time operands get to the calculation unit.

> I was speculating that _knowing_ when an operand will be available
> and where a result should be sent (rather than broadcasting) could
> be useful information.

It is easier to record which FU will deliver a result; the "when"
part is simply a pipeline sequencer from the end of a FU to the
entries in the reservation station.

>>> Even with reduced operations per cycle, fusion could still
>>> provide a net energy benefit.
>>
>> Here I disagree:: but for a different reason::
>>
>> In order for RISC-V to use a 64-bit constant as an operand, it
>> has to execute either:: AUIPC+LD from an area of memory containing
>> the 64-bit constant, or a 6-7 instruction stream to build the
>> constant inline.
>> While an ISA that directly supports 64-bit constants
>> does not execute any of those.
>>
>> Thus, while it may save power when seen at the "it's my ISA"
>> level, when seen from the perspective of "it is directly
>> supported in my ISA" it wastes power.
>
> Yes, but since "computing" large immediates is obviously less
> efficient (except for compression), the computation part is known
> to be unnecessary. Fusing a comparison and a branch may be a
> consequence of bad ISA design in not properly estimating how much
> work an instruction can do (and be encoded in available space),
> and there is excess decode overhead with separate instructions,
> but the individual operations seem to be doing actual work.
>
> I suspect there can be cases where different microarchitectures
> would benefit from different amounts of instruction/operation
> complexity such that cracking and/or fusion may be useful even in
> an optimally designed generic ISA.
>
> [snip]
>>>> - register specifier fields are either source or dest, never both
>>>
>>> This seems mostly a code density consideration. I think using a
>>> single name for both a source and a destination is not so
>>> horrible, but I am not a hardware guy.
>>
>> All we HW guys want is that wherever a field is specified,
>> it is specified in exactly 1 field in the instruction. So, if a
>> field is used to specify Rd in one instruction, no other field
>> specifies the Rd register. RISC-V blew this "requirement".
>
> Only with the Compressed extension, I think. The Compressed
> extension was somewhat rushed and, in my opinion, philosophically
> flawed by being redundant (i.e., every C instruction can be
> expanded to a non-C instruction). Things like My 66000's ENTER
> provide code density benefits but are contrary to the simplicity
> emphasis.
> Perhaps a Rho (density) extension would have been
> better.☺ (The extension letter idea was interesting for an
> academic ISA but has been clearly shown to be seriously flawed.)

The R in RISC-V does not represent REDUCED.

> 16-bit instructions could have kept the same register field
> placements with masking/truncation for two-register-field
> instructions.

The whole layout of the ISA is sloppy...

> Even a non-destructive form might be provided by
> different masking or bit inversion for the destination. However,
> providing three register fields seems to require significant
> irregularity in extracting register names. (Another technique
> would be using opcode bits for specifying part or all of a
> register name. Some special purpose registers or groups of
> registers may not be horrible for compiler register allocation,
> but such seems rather funky/clunky.)
>
> It is interesting that RISC-V chose to split the immediate field
> for store instructions so that source register names would be in
> the same place for all (non-C) instructions.

Lipstick on a pig.

> Comparing an ISA design to RISC-V is not exactly the same as

========== REMAINDER OF ARTICLE TRUNCATED ==========