From: "Paul A. Clayton" <paaronclayton@gmail.com>
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Tue, 24 Sep 2024 22:49:07 -0400
Organization: A noiseless patient Spider
Message-ID: <vd6lp6$prfn$1@dont-email.me>
References: <vaqgtl$3526$1@dont-email.me>
 <p1cvdjpqjg65e6e3rtt4ua6hgm79cdfm2n@4ax.com>
 <2024Sep10.101932@mips.complang.tuwien.ac.at> <ygn8qvztf16.fsf@y.z>
 <2024Sep11.123824@mips.complang.tuwien.ac.at>
 <vbsoro$3ol1a$1@dont-email.me> <867cbhgozo.fsf@linuxsc.com>
 <20240912142948.00002757@yahoo.com> <vbuu5n$9tue$1@dont-email.me>
 <20240915001153.000029bf@yahoo.com> <vc6jbk$5v9f$1@paganini.bofh.team>
 <20240915154038.0000016e@yahoo.com> <vc70sl$285g2$4@dont-email.me>
 <vc73bl$28v0v$1@dont-email.me> <OvEFO.70694$EEm7.38286@fx16.iad>
 <32a15246310ea544570564a6ea100cab@www.novabbs.org>
 <vc7a6h$2afrl$2@dont-email.me>
 <50cd3ba7c0cbb587a55dd67ae46fc9ce@www.novabbs.org>
 <vc8qic$2od19$1@dont-email.me> <fCXFO.4617$9Rk4.4393@fx37.iad>
 <vcb730$3ci7o$1@dont-email.me> <7cBGO.169512$_o_3.43954@fx17.iad>
 <vcffub$77jk$1@dont-email.me> <n7XGO.89096$15a6.87061@fx12.iad>
 <vcpvhs$2bgj0$1@dont-email.me>
In-Reply-To: <f627965321601850d61541eca2412c88@www.novabbs.org>

On 9/22/24 6:19 PM, MitchAlsup1 wrote:
> On Sun, 22 Sep 2024 20:43:38 +0000, Paul A. Clayton wrote:
>
>> On 9/19/24 11:07 AM, EricP wrote:
>> [snip]
>>> If the multiplier is pipelined with a latency of 5 and throughput
>>> of 1,
>>> then MULL takes 5 cycles and MULL,MULH takes 6.
>>>
>>> But those two multiplies still are tossing away 50% of their work.
>>
>> I do not remember how multipliers are actually implemented — and
>> am not motivated to refresh my memory at the moment — but I
>> thought a multiply low would not need to generate the upper bits,
>> so I do not understand where your "50% of their work" is coming
>> from.
>
>      +-----------+      +------------+
>       \  mplier /        \  mcand  /          Big input mux
>        +--------+          +--------+
>            |                    |
>            |           +--------------+
>            |          /              /
>            |         /              /
>            +--      /              /
>                    /     Tree     /
>                   /              /--+
>                  /              /   |
>                 /              /    |
>                +---------------+-----------+
>                     hi              low          Products
>
> two n-bit operands are multiplied into a 2×n-bit result.
> {{All the rest is HOW not what}}

So are you saying the high bits come for free? This seems contrary
to the conception of sums of partial products, where some of the
partial products are only needed for the upper bits and so could
(it seems to me) be uncalculated if one only wanted the lower bits.

>> The high result needs the low result carry-out but not the rest of
>> the result. (An approximate multiply high for multiply by
>> reciprocal might be useful, avoiding the low result work. There
>> might also be ways that a multiplier could be configured to also
>> provide bit mixing similar to middle result for generating a
>> hash?)
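
To put a rough number on EricP's "50%" above, here is a toy
partial-product count (a sketch in C, not anyone's actual
multiplier; the reduction tree and carry handling are waved away).
Only the partial-product bits in columns of weight below 2^n can
affect the low n result bits, which for n=64 is 2080 of 4096, just
over half; the high half needs the remaining partial products plus
the carries out of the low columns.

#include <stdio.h>

int main(void)
{
    const int n = 64;             /* operand width in bits */
    long low_only = 0, all = 0;

    /* Each 1-bit partial product a_i*b_j has weight 2^(i+j).  Only
       columns with i+j < n feed the low n bits (MULL); the columns
       with i+j >= n exist solely for the high half (MULH), which
       also needs the carries coming out of the low columns. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            all++;
            if (i + j < n)
                low_only++;
        }

    printf("%ld of %ld partial products feed the low %d bits\n",
           low_only, all, n);     /* 2080 of 4096 for n=64 */
    return 0;
}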
>>
>> I seem to recall a PowerPC implementation did semi-pipelined 32-
>> bit multiplication 16-bits at a time. This presumably saved area
>> and power
>
> You save 1/2 of the tree area, but ultimately consume more power.

The power consumption would seem to depend on how frequently both
multiplier and multiplicand are larger than 16 bits. (However, I
seem to recall that the mentioned implementation only checked one
operand.) I suspect that for a lot of code, small values are
common.

There might also be some benefits in special-casing small values if
the multiplier supports SIMD. Small values can use substantially
fewer physical resources for multiplication, and if the multiplier
is already designed to handle multiple parallel/SIMD small
multiplies, squeezing in another scalar multiply may be
possible/practical (assuming the communication of the values is not
problematic).

>> while also facilitating early out for small
>> multiplicands,
>
> Dadda showed that doubling the size of the tree only adds one
> 4-2 compressor delay to the whole calculation.

Interesting.

[snip]
>>> <sound of soap box being dragged out>
>>> This idea that macro-op fusion is some magic solution is bullshit.
>
> The argument is, at best, of Academic Quality, made by a student
> at the time as a way to justify RISC-V not having certain easy
> for HW to perform calculations.

The RISC-V published argument for fusion is not great, but fusion
(and cracking/fission) seem like natural architectural mechanisms
*if* one is stuck with binary compatibility.

>>> 1) It's not free.
>>
>> Neither is increasing the number of opcodes or providing extender
>> prefixes. If one wants binary compatibility, non-fusing
>> implementations would work.
>
> I did neither and avoided both.

My 66000's CARRY and PRED are "extender prefixes", admittedly
included in the original architecture and so compensating for
encoding constraints (e.g., not having 36-bit instruction parcels)
rather than for microarchitectural or architectural variation.

[snip]
>> (I feel that encoding some of the dependency information could
>> be useful to avoid some of this work. In theory, common
>> dependency detection could also be more broadly useful; e.g.,
>> operand availability detection and execution/operand routing.)
>
> So useful that it is encoded directly in My 66000 ISA.

How so? My 66000 does not provide any explicit declaration of which
operation will be using a result (or of where an operand is being
sourced from). Register names express the dependencies, so the
dataflow graph is implicit.

I was speculating that _knowing_ when an operand will be available
and where a result should be sent (rather than broadcasting it)
could be useful information. Classic transport-triggered
architectures do this but do not integrate dynamic scheduling and
do not handle multiple uses well (the awkwardness of delayed use
seems connected to both of these aspects).

While such information can be cached for operation networks that
are revisited with reasonable temporal locality, discovering
optimization opportunities dynamically carries the risk of the work
not being used (similar to prefetching). Bloating the communication
of "what to do" also adds cost, so early and more persistent
(compile time) caching of such information may not actually be
helpful.

>>> 5) Any fused instructions leave (multiple) bubbles that should be
>>> compacted out or there wasn't much point to doing the fusion.
>>
>> Even with reduced operations per cycle, fusion could still provide
>> a net energy benefit.
>
> Here I disagree:: but for a different reason::
>
> In order for RISC-V to use a 64-bit constant as an operand, it has
> to execute either:: AUPIC-LD to an area of memory containing the
> 64-bit constant, or a 6-7 instruction stream to build the constant
> inline. While an ISA that directly supports 64-bit constants in ISA
> does not execute any of those.
>
> Thus, while it may save power seen at the "its my ISA" level it
> may save power, but when seem from the perspective of "it is
> directly supported in my ISA" it wastes power.

Yes, but "computing" large immediates is obviously less efficient
(except for compression): the computation part is known to be
unnecessary work. Fusing a comparison and a branch may be a
consequence of bad ISA design (not properly estimating how much
work an instruction can do and still be encoded in the available
space), and there is excess decode overhead with separate
instructions, but the individual operations seem to be doing actual
work. I suspect there can be cases where different
microarchitectures would benefit from different amounts of
instruction/operation complexity, such that cracking and/or fusion
may be useful even in an optimally designed generic ISA.

[snip]
>>> - register specifier fields are either source or dest, never both
>>
>> This seems mostly a code density consideration. I think using a

========== REMAINDER OF ARTICLE TRUNCATED ==========
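
To make the 64-bit constant point above concrete, the inline
expansion has roughly this shape (a schematic sketch in C, not the
actual lui/addi/slli sequence a real RV64 compiler emits, and the
constant is just a placeholder):

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

/* Build an arbitrary 64-bit constant from small immediate pieces,
   roughly the work an ISA without inline 64-bit constants has to
   do: one small load-immediate plus three shift-and-or steps, about
   seven ALU operations.  An ISA carrying 64-bit constants in the
   instruction stream does the same thing in a single instruction. */
static uint64_t build_const(void)
{
    uint64_t x = 0x1234;            /* small immediate            */
    x = (x << 16) | 0x5678;         /* shift, or in 16 more bits  */
    x = (x << 16) | 0x9ABC;
    x = (x << 16) | 0xDEF0;
    return x;                       /* 0x123456789ABCDEF0         */
}

int main(void)
{
    printf("%" PRIx64 "\n", build_const());
    return 0;
}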