Path: ...!2.eu.feeder.erje.net!feeder.erje.net!news.swapon.de!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: "Paul A. Clayton" <paaronclayton@gmail.com>
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Tue, 24 Sep 2024 22:49:07 -0400
Organization: A noiseless patient Spider
Lines: 198
Message-ID: <vd6lp6$prfn$1@dont-email.me>
References: <vaqgtl$3526$1@dont-email.me>
 <p1cvdjpqjg65e6e3rtt4ua6hgm79cdfm2n@4ax.com>
 <2024Sep10.101932@mips.complang.tuwien.ac.at> <ygn8qvztf16.fsf@y.z>
 <2024Sep11.123824@mips.complang.tuwien.ac.at> <vbsoro$3ol1a$1@dont-email.me>
 <867cbhgozo.fsf@linuxsc.com> <20240912142948.00002757@yahoo.com>
 <vbuu5n$9tue$1@dont-email.me> <20240915001153.000029bf@yahoo.com>
 <vc6jbk$5v9f$1@paganini.bofh.team> <20240915154038.0000016e@yahoo.com>
 <vc70sl$285g2$4@dont-email.me> <vc73bl$28v0v$1@dont-email.me>
 <OvEFO.70694$EEm7.38286@fx16.iad>
 <32a15246310ea544570564a6ea100cab@www.novabbs.org>
 <vc7a6h$2afrl$2@dont-email.me>
 <50cd3ba7c0cbb587a55dd67ae46fc9ce@www.novabbs.org>
 <vc8qic$2od19$1@dont-email.me> <fCXFO.4617$9Rk4.4393@fx37.iad>
 <vcb730$3ci7o$1@dont-email.me> <7cBGO.169512$_o_3.43954@fx17.iad>
 <vcffub$77jk$1@dont-email.me> <n7XGO.89096$15a6.87061@fx12.iad>
 <vcpvhs$2bgj0$1@dont-email.me>
 <f627965321601850d61541eca2412c88@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 27 Sep 2024 18:16:38 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="26631ab958218f48ba56705301f2f9fe";
	logging-data="847351"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/IRQi0EINN/01CwIHYt0tPgC/u7F2E4gU="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.0
Cancel-Lock: sha1:SMLE9XqlDWbawZdHhN4oCjZoZZQ=
In-Reply-To: <f627965321601850d61541eca2412c88@www.novabbs.org>
Bytes: 11068

On 9/22/24 6:19 PM, MitchAlsup1 wrote:
> On Sun, 22 Sep 2024 20:43:38 +0000, Paul A. Clayton wrote:
> 
>> On 9/19/24 11:07 AM, EricP wrote:
>> [snip]
>>> If the multiplier is pipelined with a latency of 5 and throughput
>>> of 1,
>>> then MULL takes 5 cycles and MULL,MULH takes 6.
>>>
>>> But those two multiplies still are tossing away 50% of their work.
>>
>> I do not remember how multipliers are actually implemented — and
>> am not motivated to refresh my memory at the moment — but I
>> thought a multiply low would not need to generate the upper bits,
>> so I do not understand where your "50% of their work" is coming
>> from.
> 
>      +-----------+   +------------+
>      \  mplier  /     \   mcand  /        Big input mux
>       +--------+       +--------+
>            |                |
>            |      +--------------+
>            |     /               /
>            |    /               /
>            +-- /               /
>               /     Tree      /
>              /               /--+
>             /               /   |
>            /               /    |
>           +---------------+-----------+
>                 hi             low        Products
> 
> two n-bit operands are multiplied into a 2×n-bit result.
> {{All the rest is HOW not what}}

So are you saying the high bits come for free? This seems
contrary to the conception of sums of partial products, where
some of the partial products are only needed for the upper bits
and so (it seems to me) could be left uncomputed if one only
wanted the lower bits.
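A rough way to weigh the claim either way is to count partial-product
bits. In a simple array/tree multiplier every pair of operand bit
positions contributes one partial-product bit, and a low-only product
needs only those whose weight lands in the low n bits (carries
propagate upward, never downward). A small Python sketch of that count
(a combinatorial model, not a gate-level one):

```python
def pp_bits_full(n):
    # an n x n multiplier forms one partial-product bit for
    # every pair of operand bit positions: n * n of them
    return n * n

def pp_bits_low_only(n):
    # a partial-product bit of weight i + j can influence the low n
    # result bits only if i + j < n (carries go up, not down)
    return sum(1 for i in range(n) for j in range(n) if i + j < n)

n = 64
print(pp_bits_full(n), pp_bits_low_only(n))   # -> 4096 2080
```

By this crude count a low-only multiply needs n(n+1)/2 of the n^2
partial-product bits, i.e. just over half, which is at least
consistent with the "50% of their work" figure for a full-width tree.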

>> The high result needs the low result carry-out but not the rest of
>> the result. (An approximate multiply high for multiply by
>> reciprocal might be useful, avoiding the low result work. There
>> might also be ways that a multiplier could be configured to also
>> provide bit mixing similar to middle result for generating a
>> hash?)
>>
>> I seem to recall a PowerPC implementation did semi-pipelined 32-
>> bit multiplication 16-bits at a time. This presumably saved area
>> and power
> 
> You save 1/2 of the tree area, but ultimately consume more power.

The power consumption would seem to depend on how frequently both
multiplier and multiplicand are larger than 16 bits. (However, I
seem to recall that the mentioned implementation only checked one
operand.) I suspect that for a lot of code, small values are
common.

There might also be some benefit in special-casing small values
if the multiplier supports SIMD. Small values need substantially
fewer physical resources for multiplication, and if the
multiplier is already designed to handle multiple parallel/SIMD
small multiplies, squeezing in another scalar multiply may be
practical (assuming communicating the values is not
problematic).
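The 16-bits-at-a-time scheme and the small-value early out can be
sketched together. This is an arithmetic model of the decomposition,
not the actual PowerPC design, and the early-out condition here checks
both operands (the recalled implementation checked only one):

```python
MASK16 = 0xFFFF

def mul32_by_16(a, b):
    """Compute a 32x32 -> 64-bit product from 16x16 pieces.
    Returns (product, number_of_16x16_passes_used)."""
    ah, al = a >> 16, a & MASK16
    bh, bl = b >> 16, b & MASK16
    if ah == 0 and bh == 0:
        return al * bl, 1          # early out: one 16x16 pass suffices
    # general case: four 16x16 products, shifted and summed
    return (ah * bh << 32) + ((ah * bl + al * bh) << 16) + al * bl, 4
```

A semi-pipelined unit would issue those (up to) four 16x16 passes
through one half-width tree, trading a cycle or three for half the
tree area.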

>>           while also facilitating early out for small
>> multiplicands,
> 
> Dadda showed that doubling the size of the tree only adds one
> 4-2 compressor delay to the whole calculation.

Interesting.

[snip]
>>> <sound of soap box being dragged out>
>>> This idea that macro-op fusion is some magic solution is bullshit.
> 
> The argument is, at best, of Academic Quality, made by a student
> at the time as a way to justify RISC-V not having certain easy
> for HW to perform calculations.

The RISC-V published argument for fusion is not great, but fusion
(and cracking/fission) seem natural architectural mechanisms *if*
one is stuck with binary compatibility.
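As a concrete (if toy) illustration of the mechanism: a decoder-side
fusion pass only has to pattern-match adjacent operations. The op
tuples and mnemonics below are made up for the sketch, not any real
decoder's representation:

```python
def fuse(ops):
    """Toy macro-op fusion: merge an adjacent compare + conditional
    branch into one internal op when the branch tests the compare's
    destination. Ops are (name, dest_or_test, src...) tuples."""
    out, i = [], 0
    while i < len(ops):
        if (i + 1 < len(ops)
                and ops[i][0] == "cmp"
                and ops[i + 1][0] == "beq"
                and ops[i + 1][1] == ops[i][1]):   # beq tests cmp's dest
            # fused op carries the compare sources and the branch target
            out.append(("cmp_beq", ops[i][2], ops[i][3], ops[i + 1][2]))
            i += 2
        else:
            out.append(ops[i])
            i += 1
    return out
```

The catch the thread is circling is that this matching is not free: it
widens the decode window, and a real pass must also check that the
intermediate result (here the compare destination) is dead afterward.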

>>> 1) It's not free.
>>
>> Neither is increasing the number of opcodes or providing extender
>> prefixes. If one wants binary compatibility, non-fusing
>> implementations would work.
> 
> I did neither and avoided both.

My 66000's CARRY and PRED are "extender prefixes", admittedly
included in the original architecture, so they compensate for
encoding constraints (e.g., not having 36-bit instruction
parcels) rather than for microarchitectural or architectural
variation.

[snip]
>> (I feel that encoding some of the dependency information could
>> be useful to avoid some of this work. In theory, common
>> dependency detection could also be more broadly useful; e.g.,
>> operand availability detection and execution/operand routing.)
> 
> So useful that it is encoded directly in My 66000 ISA.

How so? My 66000 does not provide any explicit declaration of
which operation will use a result (or where an operand is
sourced from). Register names express the dependencies, so the
dataflow graph is implicit.

I was speculating that _knowing_ when an operand will be available
and where a result should be sent (rather than broadcasting) could
be useful information. Classic transport-triggered architectures
do this but do not integrate dynamic scheduling and do not handle
multiple use well (the awkwardness of delayed use seems
connected to both of these aspects).

While such information can be cached for operation networks that
are revisited with reasonable temporal locality, dynamically
discovered optimization opportunities risk going unused (similar
to prefetching). Bloating the communication of "what to do" also
adds cost, so early and more persistent (compile-time) caching
of such information may not actually be helpful.

>>> 5) Any fused instructions leave (multiple) bubbles that should be
>>>     compacted out or there wasn't much point to doing the fusion.
>>
>> Even with reduced operations per cycle, fusion could still provide
>> a net energy benefit.
> 
> Here I disagree:: but for a different reason::
> 
> In order for RISC-V to use a 64-bit constant as an operand, it has
> to execute either::  AUPIC-LD to an area of memory containing the
> 64-bit constant, or a 6-7 instruction stream to build the constant
> inline. While an ISA that directly supports 64-bit constants in ISA
> does not execute any of those.
> 
> Thus, while it may save power seen at the "its my ISA" level it
> may save power, but when seem from the perspective of "it is
> directly supported in my ISA" it wastes power.

Yes, but "computing" large immediates is obviously less
efficient (except for compression): the computation part is
known to be unnecessary. Fusing a comparison and a branch may be
a consequence of bad ISA design in not properly estimating how
much work an instruction can do (and be encoded in the available
space), and there is excess decode overhead with separate
instructions, but the individual operations seem to be doing
actual work.
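For scale, materializing an arbitrary 64-bit constant by
shift-and-OR chunks takes on the order of seven dependent
operations. The mnemonics and 16-bit chunking below are illustrative
only (real RV64 codegen uses lui/addi with 20- and 12-bit immediates,
and the count varies with the value):

```python
def materialize(value):
    """Sketch: build a 64-bit constant 16 bits at a time.
    Returns (value_built, list_of_op_strings)."""
    ops = []
    acc = (value >> 48) & 0xFFFF
    ops.append("li   r1, %#x" % acc)           # load top 16-bit chunk
    for shift in (32, 16, 0):
        chunk = (value >> shift) & 0xFFFF
        acc = (acc << 16) | chunk
        ops.append("slli r1, r1, 16")          # make room for next chunk
        ops.append("ori  r1, r1, %#x" % chunk) # OR in the next 16 bits
    return acc, ops
```

Seven dependent operations versus zero extra for an ISA with
full-width immediates: the power argument above is about exactly
that gap, and fusion can at best hide part of it.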

I suspect there can be cases where different microarchitectures
would benefit from different amounts of instruction/operation
complexity such that cracking and/or fusion may be useful even in
an optimally designed generic ISA.

[snip]
>>> - register specifier fields are either source or dest, never both
>>
>> This seems mostly a code density consideration. I think using a
========== REMAINDER OF ARTICLE TRUNCATED ==========