From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Fri, 27 Sep 2024 18:01:40 +0000
Organization: Rocksolid Light
Message-ID:
References: <867cbhgozo.fsf@linuxsc.com> <20240912142948.00002757@yahoo.com> <20240915001153.000029bf@yahoo.com> <20240915154038.0000016e@yahoo.com> <32a15246310ea544570564a6ea100cab@www.novabbs.org> <50cd3ba7c0cbb587a55dd67ae46fc9ce@www.novabbs.org> <7cBGO.169512$_o_3.43954@fx17.iad>

On Wed, 25 Sep 2024 2:49:07 +0000, Paul A. Clayton wrote:

> On 9/22/24 6:19 PM, MitchAlsup1 wrote:
>> On Sun, 22 Sep 2024 20:43:38 +0000, Paul A. Clayton wrote:
>>
>>> On 9/19/24 11:07 AM, EricP wrote:
>>> [snip]
>>>> If the multiplier is pipelined with a latency of 5 and a
>>>> throughput of 1, then MULL takes 5 cycles and MULL,MULH takes 6.
>>>>
>>>> But those two multiplies are still tossing away 50% of their work.
>>>
>>> I do not remember how multipliers are actually implemented — and
>>> am not motivated to refresh my memory at the moment — but I
>>> thought a multiply low would not need to generate the upper bits,
>>> so I do not understand where your "50% of their work" is coming
>>> from.
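One back-of-the-envelope way to see where EricP's "50%" comes from (a
sketch only, not a model of anyone's actual hardware): a partial-product
bit a_i AND b_j lands at weight i + j, and only the bits with i + j < n
can influence the low n-bit result — roughly half of the n^2 partial
products.

```python
# Counting partial-product bits for an n x n multiplier array.
# A bit a_i AND b_j has weight i + j; only weights below n can
# affect the low half of the product. Hypothetical helper name;
# this is a counting argument, not a hardware description.
def partial_product_counts(n):
    total = n * n
    low_half = sum(1 for i in range(n) for j in range(n) if i + j < n)
    return total, low_half

total, low_half = partial_product_counts(64)
# n = 64: 4096 partial-product bits in all; 2080 (about half,
# n*(n+1)/2) suffice to form the low 64 bits of the product.
```

So a dedicated multiply-low could in principle omit about half the
partial-product array — which is exactly the tension in the exchange
above, since a combined MULL/MULH unit computes all of it anyway.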
>>
>>     +-----------+   +------------+
>>     \  mplier  /     \   mcand  /        Big input mux
>>      +--------+       +--------+
>>           |                |
>>           |      +--------------+
>>           |     /               /
>>           |    /               /
>>           +-- /               /
>>              /     Tree      /
>>             /               /--+
>>            /               /   |
>>           /               /    |
>>          +---------------+-----------+
>>                hi             low        Products
>>
>> Two n-bit operands are multiplied into a 2×n-bit result.
>> {{All the rest is HOW, not WHAT}}
>
> So are you saying the high bits come for free? This seems
> contrary to the conception of sums of partial products, where
> some of the partial products are only needed for the upper bits
> and so could (it seems to me) be left uncalculated if one only
> wanted the lower bits.

The high-order bits are free WRT gates of delay, but consume as
much area as the low-order bits. I was answering the question of
"I do not remember how multipliers are actually implemented".

>>> The high result needs the low result carry-out but not the rest
>>> of the result. (An approximate multiply high for multiply by
>>> reciprocal might be useful, avoiding the low result work. There
>>> might also be ways that a multiplier could be configured to also
>>> provide bit mixing similar to a middle result for generating a
>>> hash?)
>>>
>>> I seem to recall a PowerPC implementation did semi-pipelined
>>> 32-bit multiplication 16 bits at a time. This presumably saved
>>> area and power.
>>
>> You save 1/2 of the tree area, but ultimately consume more power.
>
> The power consumption would seem to depend on how frequently both
> multiplier and multiplicand are larger than 16 bits. (However, I
> seem to recall that the mentioned implementation only checked one
> operand.) I suspect that for a lot of code, small values are
> common.
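The 16-bits-at-a-time scheme discussed above can be sketched as
follows (a hedged illustration — the function name and structure are
mine, not the PowerPC implementation): a 32x32 multiply decomposes
into 16x16 pieces, and when both high halves are zero most of the
pieces contribute nothing, which is where the data-dependent power
saving would come from.

```python
# 32x32 -> 64 multiply assembled from 16x16 -> 32 pieces, in the
# spirit of the semi-pipelined scheme mentioned above. If both
# operands fit in 16 bits, every term except a_lo * b_lo is zero
# and the corresponding passes through the tree could be skipped.
MASK16 = 0xFFFF

def mul32_by_halves(a, b):
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    p0 = a_lo * b_lo                 # weight 2^0
    p1 = a_lo * b_hi + a_hi * b_lo   # weight 2^16
    p2 = a_hi * b_hi                 # weight 2^32
    return p0 + (p1 << 16) + (p2 << 32)
```

Note the trade both posters describe: the tree is half the size, but
large operands now take multiple passes, so total switched energy per
full-width multiply can go up.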
It is 100% of the time in FP codes, and generally unknowable in
integer codes.

> My 66000's CARRY and PRED are "extender prefixes", admittedly
> included in the original architecture to compensate for encoding
> constraints (e.g., not having 36-bit instruction parcels) rather
> than microarchitectural or architectural variation.

Since they cast extra bits over a number of instructions, and since
they precede the instructions they modify, they are not classical
prefixes--so I use the term instruction-modifier instead.

> [snip]
>>> (I feel that encoding some of the dependency information could
>>> be useful to avoid some of this work. In theory, common
>>> dependency detection could also be more broadly useful; e.g.,
>>> operand availability detection and execution/operand routing.)
>>
>> So useful that it is encoded directly in My 66000 ISA.
>
> How so? My 66000 does not provide any explicit declaration of what
> operation will be using a result (or where an operand is being
> sourced from). Register names express the dependencies so the
> dataflow graph is implicit.

I was talking about how operand routing is explicitly described in
the ISA--which is mainly about how constants override register file
reads by the time operands get to the calculation unit.

> I was speculating that _knowing_ when an operand will be available
> and where a result should be sent (rather than broadcasting) could
> be useful information.

It is easier to record which FU will deliver a result; the "when"
part is simply a pipeline sequencer from the end of a FU to the
entries in the reservation station.

>>> Even with reduced operations per cycle, fusion could still
>>> provide a net energy benefit.
>>
>> Here I disagree:: but for a different reason::
>>
>> In order for RISC-V to use a 64-bit constant as an operand, it
>> has to execute either:: AUIPC+LD from an area of memory containing
>> the 64-bit constant, or a 6-7 instruction stream to build the
>> constant inline.
>> While an ISA that directly supports 64-bit constants
>> does not execute any of those.
>>
>> Thus, while it may save power when seen at the "it's my ISA"
>> level, when seen from the perspective of "it is directly
>> supported in my ISA" it wastes power.
>
> Yes, but since "computing" large immediates is obviously less
> efficient (except for compression), the computation part is known
> to be unnecessary. Fusing a comparison and a branch may be a
> consequence of bad ISA design in not properly estimating how much
> work an instruction can do (and be encoded in available space),
> and there is excess decode overhead with separate instructions,
> but the individual operations seem to be doing actual work.
>
> I suspect there can be cases where different microarchitectures
> would benefit from different amounts of instruction/operation
> complexity such that cracking and/or fusion may be useful even in
> an optimally designed generic ISA.
>
> [snip]
>>>> - register specifier fields are either source or dest, never both
>>>
>>> This seems mostly a code density consideration. I think using a
>>> single name for both a source and a destination is not so
>>> horrible, but I am not a hardware guy.
>>
>> All we HW guys want is that wherever a field is specified,
>> it is specified in exactly 1 field in the instruction. So, if a
>> field is used to specify Rd in one instruction, no other field
>> specifies the Rd register. RISC-V blew this "requirement".
>
> Only with the Compressed extension, I think. The Compressed
> extension was somewhat rushed and, in my opinion, philosophically
> flawed by being redundant (i.e., every C instruction can be
> expanded to a non-C instruction). Things like My 66000's ENTER
> provide code density benefits but are contrary to the simplicity
> emphasis.
> Perhaps a Rho (density) extension would have been
> better.☺ (The extension letter idea was interesting for an
> academic ISA but has been clearly shown to be seriously flawed.)

The R in RISC-V does not represent REDUCED.

> 16-bit instructions could have kept the same register field
> placements with masking/truncation for two-register-field
> instructions.

The whole layout of the ISA is sloppy...

> Even a non-destructive form might be provided by
> different masking or bit inversion for the destination. However,
> providing three register fields seems to require significant
> irregularity in extracting register names. (Another technique
> would be using opcode bits for specifying part or all of a
> register name. Some special purpose registers or groups of
> registers may not be horrible for compiler register allocation,
> but such seems rather funky/clunky.)
>
> It is interesting that RISC-V chose to split the immediate field
> for store instructions so that source register names would be in
> the same place for all (non-C) instructions.

Lipstick on a pig.

> Comparing an ISA design to RISC-V is not exactly the same as

========== REMAINDER OF ARTICLE TRUNCATED ==========