Path: ...!2.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB-Alt <bohannonindustriesllc@gmail.com>
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Mon, 23 Sep 2024 18:16:08 -0500
Organization: A noiseless patient Spider
Lines: 257
Message-ID: <vcssro$2t2ms$1@dont-email.me>
References: <vaqgtl$3526$1@dont-email.me>
 <p1cvdjpqjg65e6e3rtt4ua6hgm79cdfm2n@4ax.com>
 <2024Sep10.101932@mips.complang.tuwien.ac.at> <ygn8qvztf16.fsf@y.z>
 <2024Sep11.123824@mips.complang.tuwien.ac.at> <vbsoro$3ol1a$1@dont-email.me>
 <867cbhgozo.fsf@linuxsc.com> <20240912142948.00002757@yahoo.com>
 <vbuu5n$9tue$1@dont-email.me> <20240915001153.000029bf@yahoo.com>
 <vc6jbk$5v9f$1@paganini.bofh.team> <20240915154038.0000016e@yahoo.com>
 <vc70sl$285g2$4@dont-email.me> <vc73bl$28v0v$1@dont-email.me>
 <OvEFO.70694$EEm7.38286@fx16.iad>
 <32a15246310ea544570564a6ea100cab@www.novabbs.org>
 <vc7a6h$2afrl$2@dont-email.me>
 <50cd3ba7c0cbb587a55dd67ae46fc9ce@www.novabbs.org>
 <vc8qic$2od19$1@dont-email.me> <fCXFO.4617$9Rk4.4393@fx37.iad>
 <vcb730$3ci7o$1@dont-email.me> <7cBGO.169512$_o_3.43954@fx17.iad>
 <vcffub$77jk$1@dont-email.me> <n7XGO.89096$15a6.87061@fx12.iad>
 <vcpvhs$2bgj0$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 24 Sep 2024 01:16:09 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="e245698bb5d2b6b616652557cfecad09";
	logging-data="3050204"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19SzBitr5m+HkZ3074pOUPfSiWAJ6eaUgk="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:AYUAbip09zk2brevvgOwmx8IMaA=
In-Reply-To: <vcpvhs$2bgj0$1@dont-email.me>
Content-Language: en-US
Bytes: 12265

On 9/22/2024 3:43 PM, Paul A. Clayton wrote:
> On 9/19/24 11:07 AM, EricP wrote:
> [snip]
>> If the multiplier is pipelined with a latency of 5 and throughput of 1,
>> then MULL takes 5 cycles and MULL,MULH takes 6.
>>
>> But those two multiplies still are tossing away 50% of their work.
> 
> I do not remember how multipliers are actually implemented — and
> am not motivated to refresh my memory at the moment — but I
> thought a multiply low would not need to generate the upper bits,
> so I do not understand where your "50% of their work" is coming
> from.
> 
> The high result needs the low result carry-out but not the rest of
> the result. (An approximate multiply high for multiply by
> reciprocal might be useful, avoiding the low result work. There
> might also be ways that a multiplier could be configured to also
> provide bit mixing similar to middle result for generating a
> hash?)
> 

I guess it might be interesting if one made a bigger multiplier out of 
4-bit multipliers, in a way similar to a 4-bit shift-add.

Could in theory do a 64-bit multiply in ~16 cycles this way (64 bits at 
4 bits per step)...

But, probably couldn't use this for divide.
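
As a rough sketch of the 4-bit-at-a-time idea (software model only, not a 
hardware design): each outer step stands in for one cycle and consumes one 
4-bit digit of the multiplier. The names mul4x4 and mul64_radix16 are made 
up for illustration, and the row of sixteen 4x4 cells per step is an 
assumption about how the partial product would be formed.

```c
#include <stdint.h>

/* Hypothetical 4x4->8 multiplier cell, standing in for a small hardware block. */
static uint8_t mul4x4(uint8_t a, uint8_t b) {
    return (uint8_t)((a & 0xF) * (b & 0xF));
}

/* Low 64 bits of a*b, consuming one 4-bit digit of b per step (16 steps). */
uint64_t mul64_radix16(uint64_t a, uint64_t b) {
    uint64_t acc = 0;
    for (int step = 0; step < 16; step++) {
        uint8_t digit = (uint8_t)((b >> (4 * step)) & 0xF);

        /* Partial product a*digit, built from one 4x4 cell per nibble of a. */
        uint64_t partial = 0;
        for (int n = 0; n < 16; n++) {
            uint8_t an = (uint8_t)((a >> (4 * n)) & 0xF);
            partial += (uint64_t)mul4x4(an, digit) << (4 * n);
        }

        /* Shift-add into the accumulator; bits above 63 fall off. */
        acc += partial << (4 * step);
    }
    return acc;
}
```

In hardware the inner loop would be a row of parallel cells plus an adder 
tree, so only the outer loop costs cycles.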


> I seem to recall a PowerPC implementation did semi-pipelined 32-
> bit multiplication 16-bits at a time. This presumably saved area
> and power while also facilitating early out for small
> multiplicands, at the cost of some latency and substantial
> throughput compared to a fully pipelined multiplication. If I
> remember correctly, this produced a result for 16-bit by 32-bit
> multiplication, which is different from generating a low or high
> result.
> 

On an FPGA, one could almost argue for doing everything with an unsigned 
16*16->32 bit multiplier and some helper instructions.

Function calls to do multiply would be kinda lame though, as in cases 
like this, function call/return overhead can become a significant part 
of the total cycle cost (but the multiply sequence itself is still not 
really cheap enough to justify doing it inline).
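
For reference, a minimal sketch of what such a helper would compute, 
assuming the only primitive is an unsigned 16*16->32 multiply. The name 
mulu16 stands in for that primitive and mulu32_wide is a hypothetical 
helper; neither is from any real ISA or runtime.

```c
#include <stdint.h>

/* Stand-in for the hardware primitive: unsigned 16*16->32. */
static uint32_t mulu16(uint16_t a, uint16_t b) {
    return (uint32_t)a * (uint32_t)b;
}

/* Unsigned 32*32->64 built from four 16*16->32 partial products. */
uint64_t mulu32_wide(uint32_t a, uint32_t b) {
    uint16_t al = (uint16_t)a, ah = (uint16_t)(a >> 16);
    uint16_t bl = (uint16_t)b, bh = (uint16_t)(b >> 16);

    uint64_t ll = mulu16(al, bl);   /* weight 2^0  */
    uint64_t lh = mulu16(al, bh);   /* weight 2^16 */
    uint64_t hl = mulu16(ah, bl);   /* weight 2^16 */
    uint64_t hh = mulu16(ah, bh);   /* weight 2^32 */

    return ll + ((lh + hl) << 16) + (hh << 32);
}
```

That is four multiplies plus shifts and adds, which gives some idea of why 
it is awkward both as a call and as an inline sequence.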


>> And if it does fuse them then the internal uArch cost is the same as if
>> you had designed it optimally from the start, except now you have
>> to pay for a fuser.
>>
>> <sound of soap box being dragged out>
>> This idea that macro-op fusion is some magic solution is bullshit.
>> 1) It's not free.
> 
> Neither is increasing the number of opcodes or providing extender
> prefixes. If one wants binary compatibility, non-fusing
> implementations would work.
> 
> (I tend to favor providing a translation layer between software
> distribution format and instruction cache format, which reduces
> the binary compatibility constraint.)
> 

Yes, it would be preferable.
Sadly, no "good" VM has caught on.

As in, one that is sanely designed and can run C, C++, and similar 
effectively without being overly tied to a specific platform.

Among other things, this likely means the core C runtime library will 
need to run inside the VM, and probably be statically linked (C library 
implementations being prone to leak details, like how their "stdio" is 
implemented, etc).


>> 2) It only works where Decode can see *all* the required lookahead
>>     instructions, which means you have to pay for an N-lane decoder
>>     but only get 1 lane.
> 
> Most fusion is for two adjacent instructions, which significantly
> limits the complexity. The fusable patterns are also a subset of
> all pairs of two instructions, so complete two-way decoding may
> not be needed.
> 
> There may also be optimization opportunities from looking ahead.
> Mitch Alsup proposed such for branch handling in a scalar
> implementation. Apart from fusion, there might be advantages for
> avoiding bank conflicts in a banked register file. I.e., the cost
> of lookahead might be shared by multiple techniques/optimizations.
> 
> I tend to agree that fusion tends to be a workaround for sub-
> optimal instruction encoding, but it seems that encoding involves
> a lot of tradeoffs.
> 

Yeah...

However, the cost of doing fusion is higher than having longer-form 
variable-length instructions via prefixes...

If one wants a cheapish way to do prefixes on a 1-wide machine, they 
could transpose the instruction words during fetch, and then only need a 
single decoder.

So:
   WordA
   PrefixA WordB
   PrefixA PrefixB WordC

Is presented to the decoder as:
   WordA
   WordB PrefixA
   WordC PrefixB PrefixA

So, the decoder doesn't move...
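
A small software model of the transposition, as a sketch only. The 
encoding is invented for illustration: 16-bit words, a word counts as a 
prefix when its top four bits are all set, at most two prefixes, and 0 is 
used as "no prefix" padding. The names is_prefix and fetch_transposed are 
hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PREFIXES 2

/* Assumed encoding: top nibble 0xF marks a prefix word. */
static int is_prefix(uint16_t w) {
    return (w & 0xF000u) == 0xF000u;
}

/*
 * Reorders one instruction so the base word always lands in out[0],
 * followed by its prefixes from nearest to farthest.
 * Returns the number of words consumed, or 0 if the instruction is
 * incomplete in the available words or has too many prefixes.
 */
size_t fetch_transposed(const uint16_t *stream, size_t avail,
                        uint16_t out[MAX_PREFIXES + 1]) {
    size_t n = 0;
    while (n < avail && n < MAX_PREFIXES && is_prefix(stream[n]))
        n++;
    if (n >= avail || is_prefix(stream[n]))
        return 0;

    for (size_t i = 0; i <= n; i++)
        out[i] = stream[n - i];         /* base word first, then prefixes */
    for (size_t i = n + 1; i <= MAX_PREFIXES; i++)
        out[i] = 0;                     /* unused prefix slots */
    return n + 1;
}
```

With this layout the decoder always finds the base opcode in slot 0 and 
treats the remaining slots as optional extension bits.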

Possibly, a similar trick could be used for 2-wide with limited 
variable-length, but would get more complicated.


>> 3) It's probabilistic as it depends on how the fetch buffers get loaded.
>>     Eg if the fetch buffer contains a valid instruction but does not have
>>     a next instruction, do you stall Decode to see if a fuser might arrive
>>     or dispatch it anyway.
> 
> This is also somewhat true for variable length encodings that cross 
> fetch boundaries. In general a boundary-crossing instruction
> would probably stall even if such was not strictly necessary
> (e.g., if the missing information is opcode refinement — not
> related to instruction routing — or an immediate or even a
> register source identifier specifying a value that can have
> delayed use (e.g., value of a store, addend of a FMADD)).
> 
> This does seem a weakness, but fusion is not entirely negative
> factors.
> 
>> 4) It gets exponentially expensive if you start doing multiple instruction
>>     lanes because decode has to deal with all the permutations of
>>     fusion possibilities.
> 
> This is also a factor in mere superscalar decode/execute.
> Detecting that an instruction is dependent on another would
> normally stall the execution of that instruction.
> 
> (I feel that encoding some of the dependency information could
> be useful to avoid some of this work. In theory, common
> dependency detection could also be more broadly useful; e.g.,
> operand availability detection and execution/operand routing.)
> 

Superscalar is easier, as here one merely needs to categorize the 
instruction with feature flags (such as which lanes it is allowed in), 
and check for register dependencies.

This is less of an issue than what is needed for fusion (namely, 
detecting specific pairs of instructions via pattern matching); unless 
the fusion is very limited in scope.
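
To make the contrast concrete, a sketch of the two checks in C. 
Everything here is hypothetical: the DecodedOp layout, the lane flags, the 
opcode names, and the shift-then-add fusion pattern are invented for 
illustration and do not describe any particular core.

```c
#include <stdbool.h>
#include <stdint.h>

enum { LANE0 = 1u << 0, LANE1 = 1u << 1 };    /* lanes an op may issue in */
enum { OP_ADD, OP_MUL, OP_LOAD, OP_SHL };     /* made-up opcodes */

typedef struct {
    uint8_t opcode;
    uint8_t lanes;          /* feature flags from a per-op table */
    uint8_t rd, rs1, rs2;   /* destination and source registers */
} DecodedOp;

/* Superscalar pairing: generic flag test plus register compares. */
bool can_dual_issue(const DecodedOp *first, const DecodedOp *second) {
    if (!(first->lanes & LANE0) || !(second->lanes & LANE1))
        return false;
    if (second->rs1 == first->rd || second->rs2 == first->rd)
        return false;                           /* read-after-write */
    if (second->rd == first->rd)
        return false;                           /* write-after-write */
    return true;
}

/* Fusion: one specific opcode pair with one specific register shape.
 * Every additional fusable pattern needs another matcher like this. */
bool can_fuse_shl_add(const DecodedOp *first, const DecodedOp *second) {
    return first->opcode == OP_SHL
        && second->opcode == OP_ADD
        && second->rd == first->rd
        && second->rs1 == first->rd;
}
```

The pairing check is the same for every instruction, whereas the fusion 
check requires the dependency and names the opcodes, so its cost scales 
with the number of patterns supported.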


========== REMAINDER OF ARTICLE TRUNCATED ==========