Article <vcffub$77jk$1@dont-email.me>

Path: ...!news.roellig-ltd.de!news.mb-net.net!open-news-network.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Brett <ggtgp@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Wed, 18 Sep 2024 21:15:55 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 135
Message-ID: <vcffub$77jk$1@dont-email.me>
References: <vaqgtl$3526$1@dont-email.me>
 <p1cvdjpqjg65e6e3rtt4ua6hgm79cdfm2n@4ax.com>
 <2024Sep10.101932@mips.complang.tuwien.ac.at>
 <ygn8qvztf16.fsf@y.z>
 <2024Sep11.123824@mips.complang.tuwien.ac.at>
 <vbsoro$3ol1a$1@dont-email.me>
 <867cbhgozo.fsf@linuxsc.com>
 <20240912142948.00002757@yahoo.com>
 <vbuu5n$9tue$1@dont-email.me>
 <20240915001153.000029bf@yahoo.com>
 <vc6jbk$5v9f$1@paganini.bofh.team>
 <20240915154038.0000016e@yahoo.com>
 <vc70sl$285g2$4@dont-email.me>
 <vc73bl$28v0v$1@dont-email.me>
 <OvEFO.70694$EEm7.38286@fx16.iad>
 <32a15246310ea544570564a6ea100cab@www.novabbs.org>
 <vc7a6h$2afrl$2@dont-email.me>
 <50cd3ba7c0cbb587a55dd67ae46fc9ce@www.novabbs.org>
 <vc8qic$2od19$1@dont-email.me>
 <fCXFO.4617$9Rk4.4393@fx37.iad>
 <vcb730$3ci7o$1@dont-email.me>
 <7cBGO.169512$_o_3.43954@fx17.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 18 Sep 2024 23:15:56 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="bc80ef6b1351de8f33410d6d6d3d3557";
	logging-data="237172"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/nbBCoSvgG0tn63UBlQy/X"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:hpyjYfJm25ZmYZBbzfYxqWExeXg=
	sha1:D2iI24mb4NI2OMOoJP7oQFC2D4E=
Bytes: 7454

EricP <ThatWouldBeTelling@thevillage.com> wrote:
> Terje Mathisen wrote:
>> EricP wrote:
>> 
>>> Codecs likely have to deal with double-width straddles a lot, whatever
>>> the register word size. So for them it likely happens at 64-bits already.
>> 
>> Nothing likely about it: LZ4 is pretty much the only compression 
>> algorithm/lossless codec that never straddles, all the rest tend to 
>> treat the source data as a single bitstream of arbitrary length, except 
>> for some built-in chunking mechanism which simplifies faster scanning.
>> 
>> The core of the algorithm always starts with knowing the endianness, 
>> then picking up 32 or 64-bit chunks of input data (byte-flipping if 
>> needed) and then extracting the next N bits either from the top or bottom 
>> of the buffer register.
>> 
>> Almost by definition, this is not code that a compiler is set up to help 
>> you get correct.
>> 
>>> 
>>> I added a bunch of instructions for dealing with double-width operations.
>>> The main ISA design decision is whether to have register pair specifiers,
>>> R0, R2, R4,... or two separate {r_high,r_low} registers.
>>> In either case the main uArch issue is that now instructions have an
>>> extra source register and two dest registers, which has a number of
>>> consequences.
>>> But once you bite the bullet on that it simplifies a lot of things,
>>> like how to deal with carry or overflow without flags,
>>> full width multiplies, divide producing both quotient and remainder.
>> 
>> Very nice!
>> 
>> This means that you can do integer IMAC(), right?
>> 
>> (hi, lo) = imac(a, b, c); // == a*b+c
>> 
>> The only thing even nicer from the perspective of writing arbitrary 
>> precision library code would be IMAA, i.e. a*b+c+d since that is the 
>> largest combination which is guaranteed to never overflow the double 
>> register target field.
>> 
>> Terje
>> 
> 
> I thought about IMAC but it was a bit too much.
> And unlike FMA there is no precision gain in IMAC, just convenience.
> IMAC requires 6 register specifiers, 2 dest and 4 source if you don't
> care about overflow/carry on the accumulate.
>   2-wide = 2-wide + narrow * narrow
> It needs 7 registers, 3 dest and 4 source if you want overflow/carry
> on the accumulate.
>   3-wide = 2-wide + narrow * narrow
> 
> I wanted to support checked arithmetic which means full width multiplies.
> And I was always bothered by the RISC approach of MULL (low part) and
> MULH (high part) where they do most of the multiply then toss half away
> just because they won't have 2 dest registers.
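
The a*b+c+d bound quoted above is easy to check numerically. This is only a
sketch: imaa() is a hypothetical model of the proposed instruction, not
anything from a real ISA, and it assumes unsigned narrow operands of `bits`
bits with the result returned as a (hi, lo) pair.

```python
def imaa(a: int, b: int, c: int, d: int, bits: int = 64) -> tuple[int, int]:
    """Hypothetical IMAA: return (hi, lo) of a*b + c + d for unsigned inputs."""
    mask = (1 << bits) - 1
    assert all(0 <= x <= mask for x in (a, b, c, d)), "operands must be narrow"
    total = a * b + c + d
    # Never fires: (2^n - 1)^2 + 2*(2^n - 1) == 2^(2n) - 1, which fits in 2n bits.
    assert total >> (2 * bits) == 0
    return total >> bits, total & mask

# Worst case: every operand is all ones, and both result halves are all ones.
m = (1 << 64) - 1
hi, lo = imaa(m, m, m, m)
assert (hi, lo) == (m, m)

# A third narrow addend is one too many for a two-register destination.
assert m * m + 3 * m > (1 << 128) - 1
```

The same arithmetic shows why the 2-wide accumulate form needs a third dest
register for the carry: a full 2-wide addend plus a full product can exceed
2n bits.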

I always assumed that MULH just grabbed the part that would have been
thrown away. And that is how at least one RISC-V core does it:

https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit

They claim 5 cycles, but it should be six: five for the multiply and one more
for the second result, unless the next instruction neither needs a write port
nor uses the result. You can get a throughput of 5 cycles with smart coding,
but that rarely happens without effort.
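
To make the "part that would have been thrown away" point concrete, here is
a rough model of the low/high split next to a single two-destination
multiply. It assumes 32-bit registers and unsigned operands (so it
corresponds to RISC-V MUL and MULHU rather than the signed MULH); the
function names mul, mulhu and muluw are just labels for this sketch, and
nothing here models cycle counts or write ports.

```python
XLEN = 32                      # assumed register width for this sketch
MASK = (1 << XLEN) - 1

def mul(a: int, b: int) -> int:
    """Low half of the unsigned product (what MULL / RISC-V MUL keeps)."""
    return ((a & MASK) * (b & MASK)) & MASK

def mulhu(a: int, b: int) -> int:
    """High half of the same unsigned product (RISC-V MULHU)."""
    return ((a & MASK) * (b & MASK)) >> XLEN

def muluw(a: int, b: int) -> tuple[int, int]:
    """Two-destination form: one multiply, both halves returned as (hi, lo)."""
    p = (a & MASK) * (b & MASK)
    return p >> XLEN, p & MASK

a, b = 0xFFFFFFFF, 0xFFFFFFFF
# The separate pair and the single wide multiply produce the same two words.
assert (mulhu(a, b), mul(a, b)) == muluw(a, b) == (0xFFFFFFFE, 0x00000001)
```

Both halves come out of the same full product, so the only question for the
hardware is whether the second half gets written back or recomputed.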

> So what else can I do with 2 dest registers? Wide add and sub.
> The various wide Add and Sub forms solve the missing carry/overflow flags problem.
> 
> FMA already requires 3 source registers.
> Besides Add, Sub, Mul, what else can one do with 3 source and 2 dest registers?
> Wide shifts and wide bit-field extract and insert.
> 
> I went with two (r_hi,r_lo) register specifiers because it gave programmers
> more flexibility. I played a bit with even register pairs (R0, R2, R4...)
> and found one had to do extra MOVs just to form a pair.
> (r_hi,r_lo) costs a longer instruction format, but I have a variable-length
> instruction set, so it's mostly wider fetch and decode pathways to handle
> the worst-case instruction size.
> 
> W = Wide = (hi,lo) register pair, N = Narrow = one register.
> 
> Add forms:
> Add   N = N + N        // No carry out
> Add3  N = N + N + N    // No carry out
> Addw2 W = N + N        // Generate carry
> Addw3 W = N + N + N    // Generate + propagate carry
> Addw1 W = W + N        // Propagate carry
> 
> Same for subtract wide.
> The three wide Add forms are chosen to make multi-precision integer
> multiply easier. See below.
> 
> Muluw W = N * N
> Mulsw W = N * N
> 
> Divuw  (quo,rem) = N / N
> Divsw  (quo,rem) = N / N
> 
> Shllw  W = W << size  // Shift left logical
> Shlaw  W = W << size  // Shift left arithmetic, fault on signed overflow
> Shrlw  W = W >> size  // Shift right logical
> Shraw  W = W >> size  // Shift right arithmetic, sign extend
> Shrnw  W = W >> size  // Shift right numeric, round -1 to zero
> 
> Bfextu N = extract (W, size, position)    // Bit-field extract, zero extend
> Bfexts N = extract (W, size, position)    // Bit-field extract, sign extend
> Bfins  W = insert  (W, N, size, position) // Bit-field insert
> 
> =====================================
> Example unsigned 128 * 128 => 256 multiply:
> 
> // Unsigned Multiply 128*128 => 256
> // (r3,r2)*(r1,r0) => (r3,r2,r1,r0)
> // Uses r4,r5,r6,r7,r8 as temp registers
> //
> muluw r5,r4 = r3*r0
> muluw r6,r0 = r2*r0
> muluw r8,r7 = r2*r1
> muluw r3,r2 = r3*r1
> addw3 r4,r1 = r4+r6+r7
> addw3 r5,r2 = r5+r8+r2
> addw2 r4,r2 = r2+r4
> add3     r3 = r3+r5+r4
> 
> The reason I prefer the separate (r_hi,r_lo) pair specifiers rather
> than the even-numbered register pairs R0,R2,R4... is because the above
> sequence would require extra moves to form the even-numbered pairs.
> With separate pairs one can select registers so that everything lands
> in the right dest at the right time.
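
The quoted eight-instruction sequence can be checked by simulating it
directly. This is a sketch under assumptions: 64-bit registers, the first
destination of each wide op holding the high word (or the carry, for the
adds), and addw3 producing a carry of 0, 1 or 2. The Python helpers below
only mirror the mnemonics in the listing; they are not a definition of the
actual ISA.

```python
import random

BITS = 64
MASK = (1 << BITS) - 1

def muluw(a, b):            # W = N * N, returned as (hi, lo)
    p = a * b
    return p >> BITS, p & MASK

def addw2(a, b):            # W = N + N, returned as (carry, sum)
    s = a + b
    return s >> BITS, s & MASK

def addw3(a, b, c):         # W = N + N + N, returned as (carry, sum)
    s = a + b + c
    return s >> BITS, s & MASK

def add3(a, b, c):          # N = N + N + N, no carry out
    return (a + b + c) & MASK

def mul128(a, b):
    """(r3,r2)*(r1,r0) => (r3,r2,r1,r0), following the quoted sequence line by line."""
    r = [0] * 9
    r[3], r[2] = a >> BITS, a & MASK
    r[1], r[0] = b >> BITS, b & MASK
    r[5], r[4] = muluw(r[3], r[0])
    r[6], r[0] = muluw(r[2], r[0])
    r[8], r[7] = muluw(r[2], r[1])
    r[3], r[2] = muluw(r[3], r[1])
    r[4], r[1] = addw3(r[4], r[6], r[7])
    r[5], r[2] = addw3(r[5], r[8], r[2])
    r[4], r[2] = addw2(r[2], r[4])
    r[3] = add3(r[3], r[5], r[4])
    return (r[3] << 192) | (r[2] << 128) | (r[1] << 64) | r[0]

# Compare against Python's arbitrary-precision multiply on random inputs.
rng = random.Random(1)
for _ in range(1000):
    x, y = rng.getrandbits(128), rng.getrandbits(128)
    assert mul128(x, y) == x * y
```

Note that every destination in mul128() is either a scratch register or a
source that is dead by then, which is the register-selection freedom the
separate (r_hi,r_lo) specifiers are meant to give.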