Path: ...!news.roellig-ltd.de!news.mb-net.net!open-news-network.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Brett <ggtgp@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Wed, 18 Sep 2024 21:15:55 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 135
Message-ID: <vcffub$77jk$1@dont-email.me>
References: <vaqgtl$3526$1@dont-email.me> <p1cvdjpqjg65e6e3rtt4ua6hgm79cdfm2n@4ax.com>
 <2024Sep10.101932@mips.complang.tuwien.ac.at> <ygn8qvztf16.fsf@y.z>
 <2024Sep11.123824@mips.complang.tuwien.ac.at> <vbsoro$3ol1a$1@dont-email.me>
 <867cbhgozo.fsf@linuxsc.com> <20240912142948.00002757@yahoo.com>
 <vbuu5n$9tue$1@dont-email.me> <20240915001153.000029bf@yahoo.com>
 <vc6jbk$5v9f$1@paganini.bofh.team> <20240915154038.0000016e@yahoo.com>
 <vc70sl$285g2$4@dont-email.me> <vc73bl$28v0v$1@dont-email.me>
 <OvEFO.70694$EEm7.38286@fx16.iad> <32a15246310ea544570564a6ea100cab@www.novabbs.org>
 <vc7a6h$2afrl$2@dont-email.me> <50cd3ba7c0cbb587a55dd67ae46fc9ce@www.novabbs.org>
 <vc8qic$2od19$1@dont-email.me> <fCXFO.4617$9Rk4.4393@fx37.iad>
 <vcb730$3ci7o$1@dont-email.me> <7cBGO.169512$_o_3.43954@fx17.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 18 Sep 2024 23:15:56 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="bc80ef6b1351de8f33410d6d6d3d3557";
 logging-data="237172"; mail-complaints-to="abuse@eternal-september.org";
 posting-account="U2FsdGVkX1/nbBCoSvgG0tn63UBlQy/X"
User-Agent: NewsTap/5.5 (iPad)
Cancel-Lock: sha1:hpyjYfJm25ZmYZBbzfYxqWExeXg= sha1:D2iI24mb4NI2OMOoJP7oQFC2D4E=
Bytes: 7454

EricP <ThatWouldBeTelling@thevillage.com> wrote:
> Terje Mathisen wrote:
>> EricP wrote:
>>
>>> Codecs likely have to deal with double-width straddles a lot, whatever
>>> the register word size. So for them it likely happens at 64-bits already.
>>
>> Nothing likely about it: LZ4 is pretty much the only compression
>> algorithm/lossless codec that never straddles, all the rest tend to
>> treat the source data as a single bitstream of arbitrary length, except
>> for some built-in chunking mechanism which simplifies faster scanning.
>>
>> The core of the algorithm always starts with knowing the endianness,
>> then picking up 32 or 64-bit chunks of input data (byte-flipping if
>> needed) and then extracting the next N bits either from the top or bottom
>> of the buffer register.
>>
>> Almost by definition, this is not code that a compiler is set up to help
>> you get correct.
>>
>>>
>>> I added a bunch of instructions for dealing with double-width operations.
>>> The main ISA design decision is whether to have register pair specifiers,
>>> R0, R2, R4,... or two separate {r_high,r_low} registers.
>>> In either case the main uArch issue is that now instructions have an extra
>>> source register and two dest registers, which has a number of consequences.
>>> But once you bite the bullet on that it simplifies a lot of things,
>>> like how to deal with carry or overflow without flags,
>>> full width multiplies, divide producing both quotient and remainder.
>>
>> Very nice!
>>
>> This means that you can do integer IMAC(), right?
>>
>>   (hi, lo) = imac(a, b, c); // == a*b+c
>>
>> The only thing even nicer from the perspective of writing arbitrary
>> precision library code would be IMAA, i.e. a*b+c+d since that is the
>> largest combination which is guaranteed to never overflow the double
>> register target field.
>>
>> Terje
>>
>
> I thought about IMAC but it was a bit too much.
> And unlike FMA there is no precision gain in IMAC, just convenience.
> IMAC requires 6 register specifiers, 2 dest and 4 source if you don't
> care about overflow/carry on the accumulate.
>   2-wide = 2-wide + narrow * narrow
> It needs 7 registers, 3 dest and 4 source if you want overflow/carry
> on the accumulate.
>   3-wide = 2-wide + narrow * narrow
>
> I wanted to support checked arithmetic which means full width multiplies.
> And I was always bothered by the risc approach of MULL (low part) and
> MULH (high part) where they do most of the multiply then toss half away
> just because they won't have 2 dest registers.

I always assumed that MULH just grabbed the part that would otherwise have
been thrown away. And that is how at least one RISC-V core does it:

https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit

They claim 5 cycles, but it should be six: five for the multiply and one
more for the second result, unless the next instruction does not need a
write port and does not use the result. You can get a throughput of
5 cycles with smart coding, but that rarely happens without effort.

> So what else can I do with 2 dest registers? Wide add and sub.
> Various wide Add,Sub solves the missing carry/overflow flags problems.
>
> FMA already requires 3 source registers.
> Besides Add,Sub,Mul what else can one do with 3 source and 2 dest registers?
> Wide shifts and wide bit-field extract and insert.
>
> I went with two (r_hi,r_lo) register specifiers because it gave programmers
> more flexibility. I played a bit with even register pairs (R0, R2, R4...)
> and found one had to do extra MOVs just to form a pair.
> (r_hi,r_lo) cost a longer instruction format but I have a variable length
> instruction so it's mostly wider fetch and decode pathways to handle
> the worst case instruction size.
>
> W = Wide = (hi,lo) register pair, N = Narrow = one register.
>
> Add forms:
>   Add    N = N + N        // No carry out
>   Add3   N = N + N + N    // No carry out
>   Addw2  W = N + N        // Generate carry
>   Addw3  W = N + N + N    // Generate + propagate carry
>   Addw1  W = W + N        // Propagate carry
>
> Same for subtract wide.
> The three Add forms are chosen to make multi-precision integer
> multiply easier. See below.
>
>   Muluw  W = N * N
>   Mulsw  W = N * N
>
>   Divuw  (quo,rem) = N / N
>   Divsw  (quo,rem) = N / N
>
>   Shllw  W = W << size   // Shift left logical
>   Shlaw  W = W << size   // Shift left arithmetic, fault on signed overflow
>   Shrlw  W = W >> size   // Shift right logical
>   Shraw  W = W >> size   // Shift right arithmetic, sign extend
>   Shrnw  W = W >> size   // Shift right numeric, round -1 to zero
>
>   Bfextu N = extract (W, size, position)     // Bit-field extract, zero extend
>   Bfexts N = extract (W, size, position)     // Bit-field extract, sign extend
>   Bfins  W = insert (W, N, size, position)   // Bit-field insert
>
> =====================================
> Example unsigned 128 * 128 => 256 multiply:
>
>   // Unsigned Multiply 128*128 => 256
>   // (r3,r2)*(r1,r0) => (r3,r2,r1,r0)
>   // Uses r4,r5,r6,r7,r8 as temp registers
>   //
>   muluw r5,r4 = r3*r0
>   muluw r6,r0 = r2*r0
>   muluw r8,r7 = r2*r1
>   muluw r3,r2 = r3*r1
>   addw3 r4,r1 = r4+r6+r7
>   addw3 r5,r2 = r5+r8+r2
>   addw2 r4,r2 = r2+r4
>   add3  r3 = r3+r5+r4
>
> The reason I prefer the separate (r_hi,r_lo) pair specifiers rather
> than the even number register pairs R0,R2,R4... is because the above
> sequence would require extra moves to form the even numbered pairs.
> With separate pairs one can select registers so that everything lands
> in the right dest at the right time.
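
For anyone who wants to check the quoted 128*128 => 256 sequence, here is a
minimal Python sketch that models each instruction and replays the eight
steps in order. The semantics are my assumptions from the instruction list,
not a definitive model of EricP's ISA: narrow registers are 64 bits, a wide
result is returned as (hi, lo), and the hi half of addw2/addw3 holds the
carry out. The helper names mul128 and check are made up for illustration.

```python
MASK = (1 << 64) - 1  # assumption: one narrow register is 64 bits


def muluw(a, b):
    """Muluw W = N * N: full unsigned product, returned as (hi, lo)."""
    p = a * b
    return p >> 64, p & MASK


def addw3(a, b, c):
    """Addw3 W = N + N + N: returns (carry, sum); carry is 0, 1 or 2."""
    s = a + b + c
    return s >> 64, s & MASK


def addw2(a, b):
    """Addw2 W = N + N: returns (carry, sum); carry is 0 or 1."""
    s = a + b
    return s >> 64, s & MASK


def add3(a, b, c):
    """Add3 N = N + N + N: no carry out, result wraps to 64 bits."""
    return (a + b + c) & MASK


def mul128(r3, r2, r1, r0):
    """(r3,r2) * (r1,r0) => (r3,r2,r1,r0), same register use as the quote."""
    r5, r4 = muluw(r3, r0)      # A_hi * B_lo, weight 2^64
    r6, r0 = muluw(r2, r0)      # A_lo * B_lo, weight 2^0; r0 is final word 0
    r8, r7 = muluw(r2, r1)      # A_lo * B_hi, weight 2^64
    r3, r2 = muluw(r3, r1)      # A_hi * B_hi, weight 2^128
    r4, r1 = addw3(r4, r6, r7)  # word 1, carry into word 2 lands in r4
    r5, r2 = addw3(r5, r8, r2)  # word 2 partial sum, carry lands in r5
    r4, r2 = addw2(r2, r4)      # fold the word 1 carry into word 2
    r3 = add3(r3, r5, r4)       # word 3 absorbs both carries, cannot overflow
    return r3, r2, r1, r0


def check(a, b):
    """Compare the modeled sequence against Python's exact a * b."""
    r3, r2, r1, r0 = mul128(a >> 64, a & MASK, b >> 64, b & MASK)
    return ((r3 << 192) | (r2 << 128) | (r1 << 64) | r0) == a * b
```

Calling check(a, b) with any two values below 2^128 compares the replayed
sequence against Python's arbitrary-precision product. Note that every
destination in mul128 is chosen freely, which is the point about separate
(r_hi,r_lo) specifiers: no moves are needed to line up the pairs.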