From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Making Lemonade (Floating-point format changes)
Date: Wed, 15 May 2024 13:44:25 -0500
Message-ID: <v22vqg$11i4m$1@dont-email.me>
References: <abe04jhkngt2uun1e7ict8vmf1fq8p7rnm@4ax.com> <v1qga3$2odn2$1@dont-email.me> <20240515120713.00001904@yahoo.com>
In-Reply-To: <20240515120713.00001904@yahoo.com>

On 5/15/2024 4:07 AM, Michael S wrote:
> On Sun, 12 May 2024 15:30:40 +0200
> wolfgang kern <nowhere@never.at> wrote:
>
>> On 12/05/2024 05:44, John Savard wrote:
>>> I've made another long-overdue change in the Concertina II
>>> architecture on the page about 17-bit instructions.
>>>
>>> Since I describe the individual instructions there, with their
>>> opcodes and what they do, I've illustrated the floating-point
>>> formats of the architecture on that page.
>>>
>>> The good people in charge of the IEEE 754 standard had seen fit to
>>> define a standard 128-bit floating-point format which included a
>>> hidden first bit.
>>>
>>> This annoyed me greatly, because I was going to take the 8087's
>>> temporary real format, and extend the mantissa for my 128-bit
>>> format.
>>>
>>> I've decided that it's necessary to fully accept the 128-bit
>>> standard and support it in a consistent manner.
>>>
>>> Therefore, I have taken the following actions:
>>>
>>> I have dropped the option of supporting 80-bit temporary reals
>>> entirely, as they are now incompatible as an internal format.
>>>
>>> I have instead defined a 256-bit format for floats which does not
>>> have a hidden first bit, which looks like the old temporary reals,
>>> except that the exponent field is one bit wider.
>>>
>>> And in addition, just as the IBM 704 used two single-precision
>>> floats to make a double-precision float, and the IBM System/360
>>> Model 85 started using two double-precision floats to make an
>>> extended precision float... I've defined how the 256-bit internal
>>> format floats can be doubled up to make a 512-bit float.
>>>
>>> I'm not really sure such floating-point precision is useful, but I
>>> do remember some people telling me that higher float precision is
>>> indeed something to be desired. Well, the IEEE 754 standard has
>>> forced my hand.
>>
>> YES, I'd use something similar:
>> I never cared nor supported any odd 10 byte formats and I give a fart
>> to all these weird IEEE standards.
>>
>
> I suppose, it's mutual.
>

In my case, I care about what the IEEE standard says only to the extent
that it seems relevant and justified.

In practice, this mostly means drawing a line in the sand for subnormal
numbers and stuff that exists entirely in the sub-ULP domain. If you
need to roughly double the cost of the FPU for the sake of a fraction
of a bit of rounding accuracy, that does not seem justifiable.
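For reference, the usual cheap way of drawing that line is to flush
subnormals to zero (DAZ/FTZ style). A rough sketch of the input side,
assuming the standard binary64 bit layout (the function name here is
mine, purely for illustration):

  #include <stdint.h>

  /* If the exponent field is all zeros, the value is zero or
     subnormal; keep only the sign bit, i.e. flush to signed zero. */
  static uint64_t daz64(uint64_t bits)
  {
      uint64_t exp = (bits >> 52) & 0x7FF;  /* 11-bit exponent field */
      return (exp == 0) ? (bits & 0x8000000000000000ull) : bits;
  }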
One could argue that determinism is important, but determinism can be
achieved more cheaply via other means:
  Truncate rounding;
  Explicitly discarding low-order result bits.

A fair amount of fixed-point code already sidesteps the low-order
result bits by right-shifting the inputs before the multiply. So,
rather than, say:
  z=(x*y)>>16;
you have:
  z=(x>>8)*(y>>8);

Though, effectively discarding half the mantissa on input for FMUL
would be undesirable, as it would significantly reduce precision.
Usually this is done in cases where speed matters more than accuracy,
and where a full-width multiply would require using a slower multiply
internally (such as a "long long" multiply). In my ISA though, I partly
addressed this scenario by adding widening 32-bit multiply ops
(32*32 -> 64).

One could instead define the multiplier as, say:
  z=((x>>8)*(y>>8))+
    (((x&255)*(y>>8))>>8)+
    (((x>>8)*(y&255))>>8);
But, granted, this still produces some intermediate low-order bits only
to discard them. (A small standalone demo of these variants is sketched
at the end of this post.)

Though, this pattern isn't too far off from what my FPU uses (splitting
the multiply up among the DSP multipliers and a few smaller LUT-based
multipliers to try to fudge the low-order bits). Annoyingly, the exact
pattern for a strict truncate would cost more to implement than these
inexact constructions when dealing with hard-logic multipliers.

For the most part, application code doesn't care... Though, if a
program tries to use a Newton-Raphson loop that terminates when the
result converges exactly, it will tend to get stuck in an infinite
loop, as exact convergence is never achieved. The typical workaround is
to use a fixed number of loop iterations instead (also sketched below).
Granted, I will not claim this is a perfect solution, but it is mostly
"good enough".

The results of strict truncation would be more obvious IME, mostly in
that calculations that feed back into themselves (and assume
round-nearest) will tend to drift. But, for the most part, a "round the
low 8 bits unless it would result in a carry" scheme is also "good
enough"...
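As promised, a quick standalone demo of the three multiplier variants
above. The Q16.16 format, test values, and function names are mine,
purely for illustration (and the narrow variants assume inputs small
enough that the 32-bit products don't overflow):

  #include <stdio.h>
  #include <stdint.h>

  /* Full precision: needs a 64-bit intermediate; this is what a
     widening 32*32->64 multiply op gives you directly. */
  static int32_t mul_full(int32_t x, int32_t y)
  {
      return (int32_t)(((int64_t)x * y) >> 16);
  }

  /* Pre-shift: stays within a 32-bit multiply, but throws away the
     low 8 bits of each input, so the low result bits are garbage. */
  static int32_t mul_preshift(int32_t x, int32_t y)
  {
      return (x >> 8) * (y >> 8);
  }

  /* Three partial products: recovers most of the lost precision
     while still using only narrower multiplies. */
  static int32_t mul_3pp(int32_t x, int32_t y)
  {
      return ((x >> 8) * (y >> 8)) +
             (((x & 255) * (y >> 8)) >> 8) +
             (((x >> 8) * (y & 255)) >> 8);
  }

  int main(void)
  {
      int32_t x = (int32_t)(3.14159 * 65536);  /* ~pi, Q16.16 */
      int32_t y = (int32_t)(2.71828 * 65536);  /* ~e,  Q16.16 */
      printf("full:     %08X\n", (unsigned)mul_full(x, y));
      printf("preshift: %08X\n", (unsigned)mul_preshift(x, y));
      printf("3pp:      %08X\n", (unsigned)mul_3pp(x, y));
      return 0;
  }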
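And the Newton-Raphson point, sketched as a reciprocal loop,
x = x*(2 - d*x). The names and iteration count are mine, and the naive
seed of 1.0 only converges for 0 < d < 2:

  #include <stdio.h>

  /* Risky on a truncating FPU: if x never settles on one exact value
     (say, it oscillates between two adjacent representable values),
     this loop never terminates. */
  static double recip_exact(double d)
  {
      double x = 1.0, prev;
      do { prev = x; x = x * (2.0 - d * x); } while (x != prev);
      return x;
  }

  /* The usual workaround: a fixed iteration count. Each step roughly
     doubles the number of good bits, so a handful of iterations is
     plenty for double. */
  static double recip_fixed(double d)
  {
      double x = 1.0;
      for (int i = 0; i < 8; i++)
          x = x * (2.0 - d * x);
      return x;
  }

  int main(void)
  {
      printf("1/1.25 ~= %.17g\n", recip_fixed(1.25));
      return 0;
  }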
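For completeness, my reading of that last scheme, as a sketch (this is
a guess at the intent, not a description of any specific FPU): round to
nearest, but if the rounding increment would carry out of the low 8
bits of the kept mantissa, fall back to truncation so the carry chain
stays short:

  #include <stdint.h>

  /* 'frac' carries GBITS extra guard bits below the mantissa. */
  #define GBITS 8

  static uint64_t round_low8(uint64_t frac)
  {
      uint64_t trunc = frac >> GBITS;
      uint64_t rbit  = (frac >> (GBITS - 1)) & 1;  /* nearest bit */
      uint64_t low8  = (trunc & 0xFF) + rbit;
      if (low8 > 0xFF)      /* increment would carry past bit 7, */
          return trunc;     /* so just truncate instead          */
      return (trunc & ~(uint64_t)0xFF) | low8;
  }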