From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Misc: Applications of small floating point formats.
Date: Thu, 1 Aug 2024 00:31:51 +0000
Organization: Rocksolid Light
Message-ID: <61e1f6f5f04ad043966b326d99e38928@www.novabbs.org>
References: <v8ehgr$1q8sr$1@dont-email.me>

On Wed, 31 Jul 2024 23:31:35 +0000, BGB wrote:

> So, say, we have common formats:
>   Binary64, S.E11.F52, Common Use
>   Binary32, S.E8.F23, Common Use
>   Binary16, S.E5.F10, Less Common Use
>
> But, things get funky below this:
>   A-Law: S.E3.F4 (Bias=8)
>   FP8:   S.E4.F3 (Bias=7) (E4M3 in NVIDIA terms)
>   FP8U:  E4.F4 (Bias=7)
>   FP8S:  E4.F3.S (Bias=7)
>
> Semi-absent in my case:
>   BFloat16: S.E8.F7
>     Can be faked in software in my case using Shuffle ops.
>   NVIDIA E5M2 (S.E5.F2)
>     Could be faked using RGBA32 pack/unpack ops.

So, you have identified the problem:: 8 bits contain insufficient
exponent and fraction width to be considered a standard format. Thus,
in order to utilize 8-bit FP, one needs several incarnations. This
just points back at the problem:: FP needs at least 10 bits.

> No immediate plans to add these latter cases as (usually) I have a
> need for more precision rather than more exponent range. The main
> seeming merit of these formats is that they are truncated forms of
> the wider formats.
>
> No need to elaborate on the use-cases for Binary32 and Binary64:
> wide and varied.

There is a growing clamor for 128-bit FP, too.

> Binary16 is useful for graphics, probably, and audio processing.

Insufficient data width, as high-quality Audio has gone to 24 bits
(~120 dB S/N). You can call MP3 and other "phone" formats Audio, but
please refrain from using the term High Quality when doing so.

> Seemingly IEEE specifies it mostly for storage and not for
> computation, but for these cases it is good enough for computation
> as well.
>
> Binary16 is mostly sufficient for 3D model geometry, and for small
> 3D scenes, but not really for 3D computations or larger scenes
> (using it for transform or projection matrices or matrix multiply
> does not give acceptable results).
>
> It does work well for fast sin/cos lookup tables (if supported
> natively), say, because the error of storing an angle as 1/256 of a
> circle is larger than the error introduced by the 10-bit mantissa.
>
> I had also used it as the computational format in a lot of my
> neural-net experiments.

I have seen NNs use compressed FP formats where 0.0 uses 1 bit and
1.0 uses but 2 bits. ...

> The 8-bit formats get a bit more niche; the main use-cases are
> mostly to save memory.

Sometimes power, also.

> FP8S originally exists because it was cheaper to encode/decode
> alongside FP8U, vs traditional FP8. Originally, FP8S replaced FP8,
> but now FP8 has been re-added. I couldn't simply replace FP8S with
> FP8, partly as it seems my existing binaries depend on FP8S in a
> few places, and so simply replacing it would have broken them.
>
> So, the options were to either add some separate ops for FP8, or
> just live with using my wonky/non-standard FP8S format (or break my
> existing binaries). Ended up deciding to re-add FP8.

Or don't do it that way.

> FP8 is apparently used by NVIDIA GPUs, and also apparently by
> PyTorch and a few other things. The variant used in my case is
> seemingly fairly similar to that used by NVIDIA and PyTorch.

If you are going to do an F8, make it compatible with OpenGL.

> Unlike the minifloat format described on Wikipedia (which had
> defined it as following IEEE 754 rules), it differs from IEEE rules
> in the handling of large and small values: no separate Inf/NaN
> range, rather the largest value serves as an implicit combined
> Inf/NaN, with the smallest value understood as 0.
>
> The main difference here between FP8 and FP8S is the location of
> the sign bit (putting it in the LSB initially allowed avoiding some
> MUX'ing when paired with FP8U).
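For concreteness, the decode side can be sketched in C. This is a
sketch from the description as quoted (not BGB's actual code), and
the handling of the nonzero e==0 encodings is an assumption:

#include <math.h>
#include <stdint.h>

/* FP8 (S.E4.F3, Bias=7), sign in the MSB.  Assumptions: the
   all-zero encoding is 0, the top encoding (e==15, m==7) is the
   combined Inf/NaN described above, and the rest of the e==0 band
   decodes as normal values (no subnormals). */
float fp8_decode(uint8_t v)
{
    int s = (v >> 7) & 1;
    int e = (v >> 3) & 15;
    int m = v & 7;
    if (e == 0 && m == 0)
        return s ? -0.0f : 0.0f;
    if (e == 15 && m == 7)
        return NAN;                 /* combined Inf/NaN */
    float f = (1.0f + m / 8.0f) * ldexpf(1.0f, e - 7);
    return s ? -f : f;
}

/* FP8S (E4.F3.S) is the same format with the sign moved to the
   LSB, so its decoder is the FP8 decoder behind a bit reorder. */
float fp8s_decode(uint8_t v)
{
    return fp8_decode((uint8_t)(((v & 1) << 7) | (v >> 1)));
}

In hardware that reorder is just wires, which is presumably why the
sign-bit position was cheap to play with in the first place.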
> The re-added FP8 was instead overlapped with the unpack logic used
> for A-Law (even with the obvious difference...).
>
> The encoder-side logic for FP8 can be implemented by taking the
> FP8S output and reordering the bits (in an "assign"). Though, doing
> this on the decoder input side would not likely have saved anything
> (attempts to MUX on the input side seemingly tend to result in
> duplicating any LUTs that follow afterwards).
>
> Though, one could almost argue for combining all 4 cases into
> shared encoder/decoder modules (well, since 3 of the 4 formats have
> the mantissa and exponent bits in the same place, FP8 being the odd
> one out, and A-Law being off-by-1 in terms of Bias).

That combination is well served with a single 10-bit FP format.

> This appears to be similar to what NV and PyTorch use, and it also
> overlaps with my handling of A-Law (though, the largest possible
> value of A-Law is understood as ~0.969).
>
> A-Law has slightly higher precision, but is normally limited to
> unit range. The main use-case is in representing audio, but it was
> sometimes also used when a small unit-range format was needed and
> precision wasn't a priority.
>
> For example, with slight fudging, it can be used to store unit
> quaternions, among other things. It is basically accurate enough to
> store things like object orientations and 3D camera rotations.
> Though, generally, it is necessary to normalize the quaternion
> after unpacking it.
>
> Ironically, for A-Law, my implementation and typical use differ
> from how it is usually stored in WAV files, in that in WAV files it
> is generally XOR'ed with 0x55, but this is an easy enough fix when
> loading audio data or similar.
>
> There is also u-Law, but u-Law isn't really a minifloat format.
>
> These formats can also be used for pixel data, though FP8U often
> made more sense for RGBA values (generally, negative RGBA isn't
> really a thing).
>
> However, pixel values may go outside unit range, so A-Law doesn't
> work for HDR pixel data. The use of FP8 or FP8S works, but gives
> lower quality than FP8U. Here, FP8U gives slightly better quality
> than RGB555 over the LDR range, whereas FP8 or FP8S is slightly
> worse for bright values (1 bit less accuracy between 0.5 and 1.0).
>
> For normal bitmap graphics, I am mostly using RGB555 at present,
> though.
>
> There isn't yet a fast conversion path between RGB555 and
> floating-point formats, but, say:
>   RGB5UPCK64  //Unpack RGB555 to 4x WORD
>   PCVTUW2H    //Packed Word to Half (1.0 .. 2.0)
>   PADD.H      //To adjust DC bias to 0.0 .. 1.0
>   ? PSTCM8UH  //to FP8U (typical option for HDR RGBA pixel data)
>   ? PSTCF8H   //to FP8 (newly added)
>
> But, the crufty Word<->Half SIMD conversions exist mostly because
> it would have been more expensive to support "better" SIMD
> converters (the DC bias offset allowed doing the conversions via
> repacking the bits, whereas unit-range conversions would have
> required the more expensive path of adding the format conversion
> logic to the SIMD FADD units).
>
> Note that most of the SIMD format converters exist as applied use
> of bit-twiddling (and generally no rounding or similar, as rounding
> would add considerable amounts of cost here...).
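As a scalar C sketch of that repack (my reading of the description;
the channel expansion is a guess at what RGB5UPCK64 produces, so
treat the details as hypothetical):

#include <stdint.h>

/* Hypothetical RGB5UPCK64-style channel expansion: widen a 5-bit
   RGB555 channel to a 16-bit word by shifting it to the top. */
static uint16_t chan5_to_word(uint8_t c5)
{
    return (uint16_t)(c5 & 31) << 11;
}

/* The Word->Half repack: drop the top 10 bits of the word into a
   binary16 mantissa under a fixed exponent field of 15 (0x3C00,
   i.e. 2^0), so a word w maps to 1.0 + w/65536 in [1.0, 2.0).
   Truncating, no rounding, per the description above.  The PADD.H
   of -1.0 afterwards is an ordinary packed FP16 add that removes
   the DC bias. */
static uint16_t word_to_half_biased(uint16_t w)
{
    return (uint16_t)(0x3C00 | (w >> 6));
}

For c5 == 31 this gives 1.96875, i.e. 0.96875 once the bias is
removed; coincidentally or not, the same 31/32 as the ~0.969 A-Law
ceiling mentioned above.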
> Though, cases needing fast conversion of pixel data between RGB555
> and floating-point forms have been uncommon (most pixel math
> starting from RGB555 tends to remain on the integer side of
> things).
>
> If TKRA-GL were using HDR, the most likely option here is:

========== REMAINDER OF ARTICLE TRUNCATED ==========