From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Misc: Applications of small floating point formats.
Date: Thu, 1 Aug 2024 00:31:51 +0000
Organization: Rocksolid Light
Message-ID: <61e1f6f5f04ad043966b326d99e38928@www.novabbs.org>
References: <v8ehgr$1q8sr$1@dont-email.me>

On Wed, 31 Jul 2024 23:31:35 +0000, BGB wrote:

> So, say, we have common formats:
>   Binary64, S.E11.F52, Common Use
>   Binary32, S.E8.F23, Common Use
>   Binary16, S.E5.F10, Less Common Use
>
> But, things get funky below this:
>   A-Law: S.E3.F4 (Bias=8)
>   FP8:   S.E4.F3 (Bias=7) (E4M3 in NVIDIA terms)
>   FP8U:  E4.F4 (Bias=7)
>   FP8S:  E4.F3.S (Bias=7)
>
> Semi-absent in my case:
>   BFloat16: S.E8.F7
>     Can be faked in software in my case using Shuffle ops.
>   NVIDIA E5M2 (S.E5.F2)
>     Could be faked using RGBA32 pack/unpack ops.

So, you have identified the problem:: 8 bits contain insufficient
exponent and fraction width to be considered a standard format. Thus,
in order to utilize 8-bit FP, one needs several incarnations. This
just points back at the problem:: FP needs at least 10 bits.

> No immediate plans to add these latter cases as (usually) I have a
> need for more precision rather than more exponent range. The main
> seeming merit of these formats is that they are truncated forms of
> the wider formats.
>
> No need to elaborate on the use-cases for Binary32 and Binary64:
> wide and varied.

There is a growing clamor for 128-bit FP, too.

> Binary16 is useful for graphics, probably, and audio processing.

Insufficient data width, as high-quality Audio has gone to 24 bits
(~120 dB S/N). You can call MP3 and other "phone" formats Audio, but
please refrain from using the term High Quality when doing so.

> Seemingly IEEE specifies it mostly for storage and not for
> computation, but for these cases it is good enough for computation
> as well.
>
> Binary16 is mostly sufficient for 3D model geometry, and for small
> 3D scenes, but not really for 3D computations or larger scenes
> (using it for transform or projection matrices or matrix multiply
> does not give acceptable results).
>
> It does work well for fast sin/cos lookup tables (if supported
> natively), say, because the error of storing an angle as 1/256 of a
> circle is larger than the error introduced by the 10-bit mantissa.
>
> I had also used it as the computational format in a lot of my
> neural-net experiments.

I have seen NNs use compressed FP formats where 0.0 uses 1 bit and
1.0 uses but 2 bits. ...

> The 8-bit formats get a bit more niche; the main use-cases are
> mostly to save memory.

Sometimes power, also.

> FP8S originally exists because it was cheaper to encode/decode
> alongside FP8U, vs traditional FP8. Originally, FP8S replaced FP8,
> but now FP8 has been re-added. I couldn't simply replace FP8S with
> FP8, partly as it seems my existing binaries depend on FP8S in a
> few places, and so simply replacing it would have broken them.
>
> So, the options were to either add some separate ops for FP8, or
> just live with using my wonky/non-standard FP8S format (or break my
> existing binaries). Ended up deciding to re-add FP8.

Or don't do it that way.

> FP8 is apparently used by NVIDIA GPUs, and also apparently by
> PyTorch and a few other things. The variant used in my case is
> seemingly fairly similar to that used by NVIDIA and PyTorch.

If you are going to do an F8, make it compatible with OpenGL.

> Unlike the minifloat format described on Wikipedia (which had
> defined it as following IEEE 754 rules), it differs from IEEE rules
> in the handling of large and small values: no separate Inf/NaN
> range, rather the largest value serves as an implicit combined
> Inf/NaN, with the smallest value understood as 0.
>
> The main difference here between FP8 and FP8S is the location of
> the sign bit (putting it in the LSB initially allowed avoiding some
> MUX'ing when paired with FP8U).
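For concreteness, the decode side can be sketched in C. This is a
sketch from the description as quoted (not BGB's actual code), and
the handling of the nonzero e==0 encodings is an assumption:

#include <math.h>
#include <stdint.h>

/* FP8 (S.E4.F3, Bias=7), sign in the MSB.  Assumptions: the
   all-zero encoding is 0, the top encoding (e==15, m==7) is the
   combined Inf/NaN described above, and the rest of the e==0 band
   decodes as normal values (no subnormals). */
float fp8_decode(uint8_t v)
{
    int s = (v >> 7) & 1;
    int e = (v >> 3) & 15;
    int m = v & 7;
    if (e == 0 && m == 0)
        return s ? -0.0f : 0.0f;
    if (e == 15 && m == 7)
        return NAN;                 /* combined Inf/NaN */
    float f = (1.0f + m / 8.0f) * ldexpf(1.0f, e - 7);
    return s ? -f : f;
}

/* FP8S (E4.F3.S) is the same format with the sign moved to the
   LSB, so its decoder is the FP8 decoder behind a bit reorder. */
float fp8s_decode(uint8_t v)
{
    return fp8_decode((uint8_t)(((v & 1) << 7) | (v >> 1)));
}

In hardware that reorder is just wires, which is presumably why the
sign-bit position was cheap to play with in the first place.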
> The re-added FP8 was instead overlapped with the unpack logic used
> for A-Law (even with the obvious difference...).
>
> The encoder-side logic for FP8 can be implemented by taking the
> FP8S output and reordering the bits (in an "assign"). Though, doing
> this on the decoder input side would not likely have saved anything
> (attempts to MUX on the input side seemingly tend to result in
> duplicating any LUTs that follow afterwards).
>
> Though, one could almost argue for combining all 4 cases into
> shared encoder/decoder modules (well, since 3 of the 4 formats have
> the mantissa and exponent bits in the same place, FP8 being the odd
> one out, and A-Law being off-by-1 in terms of Bias).

That combination is well served with a single 10-bit FP format.

> This appears to be similar to what NV and PyTorch use, and it also
> overlaps with my handling of A-Law (though, the largest possible
> value of A-Law is understood as ~0.969).
>
> A-Law has slightly higher precision, but is normally limited to
> unit range. The main use-case is in representing audio, but it was
> sometimes also used when a small unit-range format was needed and
> precision wasn't a priority.
>
> For example, with slight fudging, it can be used to store unit
> quaternions, among other things. It is basically accurate enough to
> store things like object orientations and 3D camera rotations.
> Though, generally, it is necessary to normalize the quaternion
> after unpacking it.
>
> Ironically, for A-Law, my implementation and typical use differ
> from how it is usually stored in WAV files, in that in WAV files it
> is generally XOR'ed with 0x55, but this is an easy enough fix when
> loading audio data or similar.
>
> There is also u-Law, but u-Law isn't really a minifloat format.
>
> These formats can also be used for pixel data, though FP8U often
> made more sense for RGBA values (generally, negative RGBA isn't
> really a thing).
>
> However, pixel values may go outside unit range, so A-Law doesn't
> work for HDR pixel data. The use of FP8 or FP8S works, but gives
> lower quality than FP8U. Here, FP8U gives slightly better quality
> than RGB555 over the LDR range, whereas FP8 or FP8S is slightly
> worse for bright values (1 bit less accuracy between 0.5 and 1.0).
>
> For normal bitmap graphics, I am mostly using RGB555 at present,
> though.
>
> There isn't yet a fast conversion path between RGB555 and
> floating-point formats, but, say:
>   RGB5UPCK64  //Unpack RGB555 to 4x WORD
>   PCVTUW2H    //Packed Word to Half (1.0 .. 2.0)
>   PADD.H      //To adjust DC bias to 0.0 .. 1.0
>   ? PSTCM8UH  //to FP8U (typical option for HDR RGBA pixel data)
>   ? PSTCF8H   //to FP8 (newly added)
>
> But, the crufty Word<->Half SIMD conversions exist mostly because
> it would have been more expensive to support "better" SIMD
> converters (the DC bias offset allowed doing the conversions via
> repacking the bits, whereas unit-range conversions would have
> required the more expensive path of adding the format conversion
> logic to the SIMD FADD units).
>
> Note that most of the SIMD format converters exist as applied use
> of bit-twiddling (and generally no rounding or similar, as rounding
> would add considerable amounts of cost here...).
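As a scalar C sketch of that repack (my reading of the description;
the channel expansion is a guess at what RGB5UPCK64 produces, so
treat the details as hypothetical):

#include <stdint.h>

/* Hypothetical RGB5UPCK64-style channel expansion: widen a 5-bit
   RGB555 channel to a 16-bit word by shifting it to the top. */
static uint16_t chan5_to_word(uint8_t c5)
{
    return (uint16_t)(c5 & 31) << 11;
}

/* The Word->Half repack: drop the top 10 bits of the word into a
   binary16 mantissa under a fixed exponent field of 15 (0x3C00,
   i.e. 2^0), so a word w maps to 1.0 + w/65536 in [1.0, 2.0).
   Truncating, no rounding, per the description above.  The PADD.H
   of -1.0 afterwards is an ordinary packed FP16 add that removes
   the DC bias. */
static uint16_t word_to_half_biased(uint16_t w)
{
    return (uint16_t)(0x3C00 | (w >> 6));
}

For c5 == 31 this gives 1.96875, i.e. 0.96875 once the bias is
removed; coincidentally or not, the same 31/32 as the ~0.969 A-Law
ceiling mentioned above.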
> Though, cases needing fast conversion of pixel data between RGB555
> and floating-point forms have been uncommon (most pixel math
> starting from RGB555 tends to remain on the integer side of
> things).
>
> If TKRA-GL were using HDR, the most likely option here is:

========== REMAINDER OF ARTICLE TRUNCATED ==========