From: BGB
Newsgroups: comp.arch
Subject: Re: Misc: Applications of small floating point formats.
Date: Sat, 3 Aug 2024 14:16:30 -0500

On 8/3/2024 4:40 AM, Terje Mathisen wrote:
> MitchAlsup1 wrote:
>> On Wed, 31 Jul 2024 23:31:35 +0000, BGB wrote:
>>
>>> So, say, we have common formats:
>>>    Binary64, S.E11.F52, Common Use
>>>    Binary32, S.E8.F23, Common Use
>>>    Binary16, S.E5.F10, Less Common Use
>>>
>>> But, things get funky below this:
>>>    A-Law: S.E3.F4 (Bias=8)
>>>    FP8: S.E4.F3 (Bias=7) (E4M3 in NVIDIA terms)
>>>    FP8U: E4.F4 (Bias=7)
>>>    FP8S: E4.F3.S (Bias=7)
>>>
>>> Semi-absent in my case:
>>>    BFloat16: S.E8.F7
>>>      Can be faked in software in my case using Shuffle ops.
>>>    NVIDIA E5M2 (S.E5.F2)
>>>      Could be faked using RGBA32 pack/unpack ops.
>>
>> So, you have identified the problem:: 8-bits contains insufficient
>> exponent and fraction widths to be considered standard format.
>> Thus, in order to utilize 8-bit FP one needs several incarnations.
>> This just points back at the problem:: FP needs at least 10 bits.
>
> I agree that fp10 is probably the shortest sane/useful version, but
> 1:3:4 does in fact contain enough exponent and mantissa bits to be
> considered an ieee754 format.
>
> 3 exp bits means that you have 6 steps for regular/normal numbers,
> which is enough to give some range.
>
> 4 mantissa bits (with hidden bit of course) handles
> zero/subnormal/normal/infinity/qnan/snan.
>
> Afair the absolute limit is two mantissa bits in order to
> differentiate between Inf/QNaN and SNaN, as well as two exp bits,
> so fp5 (1:2:2)

Though, 1:3:4 is basically A-Law, except that this format usually
lacks both Inf/NaN and denormals, and is usually understood as
encoding either a unit-range value or an integer value (when used
for PCM).

One could use it with a bias of 4 rather than 8, giving:
  E=7, 8.000 .. 15.500
  E=6, 4.000 ..  7.750
  E=5, 2.000 ..  3.875
  E=4, 1.000 ..  1.938
  E=3, 0.500 ..  0.969
  E=2, 0.250 ..  0.484
  E=1, 0.125 ..  0.242
  E=0, 0.063 ..  0.121
Albeit interpreting 0x00 as 0.000.

Or, with a bias of 5:
  E=7, 4.000 .. 7.750
  ...
  E=1, 0.063 .. 0.121
  E=0, 0.032 .. 0.061
Which would allow it to cover the same dynamic range as RGB555
within the unit range.

Though, the plan for HDR in my case was to use FP8U:
  E4.F4, Bias=7 (positive values only; negative clamps to 0)
Which, over a given dynamic range, gives quality comparable to
RGB555.

This potentially also allows mostly using a rendering path similar
to the one used for LDR RGBA32/RGBA8888, just pulling tricks in a
few places (such as in the blending operations).

Though, interpolating floating-point values as integers does result
in an S-curve distortion that gets more significant the further
apart the values are (still TBD whether this would be acceptable in
a visual sense).
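To make the FP8U interpretation a bit more concrete, a rough C sketch
of conversion helpers (the names fp8u_to_float/float_to_fp8u are made
up here, and treating E=0 as a denormal range with 0x00 = 0.0 is my
own assumption; the description above does not pin this down):

#include <math.h>

/* Decode FP8U (E4.F4, Bias=7, unsigned) to single precision.
   Assumes E=0 holds a denormal range, with 0x00 decoding to 0.0. */
static float fp8u_to_float(unsigned char v)
{
    int e = (v >> 4) & 15;    /* 4-bit exponent */
    int f = v & 15;           /* 4-bit fraction */
    if (e == 0)
        return ldexpf((float)f, -10);        /* (f/16) * 2^(1-7) */
    return ldexpf((float)(16 + f), e - 11);  /* (1+f/16) * 2^(e-7) */
}

/* Encode single precision to FP8U, round-to-nearest.
   Zero, negative, and NaN inputs clamp to 0, per the description. */
static unsigned char float_to_fp8u(float x)
{
    int e, f, be;
    float m;
    if (!(x > 0.0f))
        return 0x00;
    if (x >= 496.0f)
        return 0xFF;          /* max value: (1+15/16) * 2^(15-7) */
    m = frexpf(x, &e);        /* x = m * 2^e, with m in [0.5, 1) */
    be = (e - 1) + 7;         /* biased exponent for the 1.f form */
    if (be < 1) {
        f = (int)(ldexpf(x, 10) + 0.5f);     /* denormal fraction */
        return (f > 15) ? 0x10 : (unsigned char)f;
    }
    f = (int)((2.0f * m - 1.0f) * 16.0f + 0.5f);
    if (f > 15) { f = 0; be++; }  /* rounding carried into exponent */
    return (unsigned char)((be << 4) | f);
}

With these definitions, 1.0f encodes to 0x70, and 0xFF decodes to
496.0 (the top of the representable range).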
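The blending tricks mentioned above would then boil down to something
like the following (a software reference for the low-precision
A*B+C*D operation discussed below; an actual HW version would
presumably work directly on the packed exponent/fraction fields
rather than going through single precision):

/* Hypothetical software reference for an FP8U blend stage:
   decode, do A*B+C*D in single precision, re-encode.
   Reuses the fp8u_to_float/float_to_fp8u sketches above. */
static unsigned char fp8u_blend(unsigned char a, unsigned char b,
                                unsigned char c, unsigned char d)
{
    return float_to_fp8u(fp8u_to_float(a) * fp8u_to_float(b) +
                         fp8u_to_float(c) * fp8u_to_float(d));
}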
Still need to evaluate the cost of adding FP8U blend operators to the
HW module (though, ironically, these would be FP8U values expressed
within 16-bit fixed-point numbers). Will likely need to devise a
module that basically tries to quickly do a low-precision A*B+C*D
operation, as in the sketch above. I think I may have a few cycles to
play with, as generally one needs to give a clock edge for the DSP48
to do its thing.

Though, for "good quality" HDR rendering, one would generally need
Binary16 or similar (and a floating-point pathway). OTOH, whether I
could implement a GLSL compiler that gave acceptable performance is
unknown.

One intermediate possibility could be, rather than using GLSL or BJX2
assembly, making a sort of crude translator that converts ARB
assembly into BJX2 machine code. This would be mildly inconvenient in
that the operations would likely need to translate between
fixed-point and Binary16, which would require a type system to keep
track of which is which.

Either way, the use of shaders would need to fall back to the
software-rasterization path (possibly slotting the shader function in
place of the Blend operator). Where, generally, TKRA-GL has combined
both the Source and Destination blend operators into a single
function pointer.

If I were to implement shaders and a GLSL compiler, it could jump
from ~GL 1.2/1.3 territory up to GL 2.x. Some other 2.x features,
like occlusion queries, have already been implemented (but the shader
compiler is the hard part in this case).

Though, ironically, if it supported shaders, and the shaders "didn't
suck", it would be ahead of both of my laptops:
  2003 laptop: No shader support;
  2009 laptop: Shaders can be enabled in the driver, but shader
    performance is unusable (immediately drops to a slide-show).

The 2009 laptop was a motivating factor in my first foray into
software-rendered OpenGL, as, ironically, on a 2.1 GHz Intel Core
based CPU, it was not *that* much slower than the HW renderer.

The two laptops did have a commonality: both could run Half-Life, and
both fell on their face for anything much newer (though seemingly for
different reasons).

Though, it seems like the 2003 laptop could be faster than it is. I
suspect it may be held back by the RAM: on some tests, the factor by
which its performance beats the BJX2 core is closer to the ratio of
memory bandwidths than to the ratio of clock speeds.

Side note: a lot of this is based on information "from memory", so no
claims about accuracy.

The 2009 laptop has only 50% more MHz, but runs circles around the
older laptop in terms of CPU-side performance (being hindered mostly
by its seemingly terrible integrated GPU). Like, while the CPU is 50%
faster, the RAM seems to be around 5x faster (~2 GB/s memcpy vs
~400 MB/s).

This loosely lines up with the stats on the RAM modules:
  BJX2 Core  : DDR2, 16-bit,  50 MHz, ~   55 MB/s
  2003 Laptop: DDR1, 64-bit, 100 MHz, ~  400 MB/s
  2009 Laptop: DDR3, 64-bit, 667 MHz, ~ 2000 MB/s

One would expect an 8x ratio between the BJX2 core and the 2003
laptop; observation seems closer to 7x. Both laptops have 2 DIMMs,
but performance seems to match what one would expect from a single
DIMM.

Observations tend to undershoot the theoretical bandwidth: in memcpy
tests, I don't usually see much more than around 1/4 of the
theoretical number. The theoretical limit should be 50% for memcpy
(each copied byte crosses the bus twice, once read and once written)
and 100% for memset; the observation is typically 25% for memcpy and
50% for memset.
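Where, for reference, the memcpy numbers here are the sort of thing
one gets from a quick-and-dirty test loop along these lines (buffer
size and rep count are invented for the example; the actual tests may
have differed):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    size_t sz = (size_t)16 << 20;   /* 16MB, bigger than the caches */
    int reps = 64, i;
    unsigned char *src = malloc(sz);
    unsigned char *dst = malloc(sz);
    clock_t t0, t1;
    double secs;

    if (!src || !dst)
        return 1;
    memset(src, 0x55, sz);          /* touch the pages up front */
    memset(dst, 0xAA, sz);

    t0 = clock();
    for (i = 0; i < reps; i++)
        memcpy(dst, src, sz);
    t1 = clock();

    secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("memcpy: ~%.1f MB/s\n",
           ((double)sz * reps) / (secs * 1048576.0));
    free(src);
    free(dst);
    return 0;
}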
Well, except on the BJX2 core, where it seems to be higher than the
theoretical estimate (may be an issue with the measurement), and gets

========== REMAINDER OF ARTICLE TRUNCATED ==========