From: BGB
Newsgroups: comp.arch
Subject: Re: Misc: Applications of small floating point formats.
Date: Sat, 3 Aug 2024 14:16:30 -0500

On 8/3/2024 4:40 AM, Terje Mathisen wrote:
> MitchAlsup1 wrote:
>> On Wed, 31 Jul 2024 23:31:35 +0000, BGB wrote:
>>
>>> So, say, we have common formats:
>>>    Binary64, S.E11.F52, Common Use
>>>    Binary32, S.E8.F23, Common Use
>>>    Binary16, S.E5.F10, Less Common Use
>>>
>>> But, things get funky below this:
>>>    A-Law: S.E3.F4 (Bias=8)
>>>    FP8: S.E4.F3 (Bias=7) (E4M3 in NVIDIA terms)
>>>    FP8U: E4.F4 (Bias=7)
>>>    FP8S: E4.F3.S (Bias=7)
>>>
>>> Semi-absent in my case:
>>>    BFloat16: S.E8.F7
>>>      Can be faked in software in my case using Shuffle ops.
>>>    NVIDIA E5M2 (S.E5.F2)
>>>      Could be faked using RGBA32 pack/unpack ops.
>>
>> So, you have identified the problem:: 8-bits contains insufficient
>> exponent and fraction widths to be considered standard format.
>> Thus, in order to utilize 8-bit FP one needs several incarnations.
>> This just points back at the problem:: FP needs at least 10 bits.
>
> I agree that fp10 is probably the shortest sane/useful version, but
> 1:3:4 does in fact contain enough exponent and mantissa bits to be
> considered an ieee754 format.
>
> 3 exp bits means that you have 6 steps for regular/normal numbers,
> which is enough to give some range.
>
> 4 mantissa bits (with hidden bit of course) handles
> zero/subnormal/normal/infinity/qnan/snan.
>
> Afair the absolute limit is two mantissa bits in order to
> differentiate between Inf/QNaN and SNaN, as well as two exp bits,
> so fp5 (1:2:2)

Though, 1:3:4 is basically A-Law, except that this format usually
lacks both Inf/NaN and denormals, and is usually understood as
encoding either a unit-range value or an integer value (when used
for PCM).

One could use it with a bias of 4 rather than 8, giving:
  E=7, 8.000 .. 15.500
  E=6, 4.000 ..  7.750
  E=5, 2.000 ..  3.875
  E=4, 1.000 ..  1.938
  E=3, 0.500 ..  0.969
  E=2, 0.250 ..  0.484
  E=1, 0.125 ..  0.242
  E=0, 0.063 ..  0.121
Albeit interpreting 0x00 as 0.000.

Or, with a bias of 5:
  E=7, 4.000 .. 7.750
  ...
  E=1, 0.063 .. 0.121
  E=0, 0.032 .. 0.061
Which would allow it to cover the same dynamic range as RGB555
within the unit range.

Though, the plan for HDR in my case was to use FP8U:
  E4.F4, Bias=7 (positive values only; negative clamps to 0)
Which, over a given dynamic range, gives quality comparable to
RGB555.

This potentially also allows mostly using a rendering path similar
to the one used for LDR RGBA32/RGBA8888, just pulling tricks in a
few places (such as in the blending operations).

Though, interpolating floating-point values as integers does result
in an S-curve distortion that gets more significant the further
apart the values are (still TBD whether this would be acceptable in
a visual sense).
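To make the FP8U interpretation a bit more concrete, a rough C sketch
of conversion helpers (the names fp8u_to_float/float_to_fp8u are made
up here, and treating E=0 as a denormal range with 0x00 = 0.0 is my
own assumption; the description above does not pin this down):

#include <math.h>

/* Decode FP8U (E4.F4, Bias=7, unsigned) to single precision.
   Assumes E=0 holds a denormal range, with 0x00 decoding to 0.0. */
static float fp8u_to_float(unsigned char v)
{
    int e = (v >> 4) & 15;    /* 4-bit exponent */
    int f = v & 15;           /* 4-bit fraction */
    if (e == 0)
        return ldexpf((float)f, -10);        /* (f/16) * 2^(1-7) */
    return ldexpf((float)(16 + f), e - 11);  /* (1+f/16) * 2^(e-7) */
}

/* Encode single precision to FP8U, round-to-nearest.
   Zero, negative, and NaN inputs clamp to 0, per the description. */
static unsigned char float_to_fp8u(float x)
{
    int e, f, be;
    float m;
    if (!(x > 0.0f))
        return 0x00;
    if (x >= 496.0f)
        return 0xFF;          /* max value: (1+15/16) * 2^(15-7) */
    m = frexpf(x, &e);        /* x = m * 2^e, with m in [0.5, 1) */
    be = (e - 1) + 7;         /* biased exponent for the 1.f form */
    if (be < 1) {
        f = (int)(ldexpf(x, 10) + 0.5f);     /* denormal fraction */
        return (f > 15) ? 0x10 : (unsigned char)f;
    }
    f = (int)((2.0f * m - 1.0f) * 16.0f + 0.5f);
    if (f > 15) { f = 0; be++; }  /* rounding carried into exponent */
    return (unsigned char)((be << 4) | f);
}

With these definitions, 1.0f encodes to 0x70, and 0xFF decodes to
496.0 (the top of the representable range).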
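The blending tricks mentioned above would then boil down to something
like the following (a software reference for the low-precision
A*B+C*D operation discussed below; an actual HW version would
presumably work directly on the packed exponent/fraction fields
rather than going through single precision):

/* Hypothetical software reference for an FP8U blend stage:
   decode, do A*B+C*D in single precision, re-encode.
   Reuses the fp8u_to_float/float_to_fp8u sketches above. */
static unsigned char fp8u_blend(unsigned char a, unsigned char b,
                                unsigned char c, unsigned char d)
{
    return float_to_fp8u(fp8u_to_float(a) * fp8u_to_float(b) +
                         fp8u_to_float(c) * fp8u_to_float(d));
}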
Still need to evaluate the cost of adding FP8U blend operators to the
HW module (though, ironically, these would be FP8U values expressed
within 16-bit fixed-point numbers). Will likely need to devise a
module that basically tries to quickly do a low-precision A*B+C*D
operation, as in the sketch above. I think I may have a few cycles to
play with, as generally one needs to give a clock edge for the DSP48
to do its thing.

Though, for "good quality" HDR rendering, one would generally need
Binary16 or similar (and a floating-point pathway). OTOH, whether I
could implement a GLSL compiler that gave acceptable performance is
unknown.

One intermediate possibility could be, rather than using GLSL or BJX2
assembly, making a sort of crude translator that converts ARB
assembly into BJX2 machine code. This would be mildly inconvenient in
that the operations would likely need to translate between
fixed-point and Binary16, which would require a type system to keep
track of which is which.

Either way, the use of shaders would need to fall back to the
software-rasterization path (possibly slotting the shader function in
place of the Blend operator). Where, generally, TKRA-GL has combined
both the Source and Destination blend operators into a single
function pointer.

If I were to implement shaders and a GLSL compiler, it could jump
from ~GL 1.2/1.3 territory up to GL 2.x. Some other 2.x features,
like occlusion queries, have already been implemented (but the shader
compiler is the hard part in this case).

Though, ironically, if it supported shaders, and the shaders "didn't
suck", it would be ahead of both of my laptops:
  2003 laptop: No shader support;
  2009 laptop: Shaders can be enabled in the driver, but shader
    performance is unusable (immediately drops to a slide-show).

The 2009 laptop was a motivating factor in my first foray into
software-rendered OpenGL, as, ironically, on a 2.1 GHz Intel Core
based CPU, it was not *that* much slower than the HW renderer.

The two laptops did have a commonality: both could run Half-Life, and
both fell on their face for anything much newer (though seemingly for
different reasons).

Though, it seems like the 2003 laptop could be faster than it is. I
suspect it may be held back by the RAM: on some tests, the factor by
which its performance beats the BJX2 core is closer to the ratio of
memory bandwidths than to the ratio of clock speeds.

Side note: a lot of this is based on information "from memory", so no
claims about accuracy.

The 2009 laptop has only 50% more MHz, but runs circles around the
older laptop in terms of CPU-side performance (being hindered mostly
by its seemingly terrible integrated GPU). Like, while the CPU is 50%
faster, the RAM seems to be around 5x faster (~2 GB/s memcpy vs
~400 MB/s).

This loosely lines up with the stats on the RAM modules:
  BJX2 Core  : DDR2, 16-bit,  50 MHz, ~   55 MB/s
  2003 Laptop: DDR1, 64-bit, 100 MHz, ~  400 MB/s
  2009 Laptop: DDR3, 64-bit, 667 MHz, ~ 2000 MB/s

One would expect an 8x ratio between the BJX2 core and the 2003
laptop; observation seems closer to 7x. Both laptops have 2 DIMMs,
but performance seems to match what one would expect from a single
DIMM.

Observations tend to undershoot the theoretical bandwidth: in memcpy
tests, I don't usually see much more than around 1/4 of the
theoretical number. The theoretical limit should be 50% for memcpy
(each copied byte crosses the bus twice, once read and once written)
and 100% for memset; the observation is typically 25% for memcpy and
50% for memset.
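Where, for reference, the memcpy numbers here are the sort of thing
one gets from a quick-and-dirty test loop along these lines (buffer
size and rep count are invented for the example; the actual tests may
have differed):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void)
{
    size_t sz = (size_t)16 << 20;   /* 16MB, bigger than the caches */
    int reps = 64, i;
    unsigned char *src = malloc(sz);
    unsigned char *dst = malloc(sz);
    clock_t t0, t1;
    double secs;

    if (!src || !dst)
        return 1;
    memset(src, 0x55, sz);          /* touch the pages up front */
    memset(dst, 0xAA, sz);

    t0 = clock();
    for (i = 0; i < reps; i++)
        memcpy(dst, src, sz);
    t1 = clock();

    secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    printf("memcpy: ~%.1f MB/s\n",
           ((double)sz * reps) / (secs * 1048576.0));
    free(src);
    free(dst);
    return 0;
}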
Well, except on the BJX2 core, where it seems to be higher than the
theoretical estimate (may be an issue with the measurement), and gets

========== REMAINDER OF ARTICLE TRUNCATED ==========