From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Misc: Applications of small floating point formats.
Date: Sat, 3 Aug 2024 15:04:29 -0500
Message-ID: <v8m2gj$3jr42$1@dont-email.me>
References: <v8ehgr$1q8sr$1@dont-email.me> <v8eloq$1qs1a$1@dont-email.me>
 <v8kuam$3d49g$1@dont-email.me>
In-Reply-To: <v8kuam$3d49g$1@dont-email.me>

On 8/3/2024 4:47 AM, Terje Mathisen wrote:
> Lawrence D'Oliveiro wrote:
>> On Wed, 31 Jul 2024 18:31:35 -0500, BGB wrote:
>>
>>> Binary16 is useful for graphics and audio processing.
>>
>> The common format for CG work is OpenEXR, and that allows for 32-bit
>> and even 64-bit floats, per pixel component. So for example
>> R-G-B-Alpha is 4 components.
>>
>>> The 8-bit formats get a bit more niche; main use-cases mostly to
>>> save memory.
>>
>> Heavily used in AI work.
>
> The nicest property of fp8, as seen from a GPU's point of view, is
> that arbitrary operations can be seen as texture-map lookups. I don't
> think that's how they are implemented, but an 8x8->16 FMUL would only
> need a few very small lookup tables, probably doable even on a
> regular CPU with 16-element permute operations.
>

At least for a 3-bit mantissa on an FPGA, you can also stick the
multiply directly into LUT6 lookups.

The widening FP8*FP8 -> FP16 SIMD multiply was cheap enough to
seemingly "disappear into the noise" (can't easily check its LUT cost,
as adding it makes no obvious change in the FPGA's total LUT count).

My (very crude) estimate would be in the area of 8 LUTs and 1 or 2
CARRY4s per FP8 operation (with 4 in a SIMD vector), likely with more
LUTs going to signal routing than to actually calculating the value.

It would have been higher with a 4-bit mantissa, though: the core
operator would likely need around 18 LUTs and 2 or 3 CARRY4s for the
mantissa, with some additional LUTs and CARRY4s for the exponent and
for composing the final result.

Multiple strategies exist, but this assumes breaking the multiply into
2x2-bit pieces. Another strategy would be to split off the high 3x3
multiply and use conditional adders for the low-order partial products
(or simply truncate the low-order results, at maybe 8 LUTs and 2
CARRY4s for the mantissa). Here, one could do a lookup on the LSB of
each mantissa multiplied by the high 2 bits of the other, with the
lookup table holding the sum of these two partial products; this is
then added to the result of the 3x3 lookup.

If the result is intended to be FP8U (rather than Binary16), though,
cost can be saved: one would merely need to ADD or OR the AND of the
two mantissa LSBs into the LSB of the result. (While an ADD is
theoretically needed, in the cases checked it seems that whenever both
MSB bits are 1 the output LSB bit is 0, so a CARRY4 may not be needed
here.)

In software on a CPU, one can do FMUL and FADD (for the full
operation) with 64K lookup tables (or 128K if widening to Binary16).
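
A minimal C sketch of the table-driven software approach. This
assumes, purely for illustration, a hypothetical 1.4.3 FP8 layout
(sign, 4-bit exponent with bias 7, 3-bit mantissa); the actual
encodings in use may differ, and Inf/NaN handling is glossed over:

  #include <stdint.h>
  #include <math.h>

  /* Decode a hypothetical 1.4.3 FP8 value (bias 7) to float. */
  static float fp8_to_float(uint8_t v)
  {
      int   sgn = (v >> 7) & 1;
      int   exp = (v >> 3) & 15;
      int   man = v & 7;
      float f;
      if (exp == 0)
          f = ldexpf((float)man, 1 - 7 - 3);       /* subnormal */
      else
          f = ldexpf((float)(8 | man), exp - 7 - 3);
      return sgn ? -f : f;
  }

  /* Encode by brute force: nearest representable value wins
     (ties resolved arbitrarily; fine for table building). */
  static uint8_t float_to_fp8(float f)
  {
      uint8_t best  = 0;
      float   bestd = INFINITY;
      for (int i = 0; i < 256; i++) {
          float d = fabsf(fp8_to_float((uint8_t)i) - f);
          if (d < bestd) { bestd = d; best = (uint8_t)i; }
      }
      return best;
  }

  /* 64K table: product of every (a, b) pair, precomputed once. */
  static uint8_t fp8_mul_tab[256 * 256];

  static void fp8_mul_init(void)
  {
      for (int a = 0; a < 256; a++)
          for (int b = 0; b < 256; b++)
              fp8_mul_tab[(a << 8) | b] =
                  float_to_fp8(fp8_to_float((uint8_t)a) *
                               fp8_to_float((uint8_t)b));
  }

  /* The whole runtime FMUL: a single indexed load. */
  static inline uint8_t fp8_mul(uint8_t a, uint8_t b)
  {
      return fp8_mul_tab[((unsigned)a << 8) | b];
  }

An FADD table is built the same way, and a widening variant would
store Binary16 bit patterns in a uint16_t table (the 128K case).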
> Terje
>
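
For completeness, a sketch of the widening multiply itself, showing
why the hardware core is so small: with a 3-bit stored mantissa, the
whole thing reduces to a 4x4-bit product plus an exponent add, and the
result is exact (an 8-bit significand fits easily in Binary16's 11
bits). Same hypothetical 1.4.3 layout as above; subnormal inputs,
NaN/Inf, and exponent overflow into the Binary16 Inf range are not
handled:

  #include <stdint.h>

  /* Widening FP8*FP8 -> Binary16 multiply (hypothetical 1.4.3 inputs,
     bias 7; IEEE binary16 output, bias 15). Normal inputs only. */
  static uint16_t fp8_widen_mul(uint8_t a, uint8_t b)
  {
      unsigned sgn = ((a ^ b) >> 7) & 1;
      int      ea  = (a >> 3) & 15, eb = (b >> 3) & 15;
      unsigned ma  = 8 | (a & 7), mb = 8 | (b & 7); /* hidden bit */
      unsigned p   = ma * mb;      /* 4x4-bit product: 64..225 */
      int      e   = ea + eb + 1;  /* (ea-7)+(eb-7), rebiased by +15 */
      unsigned frac;

      if (p & 0x80) {              /* product in [2,4): renormalize */
          e += 1;
          frac = (p - 128) << 3;   /* 7 fraction bits -> 10, exact */
      } else {                     /* product in [1,2) */
          frac = (p - 64) << 4;    /* 6 fraction bits -> 10, exact */
      }
      return (uint16_t)((sgn << 15) | ((unsigned)e << 10) | frac);
  }

E.g., 1.5 encodes as 0x3C in this layout (exp=7, man=4), and
fp8_widen_mul(0x3C, 0x3C) gives 0x4080, which is 2.25 in Binary16.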