Subject : Re: Misc: Applications of small floating point formats.
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.arch
Date : 03. Aug 2024, 21:04:29
Other headers
Organisation : A noiseless patient Spider
Message-ID : <v8m2gj$3jr42$1@dont-email.me>
References : 1 2 3
User-Agent : Mozilla Thunderbird
On 8/3/2024 4:47 AM, Terje Mathisen wrote:
Lawrence D'Oliveiro wrote:
On Wed, 31 Jul 2024 18:31:35 -0500, BGB wrote:
>
Binary16 is useful for graphics and audio processing.
>
The common format for CG work is OpenEXR, and that allows for 32-bit and
even 64-bit floats, per pixel component. So for example R-G-B-Alpha is 4
components.
>
The 8-bit formats get a bit more niche; main use-cases mostly to save
memory.
>
Heavily used in AI work.
The nicest property of fp8, as seen from a GPU's point of view, is that arbitrary operations can be treated as texture-map lookups. I don't think that's how they are implemented, but an 8x8->16 FMUL would only need a few very small lookup tables, probably doable even on a regular CPU with 16-element permute operations.
At least for a 3-bit mantissa on FPGA, you can also stick the multiply directly into LUT6 lookups.
The widening FP8*FP8 -> FP16 SIMD multiply was cheap enough to seemingly "disappear in the noise" (can't easily check its LUT cost, as there is no obvious change in LUT cost for the FPGA).
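As a rough software illustration of why the 3-bit-mantissa case is so cheap: the two mantissa fields together are only 6 bits, so the whole mantissa product fits in a 64-entry table (one LUT6 per output bit on the FPGA side). The sketch below assumes an S.E4.M3 layout with bias 7, ignores exponent-field 0, treats every other code as a normal number, and turns overflow into Inf; those conventions and the function names are assumptions for the example, not necessarily what the actual hardware does.

```python
import struct

# 64-entry table indexed by the two 3-bit mantissa fields; each entry is the
# 8-bit product of the two mantissas with their hidden bits restored (64..225).
MANT_LUT = [(8 + ma) * (8 + mb) for ma in range(8) for mb in range(8)]

def fp8_to_float(x):
    """Decode an assumed S.E4.M3 (bias 7) code; exponent field 0 not handled."""
    s, e, m = x >> 7, (x >> 3) & 15, x & 7
    return (-1.0 if s else 1.0) * (1 + m / 8) * 2.0 ** (e - 7)

def half_bits_to_float(h):
    """Decode Binary16 bits using the struct module's 'e' format."""
    return struct.unpack('<e', struct.pack('<H', h))[0]

def fp8_mul_widen(a, b):
    """Multiply two normal FP8 codes, returning exact Binary16 bits."""
    sign = (a ^ b) >> 7
    # Rebias: (ea - 7) + (eb - 7) + 15 = ea + eb + 1
    exp = ((a >> 3) & 15) + ((b >> 3) & 15) + 1
    prod = MANT_LUT[((a & 7) << 3) | (b & 7)]
    if prod >= 128:                  # product in [2, 4): renormalize
        exp += 1
        frac = (prod & 127) << 3     # 7 fraction bits into a 10-bit field
    else:                            # product in [1, 2)
        frac = (prod & 63) << 4      # 6 fraction bits into a 10-bit field
    if exp >= 31:                    # too large for Binary16: return Inf
        return (sign << 15) | 0x7C00
    return (sign << 15) | (exp << 10) | frac
```

Since the product of two 4-bit significands needs at most 8 bits, the Binary16 result is exact and no rounding step is needed; only the sign XOR, the exponent add, and a one-bit normalize remain outside the table.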
My estimated cost (very crude) would be around 8 LUTs and 1 or 2 CARRY4s per FP8 operation (with 4 operations in a SIMD vector), likely with more LUTs going to signal routing than to actually calculating the value.
It would have been higher with a 4-bit mantissa, though: the core operator would likely need around 18 LUTs and 2 or 3 CARRY4s for the mantissa, plus some additional LUTs and CARRY4s for the exponent and to compose the final result. Multiple strategies exist; this estimate assumes breaking the multiply into 2x2-bit pieces.
Another strategy could be splitting off the high 3x3 multiply and using some conditional adders for the low-order results (though one could truncate the low-order results, giving maybe 8 LUTs and 2 CARRY4s for the mantissa).
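For reference, the 2x2-bit decomposition can be sketched in software as below. This only models the arithmetic (four small partial products, shifted and summed), not the LUT/CARRY4 mapping; it treats the operands as plain 4-bit unsigned values, and the names are made up for the example.

```python
# 16-entry table for a 2-bit x 2-bit multiply (a 4-input function, so each
# output bit would fit comfortably in a single FPGA LUT).
LUT2X2 = [x * y for x in range(4) for y in range(4)]

def mul2(x, y):
    """2x2-bit multiply via table lookup."""
    return LUT2X2[(x << 2) | y]

def mul4x4_via_2x2(a, b):
    """4x4-bit multiply assembled from four 2x2-bit partial products."""
    ah, al = a >> 2, a & 3
    bh, bl = b >> 2, b & 3
    return ((mul2(ah, bh) << 4)
            + ((mul2(ah, bl) + mul2(al, bh)) << 2)
            + mul2(al, bl))
```

The shifted additions are where the carry chains would come in on the hardware side.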
Here, one could do a lookup for the LSB of each mantissa multiplied by the high 2 bits of the other, with the lookup table holding the sum of these two partial products. This is then added to the result of the 3x3 lookup.
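One possible reading of this scheme, as a sketch: a 64-entry table for the high 3x3 multiply, plus a second 6-input table (the two LSBs and the top 2 bits of each mantissa) holding the sum of the two cross partial products. The bit positions and names below are assumptions for illustration. Under this reading the low-order terms are truncated, so the result can fall short of the exact product by up to 5.

```python
# High part: product of the top 3 bits of each 4-bit operand.
HIGH_LUT = [x * y for x in range(8) for y in range(8)]

# Low part: 6 inputs (a0, b0, top 2 bits of a, top 2 bits of b); each entry
# holds the sum of the two cross terms a0*b_hi + b0*a_hi.
LOW_LUT = [0] * 64
for a0 in range(2):
    for b0 in range(2):
        for a_hi in range(4):
            for b_hi in range(4):
                idx = (a0 << 5) | (b0 << 4) | (a_hi << 2) | b_hi
                LOW_LUT[idx] = a0 * b_hi + b0 * a_hi

def mul4x4_approx(a, b):
    """Approximate 4x4-bit multiply: 3x3 lookup plus truncated cross terms."""
    high = HIGH_LUT[((a >> 1) << 3) | (b >> 1)]
    low = LOW_LUT[((a & 1) << 5) | ((b & 1) << 4) | ((a >> 2) << 2) | (b >> 2)]
    # Both tables produce values with weight 4, so they share one shift.
    return (high + low) << 2
```

The dropped terms are 2*(a0*b1 + b0*a1) + a0*b0, which is why the result is exact whenever both LSBs are 0.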
If the result is intended to be FP8U though (rather than Binary16), could save cost:
One would merely need to ADD or OR the AND of the two mantissa LSBs into the LSB of the result. While an ADD is theoretically needed, in the cases checked where both MSBs are 1 the output LSB seemingly comes out as 0, so a CARRY4 may not be needed here.
In software on a CPU, one can do FMUL and FADD (for the full operation) with 64K lookup tables (or 128K if widening to Binary16).
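A minimal sketch of the table-driven software approach, with one byte per entry and 256 x 256 = 64K entries per operation. The format conventions here are assumptions for the example only (S.E4.M3, bias 7, magnitude bits all zero meaning zero, no subnormals, no Inf/NaN, saturating on overflow, round to nearest), as are the helper names.

```python
import bisect
import operator

def decode(x):
    """Decode an assumed S.E4.M3 code (bias 7, no subnormals, no Inf/NaN)."""
    if (x & 0x7F) == 0:              # assumption: zero magnitude bits = zero
        return 0.0
    s, e, m = x >> 7, (x >> 3) & 15, x & 7
    return (-1.0 if s else 1.0) * (1 + m / 8) * 2.0 ** (e - 7)

# Magnitudes of the 128 non-negative codes, in increasing order.
MAGS = [decode(c) for c in range(128)]
VALUES = [decode(c) for c in range(256)]

def encode(v):
    """Round to the nearest representable code, saturating at the maximum."""
    sign = 0x80 if v < 0 else 0
    mag = abs(v)
    i = bisect.bisect_left(MAGS, mag)
    if i >= len(MAGS):
        return sign | 0x7F
    if i > 0 and mag - MAGS[i - 1] <= MAGS[i] - mag:
        i -= 1
    return sign | i

def build_table(op):
    """Precompute op(a, b) for every pair of 8-bit codes (64K entries)."""
    return bytes(encode(op(VALUES[a], VALUES[b]))
                 for a in range(256) for b in range(256))

MUL_TABLE = build_table(operator.mul)
ADD_TABLE = build_table(operator.add)

def fmul(a, b):
    return MUL_TABLE[(a << 8) | b]

def fadd(a, b):
    return ADD_TABLE[(a << 8) | b]
```

With 8-bit results each table is 64K bytes; a widening variant would store 16-bit Binary16 results instead, doubling that to 128K bytes.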
Terje