Misc: Applications of small floating point formats.

Subject : Misc: Applications of small floating point formats.
From : cr88192 (at) *nospam* gmail.com (BGB)
Newsgroups : comp.arch
Date : 01. Aug 2024, 01:31:35
Organization : A noiseless patient Spider
Message-ID : <v8ehgr$1q8sr$1@dont-email.me>
User-Agent : Mozilla Thunderbird
So, say, we have common formats:
   Binary64, S.E11.F52, Common Use
   Binary32, S.E8.F23, Common Use
   Binary16, S.E5.F10, Less Common Use
But, things get funky below this:
   A-Law: S.E3.F4 (Bias=8)
   FP8: S.E4.F3 (Bias=7) (E4M3 in NVIDIA terms)
   FP8U: E4.F4 (Bias=7)
   FP8S: E4.F3.S (Bias=7)
Semi-absent in my case:
   BFloat16: S.E8.F7
     Can be faked in software in my case using Shuffle ops.
   NVIDIA E5M2 (S.E5.F2)
     Could be faked using RGBA32 pack/unpack ops.
No immediate plans to add these latter cases, as I usually have more need for precision than for exponent range. The main apparent merit of these formats is that they are truncated forms of the wider formats.
No need to elaborate on the use-cases for Binary32 and Binary64, wide and varied.
Binary16 is useful for graphics and audio processing. IEEE seemingly specifies it mostly for storage rather than computation, but for these use-cases it is good enough for computation as well.
Binary16 is mostly sufficient for 3D model geometry and for small 3D scenes, but not really for 3D computations or larger scenes (using it for transform or projection matrices, or for matrix multiplies, does not give acceptable results).
It does work well for fast sin/cos lookup tables (if supported natively), since the error from storing an angle as 1/256 of a circle is larger than the error introduced by the 10-bit mantissa.
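
As a rough sketch of the idea in scalar C (f32_to_f16 here is a simplified truncating converter written just for illustration, not an actual API):

#include <math.h>
#include <stdint.h>
#include <string.h>

/* simplified float->Binary16: truncates, flushes tiny values to 0,
   clamps overflow; good enough for table values in [-1,1] */
static uint16_t f32_to_f16(float f)
{
    uint32_t b; memcpy(&b, &f, 4);
    uint32_t s = (b >> 16) & 0x8000;
    int e = (int)((b >> 23) & 0xFF) - 127 + 15;
    uint32_t m = (b >> 13) & 0x03FF;
    if (e <= 0)  return (uint16_t)s;              /* too small: 0 */
    if (e >= 31) return (uint16_t)(s | 0x7BFF);   /* clamp to max finite */
    return (uint16_t)(s | ((uint32_t)e << 10) | m);
}

static uint16_t sin_tab[256];   /* angle = 1/256 of a circle */

static void init_sin_tab(void)
{
    for (int i = 0; i < 256; i++)
        sin_tab[i] = f32_to_f16(sinf(i * (6.2831853f / 256.0f)));
}

static uint16_t sin_h(uint8_t angle) { return sin_tab[angle]; }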
I had also used it as the computational format in a lot of my neural-net experiments.
The 8-bit formats are a bit more niche; the main use-cases are mostly about saving memory.
FP8S originally existed because it was cheaper to encode/decode alongside FP8U than the traditional FP8 layout. FP8S had originally replaced FP8, but FP8 has now been re-added. I couldn't simply swap FP8S back out for FP8, as my existing binaries appear to depend on FP8S in a few places, so replacing it outright would have broken them.
So, the options were to either add separate ops for FP8, or just live with my wonky/non-standard FP8S format (or break the existing binaries). I ended up deciding to re-add FP8.
FP8 is apparently used by NVIDIA GPUs, and also by PyTorch and a few other things. The variant used in my case is seemingly fairly similar to the one used by NVIDIA and PyTorch.
Unlike the minifloat format described on Wikipedia (which defines it as following IEEE 754 rules), it deviates from IEEE rules in the handling of large and small values: there is no separate Inf/NaN range; rather, the largest encoding serves as an implicit combined Inf/NaN, and the smallest encoding is understood as 0.
The main difference between FP8 and FP8S is the location of the sign bit (putting it in the LSB initially allowed avoiding some MUX'ing when paired with FP8U).
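
In C terms, a scalar sketch of the decode side (my reading of the above rules, not the actual hardware logic; only the all-zero encoding is treated as 0, and the top encoding as the combined Inf/NaN):

#include <math.h>
#include <stdint.h>

/* FP8 (S.E4.F3, Bias=7) */
static float fp8_dec(uint8_t v)
{
    int s = v >> 7, e = (v >> 3) & 15, m = v & 7;
    if (e == 0 && m == 0) return s ? -0.0f : 0.0f;
    if (e == 15 && m == 7) return s ? -INFINITY : INFINITY;  /* Inf/NaN */
    float x = ldexpf(8 + m, e - 10);   /* (1.mmm) * 2^(e-7) */
    return s ? -x : x;
}

/* FP8S (E4.F3.S): same fields, but with the sign in the LSB */
static float fp8s_dec(uint8_t v)
{
    return fp8_dec((uint8_t)((v >> 1) | ((v & 1) << 7)));
}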
The re-added FP8 was instead overlapped with the unpack logic used for A-Law (even with the obvious difference...).
The encoder-side logic for FP8 can be implemented by taking the FP8S output and reordering the bits (in an "assign"). Doing the same on the decoder input side would not likely have saved anything, though (attempts to MUX on the input side seemingly tend to result in duplicating whatever LUTs follow).
Though, one could almost argue for combining all 4 cases into shared encoder/decoder modules (since at least 3 of the 4 formats have the mantissa and exponent bits in the same place, FP8 being the odd one out, and A-Law being off-by-1 in terms of Bias).
This appears to be similar to what NV and PyTorch used, and also overlaps with my handling of A-Law (though, the largest possible value of A-Law is understood as ~ 0.969).
A-Law has slightly higher precision, but is normally limited to unit range. Its main use-case is representing audio, but it has sometimes also been used where a small unit-range format was needed and precision wasn't a priority.
For example, with slight fudging, it can be used to store unit quaternions, among other things. It is basically accurate enough for things like object orientations and 3D camera rotations, though generally the quaternion needs to be re-normalized after unpacking.
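
As a rough sketch (treating A-Law here as a plain S.E3.F4 minifloat with Bias=8 and truncating rounding; alaw8_enc/alaw8_dec are illustrative stand-ins, not the actual pack/unpack ops):

#include <math.h>
#include <stdint.h>

static uint8_t alaw8_enc(float x)     /* S.E3.F4, Bias=8, max ~0.969 */
{
    uint8_t s = (x < 0) ? 0x80 : 0;
    x = fabsf(x);
    if (x < 0.00390625f) return s;    /* below 2^-8: flush to 0 */
    if (x > 0.96875f) x = 0.96875f;   /* clamp to the largest value */
    int e; float m = frexpf(x, &e);   /* x = m*2^e, m in [0.5,1) */
    return (uint8_t)(s | ((e + 7) << 4) | ((int)(m * 32) - 16));
}

static float alaw8_dec(uint8_t v)
{
    int e = (v >> 4) & 7, m = v & 15;
    float x = (e | m) ? ldexpf(16 + m, e - 12) : 0.0f;  /* 1.ffff*2^(e-8) */
    return (v & 0x80) ? -x : x;
}

/* pack a unit quaternion to 4 bytes; renormalize after unpacking */
static void quat_pack(uint8_t out[4], const float q[4])
{
    for (int i = 0; i < 4; i++) out[i] = alaw8_enc(q[i]);
}

static void quat_unpack(float q[4], const uint8_t in[4])
{
    float n2 = 0;
    for (int i = 0; i < 4; i++) { q[i] = alaw8_dec(in[i]); n2 += q[i]*q[i]; }
    float inv = 1.0f / sqrtf(n2);     /* fix up clamping/quantization error */
    for (int i = 0; i < 4; i++) q[i] *= inv;
}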
Ironically, my A-Law implementation and typical use differ from how it is usually stored in WAV files, where it is generally XOR'ed with 0x55; this is an easy enough fix when loading audio data or similar.
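
For example, something like (assuming the format is otherwise bit-compatible):

#include <stddef.h>
#include <stdint.h>

/* undo the 0x55 bit-toggling used for A-Law samples in WAV storage */
static void alaw_fix_wav(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] ^= 0x55;
}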
There is also u-Law, but u-Law isn't really a minifloat format.
These formats can also be used for pixel data; though FP8U often made more sense for RGBA values (generally, negative RGBA isn't really a thing).
However, pixel values may go outside unit range, so A-Law doesn't work for HDR pixel data. The use of FP8 or FP8S works, but gives lower quality than FP8U. Here, FP8U gives slightly better quality than RGB555 over LDR range, whereas FP8 or FP8S is slightly worse for bright values (1 bit less accuracy between 0.5 and 1.0).
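
For reference, an encode sketch for FP8U (E4.F4, Bias=7), truncating, and guessing that the all-zero encoding means 0 and that the top encoding is simply the largest value:

#include <math.h>
#include <stdint.h>

static uint8_t fp8u_enc(float x)
{
    int e; float m;
    if (x < 0.0078125f) return 0;     /* below 2^-7: flush to 0 */
    m = frexpf(x, &e);                /* x = m*2^e, m in [0.5,1) */
    e += 6;                           /* 1.ffff*2^(E-7) -> E = e+6 */
    if (e > 15) return 0xFF;          /* clamp to the largest encoding */
    return (uint8_t)((e << 4) | ((int)(m * 32) - 16));
}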
For normal bitmap graphics, I am mostly using RGB555 at present though.
There isn't yet a fast conversion path between RGB555 and floating-point formats, but, say:
   RGB5UPCK64  //Unpack RGB555 to 4x WORD
   PCVTUW2H    //Packed Word to Half (1.0 .. 2.0)
   PADD.H      //To adjust DC bias to 0.0 .. 1.0.
   ? PSTCM8UH  //to FP8U (typical option for HDR RGBA pixel data)
   ? PSTCF8H   //to FP8 (newly added)
But, the crufty Word<->Half SIMD conversions exist mostly because supporting "better" SIMD converters would have been more expensive (the DC-bias offset allows doing the conversions by repacking bits, whereas unit-range conversions would have required the more expensive path of adding format-conversion logic to the SIMD FADD units).
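
As a scalar model of what the repack trick amounts to for one channel (the exact bit placement is a guess on my part; 0x3C00 is 1.0 in Binary16):

#include <stdint.h>

/* assume the unpack leaves the 5-bit channel in the high bits of a
   16-bit word, and that PCVTUW2H drops the word's top 10 bits into the
   mantissa under an exponent of 1.0, giving [1.0,2.0); the PADD.H of
   -1.0 then gives unit range with no real int->float conversion logic */
static uint16_t rgb5_channel_to_biased_half(unsigned c5)  /* 0..31 */
{
    uint16_t w = (uint16_t)(c5 << 11);        /* 5-bit channel -> WORD */
    return (uint16_t)(0x3C00 | (w >> 6));     /* Binary16, 1.0 <= v < 2.0 */
}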
Note that most of the SIMD format converters are essentially applied bit-twiddling (and generally do no rounding or similar, as rounding would add a considerable amount of cost here...).
Though, cases needing fast conversion of pixel data between RGB555 and floating-point forms have been uncommon (most pixel math starting from RGB555 tends to remain on the integer side of things).
If TKRA-GL were using HDR, the most likely option here is:
   If HDR is in use, and the program binds an LDR texture,
   the GL backend can internally and quietly generate an HDR version of the texture and use it instead,
as opposed to trying to dynamically transform RGB555 or UTX2 into HDR during texel load.
Though, another option would be to base it on the GL context:
   If the OpenGL framebuffer is HDR;
   All uploaded textures get converted to HDR formats as well.
     So, RGB555/RGBA8888/... -> FP8U, and DXT1/DXT5/BC6H/BC7 -> UTX3.
...
For things like streaming PCM audio to the audio hardware, say:
   2x PSHUF.W+MOVxxD     //Shuffle from Stereo to 1x/2x Mono
   PCVTSW2H    //Packed Word to Half (2.0 .. 4.0)
   PADD.H      //To adjust DC bias to -1.0 .. 1.0.
   PCVTH2AL    //Convert Half to A-Law
Where, the programs and API use 16-bit stereo PCM, and my audio hardware generally uses separate Left/Right A-Law for the loop buffers.
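
The signed conversion works the same way at the bit level (again, the exact placement is my guess; 0x4000 is 2.0 in Binary16):

#include <stdint.h>

/* scalar model of PCVTSW2H: offset the sample to unsigned and pack its
   top 10 bits under an exponent of 2.0, giving [2.0,4.0); the PADD.H of
   -3.0 then recenters to [-1.0,1.0) before PCVTH2AL encodes to A-Law */
static uint16_t pcm16_to_biased_half(int16_t s)
{
    uint16_t u = (uint16_t)(s + 32768);       /* -32768..32767 -> 0..65535 */
    return (uint16_t)(0x4000 | (u >> 6));     /* Binary16, 2.0 <= v < 4.0 */
}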
A-Law was used mostly because:
   8-bit linear PCM tends to sound like garbage;
   16-bit PCM needs twice the Block-RAM (relative to sample rate);
   A-Law quality is closer to 16-bit PCM, at the same size as 8-bit PCM.
So, I ended up designing the audio hardware to use A-Law.
But, at 50MHz, the CPU is slow enough that one has reason to care about how many clock cycles the audio encoding uses (doing it in plain software was slow, so I ended up doing it via SIMD).
Generally, most audio mixing code has tended to use 16-bit PCM, as using Binary16 or Binary32 for front-end audio mixing is a bit of a novelty. It wouldn't be that hard to support in theory; it would just need to be expressed via the WAVEFORMATEX structure (assuming backend code were added to support raw floating-point PCM).
The API does also support 8-bit PCM, but this is the worst case quality-wise (combining the initial poorness of 8-bit PCM with some additional information loss in the conversion to A-Law).
Though, 8-bit PCM is still acceptable for sound effects and similar; when mixed into a PCM buffer, the amplitude and DC bias are typically all over the place anyway.
I had experimented early on with possible "small block" audio compression (and an ADPCM variant) for the audio hardware, but couldn't really get acceptable results. A-Law seemed to be the most reasonable compromise (in terms of encoding cost and "didn't sound like crap").
While ADPCM can give OK quality relative to size, it was a rather poor fit for the use-case (it is much better as an "offline" audio storage format).
These 8-bit floating-point formats are generally too poor in terms of quality to be used for direct computation in SIMD operations.
Some stuff online implies that FP8 could be used as an intermediate computation format in neural nets, but my own past experiments in these areas implied that FP8 was insufficient (for good results, one seemingly needs around 7 or 8 bits for the mantissa).
Granted, this was mostly with trying to use NN's for image processing tasks (which likely have higher accuracy requirements than, say, LLMs).
However, FP8 can work OK for weights. Some experiments had used A-Law, but A-Law requires an extra scaling step before adding the bias and invoking the activation function (this could be avoided with FP8).
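
Roughly, the difference looks like this, where layer_scale is a hypothetical per-layer constant needed because unit-range weights have to be stored pre-divided (with FP8 weights, the scale can usually be folded into the weights themselves); alaw8_dec is the sketch from earlier:

static float neuron_alaw(const float *x, const uint8_t *w, int n,
                         float layer_scale, float bias)
{
    float acc = 0;
    for (int i = 0; i < n; i++)
        acc += x[i] * alaw8_dec(w[i]);
    acc = acc * layer_scale + bias;   /* the extra scaling step */
    return acc > 0 ? acc : 0;         /* ReLU */
}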
For image-filtering NNs, it seems better to work primarily with Binary16 and ReLU activation or similar.
Though, the "approximate ssqrt" can work OK (approximate ssqrt being roughly comparable to tanh, but cheaper to calculate). The "approximate" part is that, relative to the usual definition, one can leave off the Newton-Raphson iteration stages.
Well, in a similar way to how, in graphics processing, it can sometimes be useful to redefine the Binary16 divide "A/B" as roughly "A*(0x7800-B)", with the subtraction done on B's raw bit pattern (if the speed of the divide matters more than the accuracy of the result).
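
In C, assuming a compiler with a native _Float16 type (a compiler extension, not standard C), the trick looks roughly like:

#include <stdint.h>
#include <string.h>

/* crude Binary16 reciprocal-multiply: 0x7800 is 2^15, so subtracting
   the bit pattern of b roughly negates its exponent; no Newton-Raphson
   refinement, so accuracy is poor but it is cheap */
static _Float16 fast_div_h(_Float16 a, _Float16 b)
{
    uint16_t bb, rb;
    _Float16 r;
    memcpy(&bb, &b, 2);
    rb = (uint16_t)(0x7800 - bb);
    memcpy(&r, &rb, 2);
    return a * r;
}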
Though, it generally makes sense to train the net with the same range and precision it is intended to run with (so, if it is going to be run as Binary16 with approximate operators, it also needs to be trained using Binary16 and approximate operators).
Though, this is moderately annoying for "normal C on a desktop PC", as both Binary16 and FP8 are absent and need to be faked in software.
Ended up going with FP8 for a "packed multiply expanding" instruction:
   PMUL.F8H Rs, Rt, Rn
Where, each FP8 in Rs and Rt is multiplied, and the result expands to a Binary16 element in Rn.
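
A scalar model of one lane, going by my guess at the semantics (E4M3 in, Binary16 out; the product of the two 4-bit significands fits exactly in the Binary16 mantissa, so no rounding is needed; zero and the combined Inf/NaN encodings handled as described earlier):

#include <stdint.h>

static uint16_t fp8_mul_to_half(uint8_t a, uint8_t b)
{
    unsigned sgn = ((a ^ b) >> 7) & 1;
    if ((a & 0x7F) == 0x7F || (b & 0x7F) == 0x7F)  /* combined Inf/NaN */
        return (uint16_t)((sgn << 15) | 0x7C00);
    if ((a & 0x7F) == 0 || (b & 0x7F) == 0)        /* zero encodings */
        return (uint16_t)(sgn << 15);

    unsigned ea = (a >> 3) & 15, eb = (b >> 3) & 15;
    unsigned ma = (a & 7) | 8, mb = (b & 7) | 8;   /* add hidden bits */
    unsigned prod = ma * mb;                       /* 4x4 bits: 64..225 */
    unsigned e = ea + eb + 1;                      /* rebias 7+7 -> 15 */
    unsigned frac;

    if (prod & 0x80) { frac = (prod & 0x7F) << 3; e++; }  /* prod in [2,4) */
    else             { frac = (prod & 0x3F) << 4; }       /* prod in [1,2) */
    if (e > 30) return (uint16_t)((sgn << 15) | 0x7C00);  /* overflow: Inf */
    return (uint16_t)((sgn << 15) | (e << 10) | frac);
}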
Ended up not going with an FMAC variant, as its cost and latency would likely have been a bit higher than I would like (and possibly higher than the "inline shuffle" experiment).
The "PMUL.F8H" instruction was added with a 2-cycle latency, and seems to have a moderately low cost (no obvious impact on overall LUT costs). However, its logic is still complicated enough that I wouldn't want to try adding it as a 1-cycle operation.
As one merit of using FP8, the 3-bit mantissa is small enough that the mantissa pair can be multiplied directly via LUT6 lookups (with most of the cost likely along the exponent path).
But, I don't know if this would have much use beyond potentially being useful for neural-net code.
But, any thoughts?...
