On Wed, 31 Jul 2024 23:31:35 +0000, BGB wrote:

So, say, we have common formats:
Binary64, S.E11.F52, Common Use
Binary32, S.E8.F23, Common Use
Binary16, S.E5.F10, Less Common Use
>
But, things get funky below this:
A-Law: S.E3.F4 (Bias=8)
FP8: S.E4.F3 (Bias=7) (E4M3 in NVIDIA terms)
FP8U: E4.F4 (Bias=7)
FP8S: E4.F3.S (Bias=7)
>
>
Semi-absent in my case:
BFloat16: S.E8.F7
Can be faked in software in my case using Shuffle ops.
NVIDIA E5M2 (S.E5.F2)
Could be faked using RGBA32 pack/unpack ops.
No immediate plans to add these later cases as (usually) I have a need
for more precision than more exponent range. The main seeming merit of
these formats being that they are truncated forms of the wider formats.

> So, you have identified the problem:: 8-bits contains insufficient
> exponent and fraction widths to be considered standard format.
> Thus, in order to utilize 8-bit FP one needs several incarnations.
> This just points back at the problem:: FP needs at least 10 bits.

Though, 10 bits only gives 3 components per 32-bit word, or would need
5 bytes for 4 components; neither is ideal...

> There is a growing clamor for 128-bit FP, too.

Supported in my case, but currently software only (as "long double").
>
>
No need to elaborate on the use-cases for Binary32 and Binary64, wide
and varied.
Usually "gold standard" audio format IME are 44100Hz and 48000Hz 16-bit stereo. Personally, I don't notice much difference between 44kHz and 48kHz.>probably,
>
Binary16 is useful for graphicsand audio processing.Insufficient data width as high quality Audio has gone to 24-bits
{120 DBa S/N).
You can call MP3 and other "phone" formats Audio, but please restrict
yourself from using the term High Quality when doing so.
I had my experimental BITNN thing, where:Seemingly IEEEI have seen NN used compressed FP formats where 0 uses 1-bit and
specifies it mostly for storage and not for computation, but for these
cases it is good enough for computation as well.
>
Binary16 is mostly sufficient for 3D model geometry, and for small 3D
scenes, but not really for 3D computations or larger scenes (using it
for transform or projection matrices or matrix multiply does not give
acceptable results).
>
Does work well for fast sin/cos lookup tables (if supported natively),
say, because the error of storing an angle as 1/256 of a circle is
larger than the error introduced by the 10 bit mantissa.
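
Roughly, in C, the sort of table this implies (just a sketch; the
truncating float-to-half helper and the 256-entry size are my own
illustration here, not the actual implementation):

  #include <math.h>
  #include <stdint.h>

  /* Truncating float -> binary16 conversion (normals only, no rounding),
     in the spirit of bit-twiddling converters (sketch, not the real one). */
  static uint16_t f32_to_f16_trunc(float f)
  {
      union { float f; uint32_t u; } v = { f };
      uint32_t sign = (v.u >> 16) & 0x8000;
      int32_t  e    = (int32_t)((v.u >> 23) & 0xFF) - 127 + 15;
      uint32_t man  = (v.u >> 13) & 0x03FF;
      if (e <= 0)  return (uint16_t)sign;               /* flush small to 0 */
      if (e >= 31) return (uint16_t)(sign | 0x7C00);    /* clamp to Inf     */
      return (uint16_t)(sign | ((uint32_t)e << 10) | man);
  }

  static uint16_t sin_lut[256];   /* angle stored as 1/256 of a circle */

  static void init_sin_lut(void)
  {
      for (int i = 0; i < 256; i++)
          sin_lut[i] = f32_to_f16_trunc(sinf(i * (6.283185307f / 256.0f)));
  }

  /* The 1/256-circle quantization of the angle dominates the error from
     the 10-bit binary16 mantissa, so binary16 entries lose little. */
  static uint16_t fast_sin_h(uint8_t angle) { return sin_lut[angle]; }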
>
I had also used it as the computational format in a lot of my neural-net
experiments.
>
> I have seen NNs use compressed FP formats where 0 uses 1-bit and
> 1.0 uses but 2-bits. ...

I had my experimental BITNN thing, where: ...

The 8-bit formats get a bit more niche; main use-cases mostly to save
memory.

> Sometimes power also.

Probably.
>
> Or don't do it that way.

Better option?...
FP8S originally exists because it was cheaper to encode/decode alongside
FP8U than traditional FP8. Originally, FP8S replaced FP8, but now FP8 has
been re-added. I couldn't simply replace FP8S with FP8 entirely, partly
as it seems my existing binaries depend on FP8S in a few places, and so
replacing it would have broken them.
>
So, options were to either add some separate ops for FP8, or just live
with using my wonky/non-standard FP8S format (or break my existing
binaries). Ended up deciding to re-add FP8.
>
> If you are going to do an F8 make it compatible with OpenGL.

OpenGL doesn't have FP8.
FP8 is used apparently by NVIDIA GPUs, also apparently by PyTorch and a
few other things. The variant used in my case is seemingly fairly
similar to that used by NVIDIA and PyTorch.
> That combination is well served with a single 10-bit FP format.

Much easier to have multiple 8-bit FP formats than to try to somehow
shove 2 more bits into a byte...

Unlike the minifloat format described on Wikipedia (which is defined as
following IEEE 754 rules), the format here differs from IEEE rules in the
handling of large and small values. No separate Inf/NaN range, rather the
largest value serves as an implicit combined Inf/NaN, with the smallest
value understood as 0.
>
The main difference here between FP8 and FP8S being the location of the
sign bit (putting it in the LSB initially allowed avoiding some MUX'ing
when paired with FP8U).
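
For reference, a scalar C sketch of how such a value could be decoded
under the convention described above (this just illustrates the
S.E4.F3/Bias=7 layout and the Inf/NaN-at-the-top rule; it is not the
actual decoder logic):

  #include <math.h>
  #include <stdint.h>

  /* FP8 here: S.E4.F3, Bias=7, no subnormals; smallest magnitude reads as
     0, largest magnitude acts as a combined Inf/NaN. */
  static float fp8_to_float(uint8_t v)
  {
      float    s   = (v & 0x80) ? -1.0f : 1.0f;
      uint32_t mag = v & 0x7F;

      if (mag == 0x00) return s * 0.0f;       /* smallest value -> 0      */
      if (mag == 0x7F) return s * INFINITY;   /* largest value -> Inf/NaN */

      int   e = (int)(mag >> 3) - 7;                  /* 4-bit exponent    */
      float m = 1.0f + (float)(mag & 0x07) / 8.0f;    /* 3-bit mantissa    */
      return s * ldexpf(m, e);
  }

  /* FP8S has the same fields but keeps the sign in the LSB (E4.F3.S),
     so it can be decoded by first rotating the sign back to the MSB. */
  static float fp8s_to_float(uint8_t v)
  {
      uint8_t fp8 = (uint8_t)(((v & 0x01) << 7) | (v >> 1));
      return fp8_to_float(fp8);
  }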
>
>
The re-added FP8 was instead overlapped with the unpack logic used for
A-Law (even with the obvious difference...).
>
The encoder-side logic for FP8 can be implemented by taking the FP8S
output and reordering the bits (in an "assign"). Though, doing this on
the decoder input side would not likely have saved anything (attempts to
MUX on the input side seemingly tend to result in duplicating any LUTs
that follow afterwards).
>
Though, one could almost argue for combining all 4 cases into shared
encoder/decoder modules (well, since at least 3/4 of the formats have
the mantissa and exponent bits in the same place, FP8 being the odd one
out; and A-Law being off-by-1 in terms of Bias).
>
> It sounds more like computer automated speaking to me--Oh Wait--that ...
This appears to be similar to what NV and PyTorch used, and also
overlaps with my handling of A-Law (though, the largest possible value
of A-Law is understood as ~ 0.969).
>
Where, A-Law has slightly higher precision, but is normally limited to
unit range. Main use-case is in representing audio, but was sometimes
also used when a small unit-range format was needed and precision wasn't
a priority.
>
For example, with slight fudging, it can be used to store
unit-quaternions, among other things. It is basically accurate enough to
store things like object orientations and 3D camera rotations. Though,
generally, it is needed to normalize the quaternion after unpacking it.
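
Something like the following C sketch illustrates the idea (the
S.E3.F4/Bias=8 decode and the post-unpack normalize; the byte ordering
and zero handling here are my own assumptions for illustration):

  #include <math.h>
  #include <stdint.h>

  /* A-Law treated as a minifloat: S.E3.F4, Bias=8 (max value ~0.969). */
  static float alaw_to_float(uint8_t v)
  {
      float    s   = (v & 0x80) ? -1.0f : 1.0f;
      uint32_t mag = v & 0x7F;
      if (mag == 0) return s * 0.0f;                  /* treated as 0 (assumed) */
      int   e = (int)(mag >> 4) - 8;                  /* 3-bit exponent         */
      float m = 1.0f + (float)(mag & 0x0F) / 16.0f;   /* 4-bit mantissa         */
      return s * ldexpf(m, e);
  }

  /* Unpack a quaternion stored as 4 A-Law bytes, then re-normalize it,
     since the quantization leaves it slightly off unit length. */
  static void unpack_quat_alaw(const uint8_t q8[4], float q[4])
  {
      float len2 = 0.0f;
      for (int i = 0; i < 4; i++) {
          q[i] = alaw_to_float(q8[i]);
          len2 += q[i] * q[i];
      }
      float inv = 1.0f / sqrtf(len2);
      for (int i = 0; i < 4; i++) q[i] *= inv;
  }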
>
>
Ironically, for A-Law, my implementation and typical use differs from
how it is usually stored in WAV files, in that in WAV files it is
generally XOR'ed with 0x55, but this is an easy enough fix when loading
audio data or similar.
>
There is also u-Law, but u-Law isn't really a minifloat format.
>
>
>
These formats can also be used for pixel data; though FP8U often made
more sense for RGBA values (generally, negative RGBA isn't really a
thing).
>
However, pixel values may go outside unit range, so A-Law doesn't work
for HDR pixel data. The use of FP8 or FP8S works, but gives lower
quality than FP8U. Here, FP8U gives slightly better quality than RGB555
over LDR range, whereas FP8 or FP8S is slightly worse for bright values
(1 bit less accuracy between 0.5 and 1.0).
>
>
For normal bitmap graphics, I am mostly using RGB555 at present though.
>
There isn't yet a fast conversion path between RGB555 and floating-point
formats, but, say:
RGB5UPCK64 //Unpack RGB555 to 4x WORD
PCVTUW2H //Packed Word to Half (1.0 .. 2.0)
PADD.H //To adjust DC bias to 0.0 .. 1.0.
? PSTCM8UH //to FP8U (typical option for HDR RGBA pixel data)
? PSTCF8H //to FP8 (newly added)
>
>
But, the crufty Word<->Half SIMD conversions exist mostly because it
would have been more expensive to support "better" SIMD converters (the
DC bias offset allowed doing the conversions via repacking the bits;
whereas unit-range conversions would have required the more expensive
path of adding the format conversion logic to the SIMD FADD units).
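
A rough scalar C equivalent of that repacking trick (a sketch; the exact
field placement in the real instruction is a guess on my part):

  #include <stdint.h>

  /* Unsigned 16-bit word -> binary16 bits in [1.0, 2.0):
     jam the top 10 bits of the word into the mantissa under a fixed
     exponent of 0 (0x3C00 = 1.0).  A following PADD.H of -1.0 then
     shifts the value into unit range; no FADD-side format logic needed. */
  static uint16_t word_to_half_biased(uint16_t w)
  {
      return (uint16_t)(0x3C00 | (w >> 6));
  }

The signed case (PCVTSW2H) would presumably work the same way, just with
the sign bit flipped, a [2.0, 4.0) exponent, and a -3.0 bias adjust.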
>
Note that most of the SIMD format converters exist as applied use of
bit-twiddling (and generally no rounding or similar, as rounding would
add considerable amounts of cost here...).
>
>
Though, cases needing fast conversion of pixel data between RGB555 and
floating-point forms have been uncommon (most pixel math starting from
RGB555 tends to remain on the integer side of things).
>
>
If TKRA-GL were using HDR, most likely option here is:
If HDR is used;
The program binds an LDR texture.
>
The GL backend can internally quietly generate an HDR version of the
texture and use this instead; as opposed to trying to dynamically
transform RGB555 or UTX2 into HDR during texel load.
>
Though, another option would be to base it on the GL context:
If the OpenGL framebuffer is HDR;
All uploaded textures get converted to HDR formats as well.
So, RGB555/RGBA8888/... -> FP8U, and DXT1/DXT5/BC6H/BC7 -> UTX3.
>
....
>
>
>
For things like streaming PCM audio to the audio hardware, say:
2x PSHUF.W+MOVxxD //Shuffle from Stereo to 1x/2x Mono
PCVTSW2H //Packed Word to Half (2.0 .. 4.0)
PADD.H //To adjust DC bias to -1.0 .. 1.0.
PCVTH2AL //Convert Half to A-Law
>
Where, the programs and API use 16-bit stereo PCM, and my audio hardware
generally uses separate Left/Right A-Law for the loop buffers.
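
In scalar C, the end result of that conversion path is roughly as below
(an illustrative sketch assuming the S.E3.F4/Bias=8 layout from earlier
and no 0x55 XOR; the real path goes through Binary16 SIMD as shown above):

  #include <stdint.h>

  /* Encode one 16-bit PCM sample to the A-Law-style minifloat byte
     (sign in MSB, 3-bit exponent, 4-bit mantissa, bias 8). */
  static uint8_t pcm16_to_alaw(int16_t s)
  {
      uint8_t sign = (s < 0) ? 0x80 : 0x00;
      int mag = (s < 0) ? -(int)s : (int)s;         /* |sample|, 0..32768    */
      if (mag > 32767) mag = 32767;
      if (mag < 128)   return sign;                 /* below smallest step -> 0 */

      int e = 7;
      while (!(mag & 0x4000)) { mag <<= 1; e--; }   /* normalize leading 1   */
      int m = (mag >> 10) & 0x0F;                   /* next 4 bits, truncated */
      return (uint8_t)(sign | (e << 4) | m);
  }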
>
A-Law was used mostly because:
  8-bit linear PCM tends to sound like garbage;
  16-bit PCM needs twice the Block-RAM (relative to sample rate);
  A-Law quality is closer to 16-bit PCM, at the same size as 8-bit PCM.
So, I ended up designing the audio hardware to use A-Law.

> does sound like Garbage:: Sorry !!
> 16-bit Audio is so 1990.....

Still pretty much standard though...

> Audio is supposed to sound like you were there listening to it live in
> a building designed for its acoustics.....but alas...

I was more going for "works, doesn't sound like crap".
But, on a 50MHz CPU, the CPU is slow enough that one has reason to care
about how many clock-cycles are used by the audio encoding (doing it in
software was slow; so ended up doing it via SIMD).
>
Generally, most audio mixing code has tended to use 16-bit PCM, as using
Binary16 or Binary32 for front-end audio mixing is a bit of a novelty.
Wouldn't be that hard to support in theory, would just need to be
expressed via the WAVEFORMATEX structure (and, assuming the backend code
was added to support raw floating-point PCM).
>
The API does also support 8-bit PCM, but this is the worst case quality
wise (combining both the initial poorness of 8-bit PCM with some
additional information loss in the conversion to A-Law).
Though, 8-bit PCM is still acceptable for use in sound-effects and
similar. When mixed into a PCM buffer, typically amplitude and DC bias
is all over the place.
>
>
Had (early on) experimented with possible "small block" audio
compression (and an ADPCM variant) for the audio hardware, but couldn't
really get acceptable results. A-Law seemed to be the most reasonable
compromise (in terms of encoding cost and "didn't sound like crap").
While ADPCM can give OK quality relative to size, it was a rather poor
fit for the use-case (it is much better as an "offline" audio storage
format).
>
>
>
These 8-bit floating-point formats are generally too poor in terms of
quality to be used for direct computation in SIMD operations.
> So why support them ?

You don't use them for direct computation, but rather as storage formats
(and for constant data).
Some stuff online implies that FP8 could be used as an intermediate
computation format in neural nets, but my own past experiments in these
areas implied that FP8 was insufficient (for good results, one seemingly
needs around 7 or 8 bits for the mantissa).
Granted, this was mostly with trying to use NN's for image processing
tasks (which likely have higher accuracy requirements than, say, LLMs).

> What several NN architectures do is to use a 256-bit word and then
> decode it into multiple F8 or F10 or F12 components using a Huffman
> coding scheme. 0 takes 1-bit, 1.0 takes 2, leaving lots of bits for
> other mantissas. These were done to save memory BW, not particularly
> size, but raw aggregated BW.

Not really gonna happen in the BJX2 core...

> Run it as a GPGPU

Possibly.
>
However, FP8 can work OK for weights. Some experiments had used A-Law,
but I can note that A-Law requires adding an extra scaling step before
adding the bias and invoking an activation function (this could be
avoided with FP8).
>
For image-filtering NNs, seems to be better to work primarily using
Binary16 and ReLU activation or similar.
>
Though, the "approximate ssqrt" can work OK (where approximate ssqrt is
roughly comparable to "tanh", but cheaper to calculate). The
"approximate" part being that, by usual definition, one can leave off
the Newton-Raphson iteration stages.
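
As a sketch of what I mean (operating directly on binary16 bit patterns;
the handling of signs and zeroes here is just illustrative):

  #include <stdint.h>

  /* Approximate signed square root, ssqrt(x) = sign(x)*sqrt(|x|), done as
     a bit trick on binary16: halving the exponent+mantissa field roughly
     halves log2(|x|).  No Newton-Raphson refinement step. */
  static uint16_t h_ssqrt_approx(uint16_t h)
  {
      uint16_t sign = h & 0x8000;
      uint16_t mag  = h & 0x7FFF;
      if (mag == 0) return sign;                         /* +/-0 -> +/-0        */
      return (uint16_t)(sign | ((mag >> 1) + 0x1E00));   /* 0x1E00 = (15<<10)/2 */
  }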
>
Well, in a similar way to how, in graphics processing, it can sometimes
be useful to redefine Binary16 divide "A/B" as roughly "A*(0x7800-B)"
(if the speed of the divide matters more than the accuracy of the
result).
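
In bit terms that trick looks something like this (a sketch; 0x7800 is
twice the binary16 exponent bias, and there is no special handling of 0,
Inf, or NaN here):

  #include <stdint.h>

  /* Approximate reciprocal of a binary16 value: subtracting the bits from
     0x7800 (= 30 << 10) roughly negates log2(|b|).  A divide A/B then
     becomes a single multiply by this value. */
  static uint16_t h_recip_approx(uint16_t b)
  {
      uint16_t sign = b & 0x8000;
      uint16_t mag  = b & 0x7FFF;
      return (uint16_t)(sign | (uint16_t)(0x7800 - mag));
  }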
>
Though, generally makes sense to train the net with the same range and
precision intended to run it (so, if it is going to be run as Binary16
with approximate operators, it also needs to be trained using Binary16
and approximate operators).
>
>
Though, moderately annoying for "normal C on a desktop PC", as both
Binary16 and FP8 are absent and will need to be faked in software.
>
> Stuff like this falls out "for free" under VVM.

Dunno...
>
Ended up going with FP8 for a "packed multiply expanding" instruction:
PMUL.F8H Rs, Rt, Rn
Where, each FP8 in Rs and Rt is multiplied, and the result expands to a
Binary16 element in Rn.

> Architecture is as much about what gets left out as what gets put in.
Ended up not going with FMAC, as it is likely the cost and latency would
have been a bit higher than I would like (and possibly higher than the
"inline shuffle" experiment).
>
The "PMUL.F8H" instruction was added with a 2-cycle latency, and seems
to have a moderately low cost (no obvious impact on overall LUT costs).
However, its logic is still complicated enough that I wouldn't want to
try adding it as a 1-cycle operation.
>
As one merit of using FP8, the 3-bit mantissa is small enough that the
pair of mantissas can directly use LUT6 lookups (and most of the cost is
likely along the exponent path).
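
As a rough reference model for what the instruction computes (scalar C;
the lane count/packing, and the Inf/NaN and overflow handling, are my own
assumptions here rather than the actual behavior):

  #include <stdint.h>

  /* One lane: multiply two FP8 (S.E4.F3, Bias=7) values exactly into a
     binary16 result.  The 4x4-bit significand product fits the 10-bit
     binary16 mantissa with room to spare (hence the small LUT cost). */
  static uint16_t fp8_mul_to_f16(uint8_t a, uint8_t b)
  {
      uint16_t sign = (uint16_t)(((a ^ b) & 0x80) << 8);
      uint32_t ma = a & 0x7F, mb = b & 0x7F;

      if (ma == 0 || mb == 0)       return sign;           /* 0 * x -> 0     */
      if (ma == 0x7F || mb == 0x7F) return sign | 0x7E00;  /* Inf/NaN -> NaN */

      int ea = (int)(ma >> 3), eb = (int)(mb >> 3);        /* biased, bias 7 */
      uint32_t p = (8 + (ma & 7)) * (8 + (mb & 7));        /* 64..225        */

      int e = ea + eb - 14 + 15;                           /* rebias to 15   */
      uint32_t frac;
      if (p >= 128) { e += 1; frac = (p - 128) << 3; }     /* 2 <= sig < 4   */
      else          {         frac = (p - 64)  << 4; }     /* 1 <= sig < 2   */

      if (e >= 31) return (uint16_t)(sign | 0x7C00);       /* clamp to Inf   */
      if (e <= 0)  return sign;                            /* flush to 0     */
      return (uint16_t)(sign | ((uint32_t)e << 10) | frac);
  }

  /* Packed form, assuming (hypothetically) 4 FP8 lanes from the low 32
     bits of Rs/Rt expanding into 4 binary16 lanes of a 64-bit Rn. */
  static uint64_t pmul_f8h(uint64_t rs, uint64_t rt)
  {
      uint64_t rn = 0;
      for (int i = 0; i < 4; i++) {
          uint8_t a = (uint8_t)(rs >> (8 * i));
          uint8_t b = (uint8_t)(rt >> (8 * i));
          rn |= (uint64_t)fp8_mul_to_f16(a, b) << (16 * i);
      }
      return rn;
  }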
>
But, I don't know if this would have much use outside of potentially
being useful for neural-net code.
>
>
>
But, any thoughts?...