On 8/3/2024 4:40 AM, Terje Mathisen wrote:
> MitchAlsup1 wrote:
>> On Wed, 31 Jul 2024 23:31:35 +0000, BGB wrote:
>>>
>>> So, say, we have common formats:
>>>   Binary64, S.E11.F52, Common Use
>>>   Binary32, S.E8.F23, Common Use
>>>   Binary16, S.E5.F10, Less Common Use
>>>
>>> But, things get funky below this:
>>>   A-Law: S.E3.F4 (Bias=8)
>>>   FP8:   S.E4.F3 (Bias=7) (E4M3 in NVIDIA terms)
>>>   FP8U:  E4.F4 (Bias=7)
>>>   FP8S:  E4.F3.S (Bias=7)
>>>
>>> Semi-absent in my case:
>>>   BFloat16: S.E8.F7
>>>     Can be faked in software in my case using Shuffle ops.
>>>   NVIDIA E5M2 (S.E5.F2)
>>>     Could be faked using RGBA32 pack/unpack ops.
>>>
>> So, you have identified the problem:: 8 bits contains insufficient exponent and fraction widths to be considered a standard format.
>> Thus, in order to utilize 8-bit FP one needs several incarnations.
>> This just points back at the problem:: FP needs at least 10 bits.
>
> I agree that fp10 is probably the shortest sane/useful version, but 1:3:4 does in fact contain enough exponent and mantissa bits to be considered an IEEE 754 format.
>
> 3 exp bits means that you have 6 steps for regular/normal numbers, which is enough to give some range.
>
> 4 mantissa bits (with hidden bit of course) handles zero/subnormal/normal/infinity/QNaN/SNaN.
>
> AFAIR the absolute limit is two mantissa bits, in order to differentiate between Inf/QNaN and SNaN, as well as two exp bits, so fp5 (1:2:2).
Though, 1.3.4 is basically A-Law; that format usually lacks both Inf/NaN and denormals, and is usually understood as encoding either a unit-range value or an integer value (when used for PCM).
One could use it with a bias of 4 rather than 8, giving:
E=7, 8.000 .. 15.500
E=6, 4.000 .. 7.750
E=5, 2.000 .. 3.875
E=4, 1.000 .. 1.938
E=3, 0.500 .. 0.969
E=2, 0.250 .. 0.484
E=1, 0.125 .. 0.242
E=0, 0.063 .. 0.121
Albeit, interpreting 0x00 as 0.000.
Or, with a Bias of 5:
E=7, 4.000 .. 7.750
...
E=1, 0.063 .. 0.121
E=0, 0.031 .. 0.061
Which would allow it to cover the same dynamic range as RGB555 within unit-range.
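As a concrete illustration of the tables above, a minimal C sketch (my own code) of decoding such an S.E3.F4 value with a parameterized bias, with no Inf/NaN or denormals, and with 0x00 special-cased to 0.000:

#include <math.h>

/* Minimal sketch (my own code): decode an S.E3.F4 value, A-Law-like,
   with a parameterized bias; no Inf/NaN or denormals, and the
   all-zeros byte is special-cased to 0.0 as noted above. */
static float fp8_decode(unsigned char v, int bias)
{
    int e = (v >> 4) & 7;          /* 3-bit exponent */
    int f = v & 15;                /* 4-bit fraction */
    float m;

    if (v == 0x00)                 /* 0x00 reads as 0.000 */
        return 0.0f;
    m = (16 + f) / 16.0f;          /* 1.F, hidden bit restored */
    m = ldexpf(m, e - bias);       /* scale by 2^(E-bias) */
    return (v & 0x80) ? -m : m;
}

With bias=4 this reproduces the first table (e.g., E=7/F=15 decodes to 15.5); with bias=5, the second.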
Though, the plan for HDR in my case was to use FP8U:
E4.F4, Bias=7 (Positive values only, negative clamps to 0)
Which (over a given dynamic range) gives quality comparable to RGB555.
This potentially also allows using mostly the same rendering path as for LDR RGBA32/RGBA8888, just pulling tricks in a few places (such as in the blending operations).
Though, interpolating floating-point values as integers does result in an S-curve distortion that grows more significant the further apart the values are (still TBD whether this would be acceptable in a visual sense).
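For a concrete example (my arithmetic, using the FP8U encoding above): 1.0 encodes as 0x70 and 4.0 as 0x90; the integer midpoint 0x80 decodes to 2.0, whereas true linear interpolation would give 2.5. Lerping the bit patterns behaves more like interpolating in log space, and the error grows as the endpoints get further apart.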
Still need to evaluate the cost of adding FP8U blend operators to the HW module (though, ironically, these would be FP8U values expressed within 16-bit fixed-point numbers).
Will likely need to devise a module that basically tries to quickly do a low-precision A*B+C*D operation. I think I may have a few cycles to play with, as generally one needs to give a clock-edge for the DSP48 to do its thing.
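As a rough C sketch of the kind of low-precision A*B+C*D blend in question (my own approximation: I arbitrarily assume an 8.8 fixed-point intermediate, which clamps the top of FP8U's range, and truncating rounding; the actual HW format is not specified here):

#include <stdint.h>

/* Rough sketch of a low-precision FP8U (E4.F4, Bias=7) blend:
   decode to 8.8 fixed point, do A*B+C*D there, and re-encode.
   The 8.8 intermediate and truncating rounding are my assumptions. */

static uint16_t fp8u_to_fx88(uint8_t v)
{
    uint32_t m = 16 + (v & 15);          /* 1.F, as a 1.4 fixed-point value */
    int sh = ((v >> 4) & 15) - 7 + 4;    /* unbias, align to 8 fraction bits */
    uint32_t r;
    if (v == 0) return 0;                /* all-zeros treated as 0.0 */
    r = (sh >= 0) ? (m << sh) : (m >> -sh);
    return (r > 0xFFFF) ? 0xFFFF : (uint16_t)r;
}

static uint8_t fx88_to_fp8u(uint32_t v)
{
    int e = 15;
    if (v >= 0x20000) return 0xFF;       /* clamp to the largest code */
    if (v < 2)        return 0x00;       /* below 2^-7: flush to zero */
    while (e > 0 && !(v >> (e + 1)))     /* locate the leading binade */
        e--;
    /* 4 fraction bits below the implicit leading 1 (truncating);
       note that E=0/F=0 collapses to the zero code here. */
    return (uint8_t)((e << 4) |
        ((e >= 3 ? (v >> (e - 3)) : (v << (3 - e))) & 15));
}

/* A*B + C*D, e.g. src*alpha + dst*(1-alpha) style blending. */
static uint8_t fp8u_blend(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    uint32_t ab = ((uint32_t)fp8u_to_fx88(a) * fp8u_to_fx88(b)) >> 8;
    uint32_t cd = ((uint32_t)fp8u_to_fx88(c) * fp8u_to_fx88(d)) >> 8;
    return fx88_to_fp8u(ab + cd);
}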
Though, for "good quality" HDR rendering, one would generally need Binary16 or similar (and a floating-point pathway).
Though, OTOH, whether I could implement a GLSL compiler that gave acceptable performance is unknown. One intermediate possibility could be, rather than using GLSL or BJX2 assembly, making a sort of crude translator that converts ARB assembly into BJX2 machine code.
Though this would be mildly inconvenient, as the operations would likely need to translate between fixed-point and Binary16, which would require a type system to keep track of this.
Either way, the use of shaders would need to fall back to the software-rasterization path (possibly slotting the shader function in place of the blend operator); generally, TKRA-GL had combined both the Source and Destination blend operators into a single function pointer, as illustrated below.
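For illustration, the kind of combined blend slot I mean (hypothetical names and signature; the post does not show TKRA-GL's actual interface):

#include <stdint.h>

/* Hypothetical illustration of a combined source+destination blend
   slot; TKRA-GL's actual interface is not shown here, so the names
   and signature are made up. */
typedef uint32_t (*blend_fn)(uint32_t src, uint32_t dst);

/* Classic SrcAlpha/OneMinusSrcAlpha, folded into a single function. */
static uint32_t blend_src_alpha(uint32_t src, uint32_t dst)
{
    uint32_t a = (src >> 24) & 0xFF, r = 0;
    int i;
    for (i = 0; i < 24; i += 8) {      /* blend the R, G, B channels */
        uint32_t s = (src >> i) & 0xFF, d = (dst >> i) & 0xFF;
        r |= (((s * a + d * (255 - a) + 127) / 255) & 0xFF) << i;
    }
    return r | (dst & 0xFF000000u);    /* keep destination alpha */
}

/* A compiled shader could then be slotted in place of such a pointer. */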
If I were to implement shaders and a GLSL compiler, it could jump from ~ GL 1.2/1.3 territory up to GL 2.x ...
Some other 2.x features, like occlusion queries, have already been implemented (but, the shader compiler is the hard part in this case).
Though, ironically, if it supported shaders, and the shaders "didn't suck", it would be ahead of both of my laptops:
2003 laptop: No shader support;
2009 laptop:
Shaders can be enabled in driver;
Shader performance is unusable (immediately drops to a slide-show).
The 2009 laptop was a motivating factor in my first foray into software-rendered OpenGL, as, ironically, on a 2.1 GHz Intel Core based CPU, the software renderer was not *that* much slower than the HW renderer.
Both laptops had one thing in common:
Both could run Half-Life, and both fell on their faces with anything much newer (but, seemingly, for different reasons).
Though, it seems like the 2003 laptop should be faster than it is.
I suspect it may be held back by the RAM:
On some tests, the factor by which its performance beats the BJX2 core is closer to the ratio of memory bandwidths than to the ratio of clock speeds.
Side note: a lot of this is based on information "from memory", so no claims about accuracy.
The 2009 laptop has only 50% more MHz, but runs circles around the older laptop in terms of CPU-side performance (and was hindered mostly by its seemingly terrible integrated GPU).
Like, while the CPU is 50% faster, the RAM seems to be around 5x faster (or, ~ 2GB/s memcpy vs ~ 400MB/s).
Though, this loosely lines up with the stats on the RAM modules (module specs, followed by observed memcpy speed):
BJX2 Core : DDR2, 16-bit, 50 MHz, ~ 55 MB/s
2003 Laptop: DDR1, 64-bit, 100 MHz, ~ 400 MB/s
2009 Laptop: DDR3, 64-bit, 667 MHz, ~ 2000 MB/s
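For reference (my arithmetic), the theoretical peaks under the usual DDR math (bus bytes x 2 transfers/clock x clock):
BJX2 Core : 2B x 2 x 50 MHz = 200 MB/s
2003 Laptop: 8B x 2 x 100 MHz = 1600 MB/s
2009 Laptop: 8B x 2 x 667 MHz = ~ 10700 MB/s
The 8x figure below is the 1600/200 ratio.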
One would expect an 8x ratio between the BJX2 core and the 2003 laptop; observation seems closer to 7x.
Both laptops have 2 DIMMs, but performance seems to match expectations from 1 DIMM.
Observations tend to undershoot the theoretical bandwidth; in memcpy tests I usually don't see much more than around 1/4 of the theoretical bandwidth number.
The theoretical limit should be 50% for memcpy and 100% for memset (since a copy moves each byte twice); observation is typically 25% for memcpy and 50% for memset.
Well, except on the BJX2 core, where memcpy seems to come in higher than the theoretical estimate (which may be an issue with the measurement), and memset gets nearly full RAM bandwidth.
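For reference, a minimal sketch of the sort of memcpy bandwidth test being described (the buffer size, iteration count, and use of clock() are my own arbitrary choices):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Minimal memcpy bandwidth test; buffers are sized well past L2 so
   that this measures RAM rather than cache. */
int main(void)
{
    size_t  sz = 16 << 20;             /* 16 MB, larger than L2 */
    int     n = 64, i;
    char   *a = malloc(sz), *b = malloc(sz);
    clock_t t0, t1;

    if (!a || !b) return 1;
    memset(a, 1, sz);                  /* touch the pages first */
    memset(b, 2, sz);
    t0 = clock();
    for (i = 0; i < n; i++)
        memcpy(b, a, sz);
    t1 = clock();
    printf("memcpy: ~%.1f MB/s\n",
        (sz / 1048576.0) * n / ((double)(t1 - t0) / CLOCKS_PER_SEC));
    free(a); free(b);
    return 0;
}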
It appears both the BJX2 core and the 2003 laptop have the same size of L2 cache (256K on the XC7A100T).
For L1 local copy, roughly:
BJX2 core: ~ 290 MB/s (theoretical limit, 400 MB/s)
2003 : ~ 900 MB/s (theoretical limit: 2.1 GB/s)
2009 : ~ 4700 MB/s (theoretical limit: 6.3 GB/s)
The BJX2 has a higher theoretical limit (relative to clock-speed) here due to being able to load/store 16 bytes per clock cycle.
For the 2003 laptop, I am assuming a limit of 4 bytes/cycle (DWORD load/store, with pipelined blocks), but it seems to underperform for some reason (giving ~ 43% of the theoretical limit, rather than ~ 70-75%).
For the 2009 laptop, assuming QWORD copy, it seems to operate within expectations (~ 75%). Though, on the 2009 laptop, SSE-based copies are also an option (it has "MOVDQU" and similar).
Quake3 had seemingly used a trick of using x87 operations to move memory around, but this seems to be slower than using DWORDs. Trying to compile the same C code on the 2009 laptop (as 64-bit) breaks, because "MOVSD" seems to mangle anything holding NaNs on the 2009 laptop's CPU.
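For illustration, sketches (my own code) of the copy-loop variants in question; alignment and tail handling are omitted for brevity. The SSE2 version stays in the integer domain, so unlike a copy routed through FP loads/stores it cannot reinterpret NaN bit patterns along the way:

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Plain DWORD copy: 4 bytes per iteration. */
static void copy_dword(void *d, const void *s, size_t sz)
{
    uint32_t *dp = d;
    const uint32_t *sp = s;
    size_t i;
    for (i = 0; i < sz / 4; i++)
        dp[i] = sp[i];
}

/* SSE2 copy via MOVDQU: 16 bytes per iteration, integer domain. */
static void copy_sse2(void *d, const void *s, size_t sz)
{
    size_t i;
    for (i = 0; i + 16 <= sz; i += 16)
        _mm_storeu_si128((__m128i *)((char *)d + i),
            _mm_loadu_si128((const __m128i *)((const char *)s + i)));
}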
...