On 2/18/2025 7:07 AM, Michael S wrote:
> On Tue, 18 Feb 2025 02:55:33 +0000
> mitchalsup@aol.com (MitchAlsup1) wrote:
>
>> It takes Round Nearest Odd to perform Kahan-Babuška Summation.
>
> Are you aware of any widespread hardware that supplies Round to Nearest
> with tie broken to Odd? Or of any widespread language that can request
> such a rounding mode?
> Until both exist, implementing RNO on niche HW looks to me like a waste
> of both HW resources and of space in your datasheet.
> Instead, think of what you possibly forgot to do in order to help
> software implementation of IEEE binary128. That would be orders of
> magnitude more useful in the real world. And don't get me wrong,
> "orders of magnitude more useful" is still a small niche on the
> absolute scale of usefulness.
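For reference, a C sketch of the Kahan-Babuška (Neumaier variant) summation being discussed; the compensation term 'c' recovers the rounding error of each add under the usual round-to-nearest:

```c
#include <math.h>

/* Kahan-Babuška (Neumaier variant) compensated summation: 'c'
   accumulates the low-order bits lost by each addition, and is
   folded back in at the end. */
static double kb_sum(const double *x, int n)
{
    double s = 0.0, c = 0.0;
    for (int i = 0; i < n; i++) {
        double t = s + x[i];
        if (fabs(s) >= fabs(x[i]))
            c += (s - t) + x[i];  /* low bits of x[i] were lost */
        else
            c += (x[i] - t) + s;  /* low bits of s were lost */
        s = t;
    }
    return s + c;
}
```

The classic test case {1.0, 1e100, 1.0, -1e100} sums to 2.0 here, where naive summation gives 0.0.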
IME, what helps with Binary128 is mostly support for efficient handling of big integer values:
128-bit, ideally fast;
256-bit, at least semi-efficient
Will need 256-bit ADD/SUB.
You will likely need a 128 x 128 -> 256-bit widening multiplier.
In my case, this is best implemented with 32x32->64 bit widening multiply ops, plus MOVLLD/MOVLHD/MOVHLD/MOVHHD (roughly equivalent to PCKBB/PCKBT/PCKTB/PCKTT in the RV-P extension; MOVLLD is also equivalent to 'PACK' in RV-B, but RV-B lacks the other variants [1]).
[1]: Where 'B' has a subset of things that are useful, a bunch of holes where stuff that would be useful is absent or was dropped, and a bunch of random stuff that seems very niche and/or unlikely to be all that useful.
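As a rough illustration, the 128x128->256-bit widening multiply can be built from 32x32->64 widening multiplies as described above (a plain C sketch; the limb layout and names are my own, not any particular ISA):

```c
#include <stdint.h>

/* 128x128 -> 256-bit widening multiply, built from 32x32->64-bit
   widening multiplies. Operands and result are little-endian arrays
   of 32-bit limbs (illustrative layout). */
static void mul128x128(const uint32_t a[4], const uint32_t b[4],
                       uint32_t r[8])
{
    uint64_t acc[8] = {0};
    for (int i = 0; i < 4; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < 4; j++) {
            uint64_t p = (uint64_t)a[i] * b[j];  /* 32x32->64 widening */
            uint64_t s = acc[i + j] + (p & 0xFFFFFFFFu) + carry;
            acc[i + j] = s & 0xFFFFFFFFu;
            carry = (s >> 32) + (p >> 32);
        }
        acc[i + 4] += carry;
    }
    for (int i = 0; i < 8; i++)
        r[i] = (uint32_t)acc[i];
}
```

The 256-bit ADD/SUB works out similarly: 32- or 64-bit adds with explicit carry propagation across limbs.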
>> That is:: comply with IEEE 754-2019
>
> I'd say, comply with the mandatory requirements of IEEE 754-2019.
> For optional requirements, be selective. Prefer those that can be
> accessed from widespread languages (including upcoming editions of
> language standards) over the rest.
I lean toward thinking a different direction might instead be preferable: rather than going for "numerical purity" or "having a bunch of math features in hardware", try to optimize for something that:
More or less gives the useful parts of the IEEE specs;
Can be made exact (reproducible with integer operations) at a reasonable cost.
Would keep the formats from the newer standards, though probably toss out the Decimal formats.
Though, this would go in a different direction, say:
DAZ+FTZ with Truncate as semi-canonical;
Sub-ULP bits explicitly fall off the bottom in a defined way.
Could be optimized for implementation either in hardware or with equivalent-size integer operations.
Say, one could imagine an abstract model where Binary64 FADD works sort of like:
sgnA=valA>>63;
sgnB=valB>>63;
expA=(valA>>52)&2047;
expB=(valB>>52)&2047;
fraA=(valA&((1ULL<<52)-1));
fraB=(valB&((1ULL<<52)-1));
if(expA!=0)fraA|=1ULL<<52;
if(expB!=0)fraB|=1ULL<<52;
fraA=fraA<<9; //9 sub ULP bits
fraB=fraB<<9;
shrA=(expB>=expA)?(expB-expA):0;
shrB=(expA>=expB)?(expA-expB):0;
sgn2A=sgnA; exp2A=expA; fra2A=fraA>>shrA;
sgn2B=sgnB; exp2B=expB; fra2B=fraB>>shrB;
exp2C=(expB>=expA)?expB:expA; //larger exponent carried forward
//logical clock-edge here.
fr1C_A=fra2A+fra2B;
fr1C_B=fra2A-fra2B;
fr1C_C=fra2B-fra2A;
if(sgn2A^sgn2B)
{
if(fr1C_C>>63)
{ sgn1C=sgn2A; fra1C=fr1C_B; }
else
{ sgn1C=sgn2B; fra1C=fr1C_C; }
}
else
{ sgn1C=sgn2A; fra1C=fr1C_A; }
//logical clock-edge here (latched: sgn2C=sgn1C, fra2C=fra1C).
if(fra2C>>62)
{ exp3C=exp2C+1; fra3C=fra2C>>1; }
else
{ shl=clz64(fra2C)-2; exp3C=exp2C-shl; fra3C=fra2C<<shl; }
//logical clock-edge here.
if((exp3C>=2047) || (exp3C<=0))
{ sgnC=sgn2C; expC=(exp3C<=0)?0:2047; fraC=0; }
else
{
sgnC=sgn2C; expC=exp3C; fraC=(fra3C>>9)&((1ULL<<52)-1); //drop hidden bit
//if rounding is done, it goes here.
}
valC=(sgnC<<63)|(expC<<52)|fraC;
//final clock edge.
//result is now ready.
There are some other tricks possible in a Verilog implementation that are absent in this C-like model, but they don't change the end result.
The main expensive parts being the mantissa-adder and shifts.
The shift usually requires log2(N) stages of 2-input MUXing.
Or, 1 MUX stage for each bit of the shift-amount value.
Though, at least on Xilinx hardware, 2 bits of MUX can be fused into 1 stage of LUT6s (so, a 64-bit shift needs ~3 levels of LUT6).
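The MUX-stage structure can be mirrored in plain C, one stage per bit of the shift amount (a sketch of the structure only, not of any particular implementation):

```c
#include <stdint.h>

/* A 64-bit right shift as log2(64)=6 levels of 2-way MUXing,
   one level per bit of the shift amount; this mirrors how the
   hardware stages stack. */
static uint64_t shr64_staged(uint64_t v, int amt)
{
    v = (amt & 1)  ? (v >> 1)  : v;
    v = (amt & 2)  ? (v >> 2)  : v;
    v = (amt & 4)  ? (v >> 4)  : v;
    v = (amt & 8)  ? (v >> 8)  : v;
    v = (amt & 16) ? (v >> 16) : v;
    v = (amt & 32) ? (v >> 32) : v;
    return v;
}
```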
Still not great to try doing much more than a shift in a single clock cycle. Though it costs more resource budget, doing the shifts and add/subtracts in parallel gives lower latency than trying to detect and invert the output (and a bitwise NOT does not give acceptable results here; the subtracts need to be two's complement).
It can be cheaper to implement the FPU with ones' complement, but this does not give acceptable results if one tries doing integer operations on the FPU (it can be considered a reasonable request that integer math done via the FPU gives exact integer answers for the range covered by the mantissa).
But, this logic is still annoyingly expensive...
Generally, some special cases can be added on the input and output side to allow for integer conversion.
Int->Binary64:
Handled as an ADD between a negative zero and a synthesized FP value (non-normalized). Basically, fake the signs and exponent and shove the integer value into the mantissa.
Binary64->Int:
Handled as an ADD between a zero mantissa with a large exponent, and the input value. The resulting mantissa (extracted before normalization, and with some sign-selection hacks) being used as the output value.
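These same tricks have well-known software analogues; a C sketch of both directions (the constants are the standard 2^52-based "magic numbers"; this assumes round-to-nearest and in-range inputs):

```c
#include <stdint.h>
#include <string.h>

/* Int->Binary64: fake the exponent and shove the integer into the
   mantissa (giving 2^52 + u), then subtract 2^52 to renormalize. */
static double u32_to_double_via_bits(uint32_t u)
{
    uint64_t bits = 0x4330000000000000ULL | (uint64_t)u;  /* 2^52 + u */
    double d;
    memcpy(&d, &bits, sizeof d);
    return d - 4503599627370496.0;  /* minus 2^52 */
}

/* Binary64->Int: add a constant with a large exponent so the input
   ends up in the low mantissa bits of the sum. Valid for the int32
   range; note it rounds to nearest rather than truncating. */
static int32_t double_to_int32_via_add(double x)
{
    double sum = x + 6755399441055744.0;  /* 2^52 + 2^51 */
    uint64_t bits;
    memcpy(&bits, &sum, sizeof bits);
    return (int32_t)(uint32_t)bits;  /* low 32 mantissa bits */
}
```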
It is tempting to consider an intermediate between Binary16 and BF16:
Binary16: Often not enough dynamic range, but more precision than needed;
BF16: Overkill dynamic range, not enough precision.
I am starting to suspect S.E6.M9 might have been closer to ideal.
S.E7.M8 goes probably too far.
It is annoying that one needs to consider having both Binary16 and BF16, but adding a 3rd format to the mix wouldn't necessarily make this better.
There are cases where Binary16 does not have enough precision, but usually these end up being handled with 16-bit fixed-point. If anything, a case could be made for integer conversions with a variable exponent offset (to support fixed-point).
Say (in RV terms):
FCNVSC.X.D Xd, Fs, ImmExpAdj //convert to integer with scale
FCNVSC.D.X Fd, Xs, ImmExpAdj //convert to double with scale
Which could avoid needing a multiply or divide to scale FPU values.
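In plain C terms, the hypothetical FCNVSC pair would behave roughly like ldexp fused with a conversion (the mnemonics and exact semantics here are my own sketch):

```c
#include <math.h>
#include <stdint.h>

/* Roughly what FCNVSC.X.D / FCNVSC.D.X would compute: scale by
   2^ImmExpAdj as part of the conversion, e.g. for Q16.16 fixed-point
   use +16 on the way in and -16 on the way out. (Sketch only; the C
   cast truncates toward zero, a real encoding might round.) */
static int64_t fcnvsc_x_d(double x, int expadj)
{ return (int64_t)ldexp(x, expadj); }

static double fcnvsc_d_x(int64_t v, int expadj)
{ return ldexp((double)v, expadj); }
```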
Also, maybe instructions to add/subtract a value to/from the exponent without needing to use a multiply.
Say:
FADJEXP Fd, Fs, ImmExpAdj
Which does the equivalent of multiply/divide by power of 2.
Could take the place of multiply in operations like:
y=x*4096;
Or:
y=x/4096;
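In software terms, FADJEXP would amount to adding the immediate to the exponent field (a sketch handling only the normal range; zero/Inf/NaN and over/underflow would need the usual special cases):

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical FADJEXP: add an immediate to the Binary64 exponent
   field, i.e. multiply/divide by a power of two without a multiplier.
   Normal-range inputs only; no over/underflow handling. */
static double fadjexp(double x, int adj)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    int exp = (int)((bits >> 52) & 2047);
    if (exp != 0 && exp != 2047) {  /* skip zero/subnormal/Inf/NaN */
        exp += adj;
        bits = (bits & ~(2047ULL << 52)) | ((uint64_t)exp << 52);
        memcpy(&x, &bits, sizeof bits);
    }
    return x;
}
```

So y=x*4096; becomes fadjexp(x, 12), and y=x/4096; becomes fadjexp(x, -12).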
...