On 5/20/2024 7:28 AM, Terje Mathisen wrote:
> Anton Ertl wrote:
>> Michael S <already5chosen@yahoo.com> writes:
>>> On Sun, 19 May 2024 18:37:51 +0200
>>> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>>>> The FMA normalizer has to handle a maximally bad cancellation, so it
>>>> needs to be around 350 bits wide. Mitch knows of course but I'm
>>>> guessing that this could at least be close to needing an extra cycle
>>>> on its own and/or heroic hardware?
>>>>
>>>> Terje
>>>
>>> Why so wide?
>>> Assuming that subnormal multiplier inputs are normalized before
>>> multiplication, the product of multiplication is 226 bits
>>
>> The product of the mantissa multiplication is at most 226 bits even if
>> you don't normalize subnormal numbers. For cancellation to play a
>> role the addend has to be close in absolute value and have the
>> opposite sign as the product, so at most one additional bit comes into
>> play for that case (for something like the product being
>> 0111111... and the addend being -10000000...).
>
> This is the part of Mitch's explanation that I have never been able to
> totally grok, I do think you could get away with less bits, but only if
> you can collapse the extra mantissa bits into sticky while aligning the
> product with the addend. If that takes too long or it turns out to be
> easier/faster in hardware to simply work with a much wider mantissa,
> then I'll accept that.
>
> I don't think I've ever seen Mitch make a mistake on anything like
> this!
>
> Terje
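The sticky-collapse idea in the quoted text can be sketched in a few lines (a minimal illustration, not Mitch's or Terje's actual hardware): when aligning the addend against the product, every bit shifted off the bottom is OR-ed into a single sticky bit, since correct rounding only needs to know whether *any* discarded bit was nonzero.

```python
def align_with_sticky(sig: int, shift: int) -> int:
    """Right-shift sig by shift, OR-ing every shifted-out bit into the LSB."""
    if shift <= 0:
        return sig << -shift
    lost = sig & ((1 << shift) - 1)       # bits that fall off the bottom
    return (sig >> shift) | (1 if lost else 0)
```

This is why a datapath narrower than the full worst-case width can still round correctly, at the cost of computing the sticky reduction during the alignment shift.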
It is a mystery, though it seems like maybe Binary128 FMA could be done
in software via an internal 384-bit intermediate?...

My thinking is, say, 112*112, padded by 2 bits (so 114 bits), leads to
228 bits. If one adds another 116 bits (for maximal FADD), this comes
to 344.
In this case, 384 bits would be because my "_BitInt" support code pads things to a multiple of 128 bits (for integer types larger than 256
bits).
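The width arithmetic above is easy to sanity-check (a hedged illustration reproducing the post's numbers, not the actual "_BitInt" runtime code; binary128's significand is 113 bits counting the hidden bit):

```python
# Reproduce the bit-width bookkeeping from the post.
P = 113
max_sig = (1 << P) - 1                      # all-ones binary128 significand
print((max_sig * max_sig).bit_length())     # exact significand product: 226

mul_width = 2 * (112 + 2)                   # 112-bit inputs + 2 pad bits -> 228
total = mul_width + 116                     # headroom for a maximal FADD -> 344
padded = -(-total // 128) * 128             # round up to a 128-bit multiple
print(mul_width, total, padded)             # 228 344 384
```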
It isn't fast, but I am not against Binary128 being slower, since if
one is using Binary128 ("long double" or "__float128" in this case), it
is likely that precision is more of a priority than speed.
Though, as of yet, there is no Binary128 FMA operation (in the software
runtime). Could potentially add this in theory.
I guess it may also be possible to add the FADDX/FMULX/FMACX
instructions in a form where they are allowed but turned into runtime
traps (likely routed through the TLB Miss ISR, which thus far has ended
up as a catch-all for this sort of thing...).
Though, likely more efficient would still be "just use the runtime
calls".