MitchAlsup1 wrote:

OK.

BGB-Alt wrote:

So this is basically due to the product part still being in carry-save
format, so it cannot easily be moved/aligned; instead the augend has to
be able to move to either side of it. OK, that makes sense!

On 5/20/2024 7:28 AM, Terje Mathisen wrote:

Anton Ertl wrote:

Michael S <already5chosen@yahoo.com> writes:

On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

The FMA normalizer has to handle a maximally bad cancellation, so it
needs to be around 350 bits wide. Mitch knows of course but I'm
guessing that this could at least be close to needing an extra cycle
on its own and/or heroic hardware?
Terje
Why so wide?
Assuming that subnormal multiplier inputs are normalized before
multiplication, the product of the multiplication is 226 bits (two
113-bit significands, counting the hidden bit).
The product of the mantissa multiplication is at most 226 bits even if
you don't normalize subnormal numbers. For cancellation to play a
role the addend has to be close in absolute value and have the
opposite sign to the product, so at most one additional bit comes into
play for that case (for something like the product being
0111111... and the addend being -10000000...).
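(To see that cancellation in miniature, here is a small binary64
example of my own, not from the thread: the fused operation keeps
low-order bits that a separate multiply-plus-add rounds away:)

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double a = 1.0 + 0x1p-27;      /* (1+x) with x = 2^-27 */
        double b = 1.0 - 0x1p-27;      /* (1-x)                */
        /* a*b = 1 - 2^-54 exactly; rounded to binary64 it becomes
           1.0, so multiply-then-add cancels to zero, while fma()
           sees the exact product and keeps the -2^-54. */
        printf("%a\n", a * b - 1.0);     /* 0x0p+0   */
        printf("%a\n", fma(a, b, -1.0)); /* -0x1p-54 */
        return 0;
    }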
This is the part of Mitch's explanation that I have never been able to
totally grok. I do think you could get away with fewer bits, but only
if you can collapse the extra mantissa bits into sticky while aligning
the product with the addend. If that takes too long, or it turns out to
be easier/faster in hardware to simply work with a much wider mantissa,
then I'll accept that.

I don't think I've ever seen Mitch make a mistake on anything like
this!
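(A minimal C sketch of that collapse-into-sticky step, using a single
64-bit word for brevity; the function name and shape are mine, purely
illustrative:)

    #include <stdint.h>

    /* Right-shift a mantissa word by sh bit positions and OR all of
       the shifted-out bits into a single sticky flag, so they never
       have to be carried along in the datapath. */
    static uint64_t shift_right_sticky(uint64_t m, unsigned sh,
                                       unsigned *sticky)
    {
        if (sh >= 64) {                /* everything shifts out */
            *sticky |= (m != 0);
            return 0;
        }
        *sticky |= ((m & ((1ULL << sh) - 1)) != 0);
        return m >> sh;
    }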
It is a mystery, though it seems like maybe Binary128 FMA could be done
in software via an internal 384-bit intermediate?... My thinking is,
say, 112*112, padded by 2 bits (so 114 bits), leads to 228 bits. If one
adds another 116 bits (for maximal FADD), this comes to 344.
Maximal product with minimal augend::

pppppppp-pppppppp-aaaaaaaa

Maximal augend with minimal product::

aaaaaaaa-pppppppp-pppppppp

So the way one builds HW is to have the augend shifter cover the whole
4× length and place the product in the middle::

max                              min
aaaaaaaa-aaaaaaaa-aaaaaaaa-aaaaaaaa
         pppppppp-pppppppp
The output of the product is still in carry-save form and the augend is
in pure binary so the adder is 3-input for 2×-width. This generates a
carry into the high order incrementor.
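(For what "carry-save" buys you, a minimal C sketch of one 3:2
carry-save step, my illustration rather than a description of Mitch's
actual datapath:)

    #include <stdint.h>

    /* One 64-bit slice of a 3:2 carry-save adder: three addends are
       reduced to a sum word plus a carry word (shift the carry word
       left one bit before the final resolving add).  No carry ever
       propagates, so the delay is one full-adder regardless of the
       width of the operands. */
    static void csa3to2(uint64_t a, uint64_t b, uint64_t c,
                        uint64_t *sum, uint64_t *carry)
    {
        *sum   = a ^ b ^ c;
        *carry = (a & b) | (a & c) | (b & c);
    }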
So one has a sticky generator for the right hand side augend, and an
incrementor for the left hand side augend. When doing high speed
denormals one cannot count on the left hand side of the product to have
HoBs set, with the standard ramifications (imagine a denormal product
and a denormal augend, and you want the right answer.)

Any way you cook it, you have a 4× wide intermediate (minus 2 bits
IIRC): 4×112 = 448; 448 - 2 = 446.
There is a reason these things are not standard at this point of
technology.
When I looked into possible options, the most sensible option at the
time seemed to be to support some 128-bit ALU instructions to allow for
faster software emulation; mostly, the cost of having 128-bit ALU ops
is significantly less than that of doing 128-bit floating-point in
hardware.

This is an intentional feature, not a bug!
Could you do it (IEEE accuracy) with less HW--yes, but only if you
allow certain special cases to take more cycles in calculation. At a
certain point (a point made by Terje) it is easier to implement with
wide integer calculations 128+128 and/or 128*128 along with
double-width shifts, inserts, and extracts.
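(A minimal C sketch of the 128+128 case built from 64-bit halves,
assuming nothing beyond plain C; the names are illustrative:)

    #include <stdint.h>

    /* 128+128 -> 128 addition from 64-bit words; returns the carry
       out of the high word. */
    static unsigned add128(uint64_t ah, uint64_t al,
                           uint64_t bh, uint64_t bl,
                           uint64_t *rh, uint64_t *rl)
    {
        uint64_t lo = al + bl;
        unsigned c  = (lo < al);          /* carry out of low word */
        uint64_t hi = ah + bh + c;
        unsigned c2 = (hi < ah) || (c && hi == ah);
        *rl = lo;
        *rh = hi;
        return c2;                        /* carry out of high word */
    }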
IEEE did not make these things any easier by having the 2× std width
fraction carry 2×+3 bits of length, requiring 8 multiplications with
minimal HW instead of 4 multiplications. On the other hand, IBM did us
no favors with Hex FP either (keeping the exponent size the same and
having 2×+8 bits of fraction.)
By making sure that all IEEE larger formats have a mantissa with at least 2n+3 bits compared to the smaller format below, you avoid all double rounding issues if you do a calculation in the larger format and then immediately store it back to a smaller format container.
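(A small C illustration of my own for the float/double pair, where the
same margin holds since 52 >= 2*23+3: a near-halfway case that a
narrower intermediate could double-round incorrectly comes out right:)

    #include <stdio.h>

    int main(void)
    {
        float a = 0x1.000002p0f;    /* 1 + 2^-23     */
        float b = 0x1.000002p-24f;  /* 2^-24 + 2^-47 */
        /* The exact sum is 1 + 2^-23 + 2^-24 + 2^-47, just above a
           halfway case for float.  binary64 holds it exactly, so
           rounding to double and then to float matches a single
           correctly-rounded float addition. */
        float via_double = (float)((double)a + (double)b);
        float direct     = a + b;
        printf("%a %a\n", via_double, direct);   /* identical */
        return 0;
    }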
By also having a wider exponent you can do things like sqrt(x^2+y^2) and completely avoid spurious overflows during the squaring ops: as long as the final result fits in float, it will be the correct result.
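(Again a sketch of my own in C: with floats promoted to double, x*x
and y*y stay far below double's overflow threshold, so no spurious
infinity can appear:)

    #include <math.h>

    /* sqrt(x^2 + y^2) for floats, computed in double: the largest
       float squared is ~1.2e77, nowhere near double's ~1.8e308
       limit, so the squarings cannot overflow. */
    static float hypot_f(float x, float y)
    {
        double xd = x, yd = y;
        return (float)sqrt(xd * xd + yd * yd);
    }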
We started out with 1:8:23 and 1:11:52, then we got 1:15:112 at the higher end and 1:5:10 for fp16 and 1:3:4 for fp8.
Do note that the 8- and 16-bit variants do break the 2n+3 rule. Also note that the AI training people like truncated 32-bit, i.e. bfloat16 (1:8:7), which keeps the full float range but with ~1/3 the mantissa resolution.
Anyway, doing fp128 in SW I would of course do it using u64 unsigned integer ops: FMUL128 becomes 4 64x64->128 MUL ops plus the adding/merging of the terms and a bunch of bookkeeping work on the signs and exponents.
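(A minimal sketch of just the mantissa multiply in C, using the
unsigned __int128 extension for the 64x64->128 pieces; the
sign/exponent bookkeeping and rounding are left out, and all names are
mine:)

    #include <stdint.h>

    typedef unsigned __int128 u128;   /* GCC/Clang extension */

    /* 128x128 -> 256-bit unsigned multiply from four 64x64->128
       partial products; r[0] is the least significant word. */
    static void mul128(uint64_t ah, uint64_t al,
                       uint64_t bh, uint64_t bl, uint64_t r[4])
    {
        u128 ll = (u128)al * bl;
        u128 lh = (u128)al * bh;
        u128 hl = (u128)ah * bl;
        u128 hh = (u128)ah * bh;

        /* Merge the middle terms; the u128 sums absorb the carries. */
        u128 mid = (u128)(uint64_t)lh + (uint64_t)hl
                 + (uint64_t)(ll >> 64);
        u128 hi  = hh + (lh >> 64) + (hl >> 64) + (mid >> 64);

        r[0] = (uint64_t)ll;
        r[1] = (uint64_t)mid;
        r[2] = (uint64_t)hi;
        r[3] = (uint64_t)(hi >> 64);
    }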
With a single fully pipelined integer multiplier taking 4 cycles, this would be 7 cycles for the MULs, with the last three cycles overlapped with the initial ADD/ADC operations. Seems like it could be doable in sub-20 cycles?
I'm assuming the CPU is wide enough that the special cases can be handled in parallel with the default/normal-inputs case, and that reg-reg MOVs are zero cycles (handled in the renamer), in order to overcome the dedicated-register (RDX) issue which we have retained even when using MULX.
Terje