> On 5/19/2024 11:37 AM, Terje Mathisen wrote:
>> Thomas Koenig wrote:
>>> So, I did some more measurements on the POWER9 machine, and it came
>>> to around 18 cycles per FMA. Compared to the 13 cycles for the FMA
>>> instruction, this actually sounds reasonable.
>>>
>>> The big problem appears to be that, in this particular
>>> implementation, multiplication is not pipelined, but done piecewise
>>> by addition. This can be explained by the fact that this is mostly
>>> a decimal unit, with the 128-bit QP just added as an afterthought,
>>> and decimal multiplication does not happen all that often.
>>>
>>> A fully pipelined FMA unit capable of 128-bit arithmetic would be
>>> an entirely different beast; I would expect a throughput of 1 per
>>> cycle and a latency of (maybe) one cycle more than 64-bit FMA.
>>
>> The FMA normalizer has to handle a maximally bad cancellation, so it
>> needs to be around 350 bits wide. Mitch knows of course, but I'm
>> guessing that this could at least be close to needing an extra cycle
>> on its own and/or heroic hardware?
> This sort of thing is part of what makes proper FMA hopelessly
> expensive.

Getting the LoB correctly rounded showed up the generation prior to [...]
> Granted, full FMA also allows faking higher precision using SIMD
> vector operations, with math that does not work with double-rounded
> FMA instructions.

It also enabled error-free floating-point calculations, but no
existing [...]
> Well, and also an issue if one can "just barely" afford to have a
> single double-precision unit.

This is NOT an architectural issue, but an implementation-choice issue.
> Though, the trick of possibly having four 27-bit multiplies which
> combine into a virtual 54-bit multiplier seems like an interesting
> possibility, though not great, as DSPs don't natively handle this
> size (and it would be too expensive to stretch it out with LUTs).
> Likely, one would need to build it from 34*34->68-bit multipliers
> (each costing 4 DSPs).

This is your implementation choice coloring what you take as [...]
> In terms of DSP cost, it would be higher than the current solution:
> 16 vs 6+4 (10).

We can now fit (5nm) hundreds of GBOoO cores on a single die. The [...]
> But, possibly lower LUT cost (in both the Binary32 and Binary64
> multipliers, the shortfall is made up using smaller LUT-based
> multipliers).
> Though, with the combiner option, one could make a case for, say, an
> S.E15.F66.Z46 format (Z = zeroed/ignored).
> Well, and/or accept the wonk of a Binary128 which produces 112 bits
> of mantissa but only uses the high 66 bits or so; though generally
> this was worse for some things in some tests than one which simply
> zeroes the low-order bits.

But it allows for exact FP arithmetic, and for FMAC, ... and lots of
other good properties.
> But, OTOH, 66*66->112 would allow for possible trickery to fake a
> full Binary128 FMUL in software as a multi-part process (when
> combined with a Binary128 FADD).

A 1-bit wide machine can perform 128 × 128 + 128 FMACs -- it just
takes [...]
>>
>> Terje