> On Sun, 19 May 2024 18:37:51 +0200
> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>> Thomas Koenig wrote:
>>> So, I did some more measurements on the POWER9 machine, and it came
>>> to around 18 cycles per FMA. Compared to the 13 cycles for the
>>> FMA instruction, this actually sounds reasonable.
>>> The big problem appears to be that, in this particular
>>> implementation, multiplication is not pipelined, but done
>>> piecewise by addition. This can be explained by the fact that
>>> this is mostly a decimal unit, with the 128-bit QP just added as
>>> an afterthought, and decimal multiplication does not happen all
>>> that often.
>>> A fully pipelined FMA unit capable of 128-bit arithmetic would be
>>> an entirely different beast; I would expect a throughput of 1 per
>>> cycle and a latency of (maybe) one cycle more than 64-bit FMA.
>>
>> The FMA normalizer has to handle a maximally bad cancellation, so it
>> needs to be around 350 bits wide. Mitch knows of course but I'm
>> guessing that this could at least be close to needing an extra cycle
>> on its own and/or heroic hardware?
>> Terje
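To make the cancellation concern concrete, here is a minimal Python sketch using exact integers. It assumes 113-bit binary128 significands (hidden bit included) and an addend whose exponent lines it up with the top of the product. The helper name `leading_zeros_after_cancel` is made up for illustration, and the case shown is just one bad case, not a claim about the worst one or about how wide real hardware has to be.

```python
P = 113  # assumed binary128 significand width, hidden bit included

def leading_zeros_after_cancel(a: int, b: int, c: int, c_shift: int) -> int:
    """Leading zeros of |a*b - (c << c_shift)| inside a 2*P-bit product field."""
    diff = abs(a * b - (c << c_shift))
    return 2 * P - diff.bit_length()

# Two significands just above 1.0: 1.000...01 in binary.
a = b = (1 << (P - 1)) + 1

# Addend significand chosen as the product truncated to P bits, so that
# subtracting it cancels everything except the low-order product bits.
c = (a * b) >> (P - 1)
assert c.bit_length() == P  # still a normalized P-bit significand

print(leading_zeros_after_cancel(a, b, c, P - 1))  # prints 225
```

Here only the very last bit of the 226-bit product survives the subtraction, so the normalizer would have to shift across almost the whole product field to find the leading one.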
> Why so wide?

Consider a 128-bit FP container. 1-bit sign

> Assuming that subnormal multiplier inputs are normalized before
> multiplication,

Bad assumption for HW, maybe acceptable for SW.

> the product of multiplication is 226 bits with two MS
> bits != '00'. I don't see how we would ever need more than 229 bits fed
> into the accumulation phase and into the following normalizer.

Augend can be positioned 113-bits above the tree or right below the

> Of course, all
> bits that are lower than the LS bit have to be collapsed (by OR) into the
> LS bit. Maybe even fewer than 229 bits will do; by now I am not sure.
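The 226-bit product claim and the "collapse by OR" step can be checked with a short Python sketch on exact integers. It assumes normalized 113-bit significands, as the quoted argument does. The function names `product_bits` and `collapse_sticky` are hypothetical, and nothing here settles whether 229 bits are enough for the accumulation phase.

```python
P = 113  # assumed binary128 significand width, hidden bit included

def product_bits(a: int, b: int) -> int:
    """Bit length of the exact product of two normalized P-bit significands."""
    assert a.bit_length() == P and b.bit_length() == P
    return (a * b).bit_length()

def collapse_sticky(value: int, keep: int) -> int:
    """Keep the top `keep` bits of value and OR all dropped bits into the new LS bit."""
    drop = value.bit_length() - keep
    if drop <= 0:
        return value  # nothing to drop
    sticky = 1 if value & ((1 << drop) - 1) else 0
    return (value >> drop) | sticky

lo = 1 << (P - 1)   # smallest normalized significand, 1.000...0
hi = (1 << P) - 1   # largest normalized significand, 1.111...1

# The product needs 225 or 226 bits, so in a 226-bit field the two MS bits
# are '01', '10' or '11', never '00'.
print(product_bits(lo, lo), product_bits(hi, hi))  # prints 225 226

# Sticky collapse: any nonzero dropped bit sets the retained LS bit.
print(bin(collapse_sticky(0b10100001, 4)))  # prints 0b1011
```

The sticky bit keeps the "something nonzero was below here" information that rounding needs, without carrying the dropped bits themselves through the adder.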