Liste des Groupes | Revenir à c arch |
Thomas Koenig wrote:If you organize the multiplications and accumulations from mostSo, I did some more measurements on the POWER9 machine, and it cameThe FMA normalizer has to handle a maximally bad cancellation, so it needs to be around 350 bits wide. Mitch knows of course but I'm guessing
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.
The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done by
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.
A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.
that this could at least be close to needing an extra cycle on its own and/or heroic hardware?
Terje
Les messages affichés proviennent d'usenet.