On Sun, 19 May 2024 11:17:41 -0000 (UTC)
Thomas Koenig <tkoenig@netcologne.de> wrote:
So, I did some more measurements on the POWER9 machine, and it came
to around 18 cycles per FMA. Compared to the 13 cycles for the
FMA instruction, this actually sounds reasonable.
I.e. your actual running frequency was 3700 MHz?
The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.
A fully pipelined FMA unit capable of 128-bit arithmetic would be
an entirely different beast, I would expect a throughput of 1 per
cycle and a latency of (maybe) one cycle more than 64-bit FMA.
There exists a middle ground between non-pipelined and fully pipelined
multiplier/FMA units. In fact, more than one middle ground.
Here is the mid-middle ground that I can imagine, not being a real hardware
guy:
1 - take a pair of existing VSU multipliers. By now they can do
53x53=>106bit unsigned multiplication. Enhance them to 57x57=>114bit.
2 - during quad-precision FMA, split the 113x113 multiplication into 4
pieces and run them through the pair of multipliers, two at a time.
That would produce all parts of the 226-bit product at a rate of 1 product
per 2 clocks.
3 - build adders just sufficient for the same throughput of 1 result
per 2 clocks.
Such a combined multiplier will have 2 clocks higher latency than the DP
multiplier.
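The split described in the steps above can be sketched in software. This is only an illustration of the arithmetic, not a model of the real VSU: the function name split_multiply and the constant LIMB_BITS are made up for this example, and the assumption is a plain schoolbook split of each 113-bit significand into a low 57-bit limb and a high 56-bit limb.

```python
LIMB_BITS = 57   # assumed input width of the enhanced multiplier
SIG_BITS = 113   # binary128 significand width, including the implicit bit


def split_multiply(a: int, b: int, limb_bits: int = LIMB_BITS) -> tuple[int, list[int]]:
    """Multiply two significands using four limb-sized partial products.

    Returns the full product and the list of partial products, so the caller
    can check that every piece fits a limb_bits x limb_bits multiplier.
    """
    mask = (1 << limb_bits) - 1
    a_lo, a_hi = a & mask, a >> limb_bits
    b_lo, b_hi = b & mask, b >> limb_bits
    # The high limbs must also fit the multiplier inputs.
    assert a_hi <= mask and b_hi <= mask, "operand wider than two limbs"

    # Four pieces; with two multipliers, two pieces are issued per clock.
    partials = [a_lo * b_lo, a_lo * b_hi, a_hi * b_lo, a_hi * b_hi]
    shifts = [0, limb_bits, limb_bits, 2 * limb_bits]

    # The adder tree aligns each piece by its weight and sums them.
    product = sum(p << s for p, s in zip(partials, shifts))
    return product, partials


if __name__ == "__main__":
    a = (1 << (SIG_BITS - 1)) | 0x1234567  # normalized: top bit set
    b = (1 << SIG_BITS) - 1                # largest possible significand
    product, partials = split_multiply(a, b)
    print(product == a * b, [p.bit_length() for p in partials])
```

Four pieces shared between two multipliers take two issue slots, which is where the rate of 1 product per 2 clocks comes from.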
After that we'll need matching alignment and addition/subtraction
blocks, but by doing them half-pipelined we can utilize the majority of
the existing dual-DP hardware and would need very little else, except for
control signals and probably a new feedback data path on the upper
side of the adder. All that could cost us another clock of latency over
DP FMA, but not necessarily so.
Bottom line: QP FMA with throughput of 1 result per 2 clocks and
latency of 8 or 9 clocks.
For POWER8, which has a less distributed VSU, such a modification would be
somewhat easier than for POWER9.
That's what I call a mid-middle ground.
Low-middle ground would be leaving the 53x53=>106bit multipliers
unmodified. The 113x113 multiplication is split into 9 pieces and
a product is delivered every 5 clocks.
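The piece counts and clock counts for the low and mid variants follow from simple counting. The sketch below assumes a schoolbook split into equal-width limbs and one partial product per multiplier per clock, with no scheduling overhead; the helper name pieces_and_clocks is hypothetical.

```python
def pieces_and_clocks(sig_bits: int, mult_bits: int, multipliers: int = 2) -> tuple[int, int]:
    """Count partial products and issue clocks for a split multiplication.

    sig_bits    -- significand width of the wide format (113 for QP)
    mult_bits   -- input width of each available multiplier
    multipliers -- how many multipliers work in parallel (assumed 2)
    """
    limbs = -(-sig_bits // mult_bits)       # ceiling division
    pieces = limbs * limbs                  # every limb of a times every limb of b
    clocks = -(-pieces // multipliers)      # ceiling division again
    return pieces, clocks


if __name__ == "__main__":
    print(pieces_and_clocks(113, 53))  # unmodified DP multipliers
    print(pieces_and_clocks(113, 57))  # multipliers widened to 57 bits
```

With 53-bit inputs the 113-bit significand needs three limbs, hence 9 pieces and 5 clocks; widening to 57 bits brings it down to two limbs, 4 pieces and 2 clocks.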
High-middle ground is enhancing both VSU pipes and using them to
process two QP FMAs simultaneously for combined throughput equivalent
to fully pipelined.
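As a quick check of that claim, here is a small sketch that assumes each enhanced pipe sustains 1 QP FMA per 2 clocks on its own, as in the mid-middle variant; the names PER_PIPE and combined_throughput are illustrative only.

```python
from fractions import Fraction

# Results per clock for one half-pipelined QP FMA pipe (assumed, per the text).
PER_PIPE = Fraction(1, 2)


def combined_throughput(pipes: int, per_pipe: Fraction = PER_PIPE) -> Fraction:
    """Aggregate results per clock when each pipe works on its own FMA."""
    return pipes * per_pipe


if __name__ == "__main__":
    print(combined_throughput(1))  # one pipe
    print(combined_throughput(2))  # both pipes, independent FMAs
```

Two pipes at half rate add up to one result per clock, the same aggregate rate as a single fully pipelined unit, although the latency of each individual FMA is unchanged.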
Another possible high-middle ground is, again, enhancing both VSU pipes
and using them together on a single QP FMA. That would be potentially
best for latency, but does not fit well into the philosophy of the POWER9
design, which tries to minimize high-speed interaction between various
pipes.