Subject : Re: Making Lemonade (Floating-point format changes)
From : already5chosen (at) *nospam* yahoo.com (Michael S)
Newsgroups : comp.arch
Date : 20. May 2024, 09:30:45
Organisation : A noiseless patient Spider
Message-ID : <20240520113045.000050c5@yahoo.com>
References : 1 2 3 4 5 6 7 8 9 10 11 12
User-Agent : Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
On Mon, 20 May 2024 09:24:16 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

> Michael S wrote:
>> On Sun, 19 May 2024 18:37:51 +0200
>> Terje Mathisen <terje.mathisen@tmsw.no> wrote:
>>> Thomas Koenig wrote:
>>>> So, I did some more measurements on the POWER9 machine, and it
>>>> came to around 18 cycles per FMA. Compared to the 13 cycles for
>>>> the FMA instruction, this actually sounds reasonable.
>>>>
>>>> The big problem appears to be that, in this particular
>>>> implementation, multiplication is not pipelined, but done
>>>> piecewise by addition. This can be explained by the fact that
>>>> this is mostly a decimal unit, with the 128-bit QP just added as
>>>> an afterthought, and decimal multiplication does not happen all
>>>> that often.
>>>>
>>> A fully pipelined FMA unit capable of 128-bit arithmetic would be
>>> an entirely different beast, I would expect a throughput of 1 per
>>> cycle and a latency of (maybe) one cycle more than 64-bit FMA.
>>> The FMA normalizer has to handle a maximally bad cancellation, so
>>> it needs to be around 350 bits wide. Mitch knows of course but I'm
>>> guessing that this could at least be close to needing an extra
>>> cycle on its own and/or heroic hardware?
>>>
>>> Terje
>> Why so wide?
>> Assuming that subnormal multiplier inputs are normalized before
> They are not, this is part of what you do to make subnormal numbers
> exactly the same speed as normal inputs.
>
> Terje
1. I am not sure that "the same speed" is a worthy goal even for
binary64 (for binary32 it is).
2. It certainly does not sound like a worthy goal for binary128,
where the probability of encountering sub-normal inputs in real user
code, rather than in test vectors, is lower than for DP by another
order of magnitude.
3. Even if, for a reason unclear to me, it is considered the goal, it
can be achieved by introducing one more pipeline stage everywhere.
Since we are discussing a high-latency design akin to POWER9, the
relative cost of another stage would be lower. BTW, according to the
POWER9 manual, even for SP/DP FMA the latency is not constant. It
varies from 5 to 7 cycles.
So, IMHO, what you do to handle sub-normal inputs should depend on what
ends up smaller or faster, not on some abstract principles. For a less
important unit, like binary128, 'smaller' would likely take
precedence over 'faster'. It's possible that you'll end up
not doing pre-normalization, but the reason for it would be
different from 'same speed'.
Besides, pre-normalization vs. wider post-normalization are not the
only available choices. When the multiplier is naturally segmented into
57-bit sections, there exists, for example, an option of
pre-normalization by a full section. It looks very simple on the front
end and saves quite a lot of shifter width on the back end.
But the best option is probably the one described in the above post by
Mitch. If I understood his post correctly, he suggests having two
alignment stages: one after the multiplication and another one after
the add/sub. The shift count for the first stage is calculated from
the inputs in parallel with the multiplication. The first alignment
stage does not try to achieve perfect normalization, but it does
enough to cut the width of the following adder from 3N to 2N+eps.