On Mon, 20 May 2024 14:22:00 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Mon, 20 May 2024 09:24:16 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Michael S wrote:
On Sun, 19 May 2024 18:37:51 +0200
Terje Mathisen <terje.mathisen@tmsw.no> wrote:
Thomas Koenig wrote:
So, I did some more measurements on the POWER9 machine, and it
came to around 18 cycles per FMA. Compared to the 13 cycles for
the FMA instruction, this actually sounds reasonable.

The big problem appears to be that, in this particular
implementation, multiplication is not pipelined, but done
piecewise by addition. This can be explained by the fact that
this is mostly a decimal unit, with the 128-bit QP just added as
an afterthought, and decimal multiplication does not happen all
that often.

A fully pipelined FMA unit capable of 128-bit arithmetic would
be an entirely different beast; I would expect a throughput of
1 per cycle and a latency of (maybe) one cycle more than 64-bit
FMA.

The FMA normalizer has to handle a maximally bad cancellation, so
it needs to be around 350 bits wide. Mitch knows of course but
I'm guessing that this could at least be close to needing an
extra cycle on its own and/or heroic hardware?

Terje

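For reference, a rough width estimate. This is only a sketch under the textbook assumption that the aligned FMA datapath spans 3p + 2 bits for a p-bit significand (a p-bit addend window, two guard bits, and the 2p-bit product); the name `fma_datapath_bits` is made up for illustration. With p = 113 it gives 341, which is in the same range as the "around 350 bits" quoted above.

```python
# Sketch only: textbook estimate of the aligned FMA datapath width.
# Assumption (not from the thread): width = p (addend window) + 2 guard bits + 2p (product).

SIGNIFICAND_BITS = {"binary32": 24, "binary64": 53, "binary128": 113}

def fma_datapath_bits(p):
    """Bits the adder/normalizer has to span for a p-bit significand."""
    return p + 2 + 2 * p

if __name__ == "__main__":
    for name, p in SIGNIFICAND_BITS.items():
        print(f"{name}: p={p}, datapath ~{fma_datapath_bits(p)} bits")
```
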
Why so wide?
Assuming that subnormal multiplier inputs are normalized before

They are not; this is part of what you do to make subnormal numbers
exactly the same speed as normal inputs.

Terje

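For context, pre-normalizing a subnormal input (the option being ruled out here) would amount to something like the following. It is a minimal sketch: the convention value = sig * 2**(exp - (p - 1)) and the name `prenormalize` are assumptions for illustration, not anything taken from the POWER9 design.

```python
# Sketch only: pre-normalize a subnormal significand with a leading-zero count.
# Assumed convention: value = sig * 2**(exp - (p - 1)), with sig held in a p-bit field.

def prenormalize(sig, exp, p):
    """Shift sig left until bit p-1 is set and lower exp by the same amount."""
    if sig == 0:
        return 0, exp                    # zero has no leading one to find
    shift = p - sig.bit_length()         # leading zeros inside the p-bit field
    return sig << shift, exp - shift

if __name__ == "__main__":
    # Smallest binary64 subnormal: sig = 1 at the minimum exponent -1022.
    sig, exp = prenormalize(1, -1022, 53)
    print(hex(sig), exp)
```

The shift count is a full leading-zero count over p bits, which is the extra front-end work (and latency) that the "same speed" argument wants to avoid.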
1. I am not sure that "the same speed" is a worthy goal even for
binary64 (for binary32 it is).
2. It certainly does not sound like a worthy goal for binary128,
where the probability of encountering subnormal inputs in real user
code, rather than in test vectors, is lower than for DP by another
order of magnitude.
3. Even if, for reasons unclear to me, it is considered a goal, it
can be achieved by introducing one more pipeline stage
everywhere. Since we are discussing a high-latency design akin to
POWER9, the relative cost of another stage would be lower. BTW,
according to the POWER9 manual, even for SP/DP FMA the latency is not
constant. It varies from 5 to 7.

So, IMHO, what you do to handle subnormal inputs should depend on
what ends up smaller or faster, not on some abstract principles.
For a less important unit, like binary128, 'smaller' would likely take
relative precedence over 'faster'. It's possible that you'll end up
not doing pre-normalization, but the reason for it would be
different from 'same speed'.

Besides, pre-normalization vs wider post-normalization are not the
only available choices. When the multiplier is naturally segmented into
57-bit sections, there exists, for example, the option of
pre-normalization by a full section. It looks very simple on the
front and saves quite a lot of shifter width on the back.

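To make the "by full section" option concrete, here is a minimal sketch. The two-section, 114-bit field and the names `prenormalize_by_section` and `leading_zeros` are assumptions for illustration; the only figure taken from the text above is the 57-bit section size.

```python
# Sketch only: pre-normalize by whole sections instead of by single bits.
# Assumption: the significand sits in a field of SECTIONS * SECTION_BITS bits.

SECTION_BITS = 57
SECTIONS = 2            # 2 * 57 = 114 bits, enough for a 113-bit significand

def prenormalize_by_section(sig, exp):
    """Drop all-zero top sections; return (sig, exp, sections_skipped)."""
    field = SECTIONS * SECTION_BITS
    skipped = 0
    # Stop before the last section so that an all-zero input terminates.
    while skipped < SECTIONS - 1 and (sig >> (field - SECTION_BITS)) == 0:
        sig <<= SECTION_BITS
        exp -= SECTION_BITS
        skipped += 1
    return sig, exp, skipped

def leading_zeros(sig):
    """Leading zeros of sig inside the full field."""
    return SECTIONS * SECTION_BITS - sig.bit_length()
```

The front end then needs only a zero-detect per section and a section-select mux, and a nonzero input is left with fewer than 57 leading zeros for the back end to absorb.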
But the best option is probably the one described in the post above by Mitch.
If I understood his post correctly, he suggests having two
alignment stages: one after multiplication and another one after
add/sub. The shift count for the first stage is calculated from the
inputs in parallel with multiplication. The first alignment stage
does not try to achieve perfect normalization, but it does
enough to cut the width of the following adder from 3N to 2N+eps.

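The first-stage shift count can be sketched as follows. This is my reading of the idea, not Mitch's actual design: the count is simply the sum of the leading zeros of the two inputs, which needs no product bits and leaves at most one leading zero for the later stage. The names `coarse_shift` and `aligned_product` are made up for illustration.

```python
# Sketch only: coarse alignment count taken from the inputs, fine fix-up left for later.
# Assumption: a and b are nonzero p-bit significands, possibly with leading zeros (subnormal).
import random

def coarse_shift(a, b, p):
    """Leading zeros of a plus leading zeros of b; needs no product bits."""
    return (p - a.bit_length()) + (p - b.bit_length())

def aligned_product(a, b, p):
    """Return the 2p-bit product after the coarse shift and its residual leading zeros."""
    shift = coarse_shift(a, b, p)          # can be computed in parallel with the multiply
    prod = (a * b) << shift
    residual = 2 * p - prod.bit_length()   # what the second stage still has to remove
    return prod, residual

if __name__ == "__main__":
    rng = random.Random(1)
    p = 113
    worst = max(
        aligned_product(rng.getrandbits(rng.randint(1, p)) | 1,
                        rng.getrandbits(rng.randint(1, p)) | 1, p)[1]
        for _ in range(10000)
    )
    print("worst residual leading zeros:", worst)
```

Since the bit length of a product is either the sum of the input bit lengths or one less, the residual is always 0 or 1.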
I do agree with Mitch's suggestion: allow subnormal inputs but do the
partial muls from the top and move the normalization starting point
down for each all-zero input block.

In an extreme case (subnormal x subnormal) this would allow you to
discard a lot of partial products.

Terje

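The saving from skipping all-zero input blocks can be illustrated with a small model. It is a sketch only: the 57-bit block size is borrowed from earlier in the thread, while the two-blocks-per-operand layout and the names `split` and `block_multiply` are assumptions.

```python
# Sketch only: form partial products per input block and skip all-zero blocks.
# Assumptions: 57-bit blocks, two blocks per operand (114-bit field).

BLOCK_BITS = 57
BLOCKS = 2
MASK = (1 << BLOCK_BITS) - 1

def split(x):
    """Blocks of x, most significant first."""
    return [(x >> (i * BLOCK_BITS)) & MASK for i in reversed(range(BLOCKS))]

def block_multiply(a, b):
    """Return (product, partial_products_computed)."""
    total, computed = 0, 0
    for i, a_blk in enumerate(split(a)):
        if a_blk == 0:
            continue                       # a whole row of partial products is discarded
        for j, b_blk in enumerate(split(b)):
            if b_blk == 0:
                continue
            # List index 0 is the top block, so convert indices back to bit weights.
            weight = (2 * (BLOCKS - 1) - i - j) * BLOCK_BITS
            total += (a_blk * b_blk) << weight
            computed += 1
    return total, computed
```

With both operands confined to their low block, only one of the four block products is formed, and the product is still exact.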
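For subnormal x subnormal you don't need the result of the multiplication at
all. All you need to know is whether it is zero or not, and its sign.
Even that is needed only in non-default rounding modes, and for the inexact
flag in the default mode.

For most non-tiny formats, the advantage of subnormal numbers seems small
in any case.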
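The subnormal x subnormal case can be sketched like this. It assumes an IEEE binary format, where the product of two subnormals is always below half of the smallest subnormal; the function name `tiny_product` and the mode names are illustrative only.

```python
# Sketch only: subnormal x subnormal for an IEEE binary format, without the product bits.
# Assumption: both inputs are subnormal or zero, so any nonzero product lies
# below half of the smallest subnormal.
import math

def tiny_product(neg_a, a_is_zero, neg_b, b_is_zero, mode="nearest"):
    """Return (negative, magnitude, inexact); magnitude is 'zero' or 'min_subnormal'.

    mode is one of 'nearest', 'toward_zero', 'up', 'down' (illustrative names).
    """
    negative = neg_a != neg_b              # sign is the XOR of the input signs
    if a_is_zero or b_is_zero:
        return negative, "zero", False     # exact zero, nothing to round
    # Only a directed mode pointing away from zero produces a nonzero result.
    away = (mode == "up" and not negative) or (mode == "down" and negative)
    return negative, ("min_subnormal" if away else "zero"), True

if __name__ == "__main__":
    # Cross-check against binary64 in the default rounding mode.
    tiny = 5e-324                          # smallest binary64 subnormal
    p = -tiny * tiny
    print(p == 0.0, math.copysign(1.0, p))
    print(tiny_product(True, False, False, False))
```

In the default mode the sign and zero test only decide the sign of the zero and whether inexact is raised; the result magnitude is affected only in the directed modes.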