Subject : Re: Making Lemonade (Floating-point format changes)
From : cr88192 (at) *nospam* gmail.com (BGB)
Groups : comp.arch
Date : 15 May 2024, 19:44:25
Organization : A noiseless patient Spider
Message-ID : <v22vqg$11i4m$1@dont-email.me>
References : 1 2 3
User-Agent : Mozilla Thunderbird
On 5/15/2024 4:07 AM, Michael S wrote:
> On Sun, 12 May 2024 15:30:40 +0200
> wolfgang kern <nowhere@never.at> wrote:
>> On 12/05/2024 05:44, John Savard wrote:
>>> I've made another long-overdue change in the Concertina II
>>> architecture on the page about 17-bit instructions.
>>>
>>> Since I describe the individual instructions there, with their
>>> opcodes and what they do, I've illustrated the floating-point
>>> formats of the architecture on that page.
>>>
>>> The good people in charge of the IEEE 754 standard had seen fit to
>>> define a standard 128-bit floating-point format which included a
>>> hidden first bit.
>>>
>>> This annoyed me greatly, because I was going to take the 8087's
>>> temporary real format, and extend the mantissa for my 128-bit
>>> format.
>>>
>>> I've decided that it's necessary to fully accept the 128-bit
>>> standard and support it in a consistent manner.
>>>
>>> Therefore, I have taken the following actions:
>>>
>>> I have dropped the option of supporting 80-bit temporary reals
>>> entirely, as they are now incompatible as an internal format.
>>>
>>> I have instead defined a 256-bit format for floats which does not
>>> have a hidden first bit, which looks like the old temporary reals,
>>> except that the exponent field is one bit wider.
>>>
>>> And in addition, just as the IBM 704 used two single-precision
>>> floats to make a double-precision float, and the IBM System/360
>>> Model 85 started using two double-precision floats to make an
>>> extended precision float... I've defined how the 256-bit internal
>>> format floats can be doubled up to make a 512-bit float.
>>>
>>> I'm not really sure such floating-point precision is useful, but I
>>> do remember some people telling me that higher float precision is
>>> indeed something to be desired. Well, the IEEE 754 standard has
>>> forced my hand.
>>
>> YES, I'd use something similar:
>> I never cared nor supported any odd 10 byte formats and I give a fart
>> to all these weird IEEE standards.
>
> I suppose, it's mutual.

In my case, I care about what the IEEE standard says only to the extent that it seems relevant and justified.
In practice, this mostly means drawing a line in the sand at subnormal numbers and things that exist entirely in the sub-ULP domain.
If one needs to roughly double the cost of the FPU for the sake of a fraction of a bit of rounding accuracy, that does not seem justifiable.
One could argue that determinism is important, but determinism could be achieved more cheaply via other means:
Truncate rounding;
Explicitly discarding low-order results.
For a fair amount of fixed-point code, the issue of low-order results is reduced by right-shifting the inputs before the multiply.
So, rather than, say:
z=(x*y)>>16;
You have:
z=(x>>8)*(y>>8);
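To make the tradeoff concrete, here is a small throwaway test in plain C (the 16.16 values are just made up for illustration):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 16.16 fixed point: 1.5 * 2.25 = 3.375 (0x00036000) */
    int32_t x = 0x18000, y = 0x24000;

    /* full-width product, shifted once at the end */
    int32_t z_full = (int32_t)(((int64_t)x * y) >> 16);
    /* pre-shifted inputs, discarding low bits up front */
    int32_t z_fast = (x >> 8) * (y >> 8);
    printf("full=%08x fast=%08x\n", z_full, z_fast);  /* agree here */

    /* once low-order bits are set, the fast form diverges */
    x += 0xFF; y += 0xFF;
    z_full = (int32_t)(((int64_t)x * y) >> 16);
    z_fast = (x >> 8) * (y >> 8);
    printf("full=%08x fast=%08x\n", z_full, z_fast);
    return 0;
}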
Though, effectively discarding half the mantissa on input would be undesirable for FMUL, as it would significantly reduce precision. In fixed-point code, the trick is usually used where speed matters more than accuracy, and where a full-width product would require a slower multiply internally (such as a "long long" multiply).
In my ISA though, I partly addressed this scenario by adding widening 32-bit multiply ops (32*32->64).
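In C terms, the widening op corresponds to something like this (a sketch; the instruction encoding itself isn't shown):

#include <stdint.h>

/* 16.16 multiply via a widening 32*32->64 product: all input
   bits participate, and truncation happens only once at the end. */
static inline int32_t fxmul(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 16);
}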
One could instead define the multiplier as, say:
z=((x>>8)*(y>>8))+(((x&255)*(y>>8))>>8)+(((x>>8)*(y&255))>>8);
But, granted, this still produces some intermediate low-order bits only to discard them.
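If one wants to see how close this form gets, a quick brute-force comparison against the exact product (again, just a throwaway harness, not production code):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int32_t worst = 0;
    for (int i = 0; i < 1000000; i++) {
        int32_t x = (int32_t)((((uint32_t)rand() << 12) ^ (uint32_t)rand()) & 0xFFFFF);
        int32_t y = (int32_t)((((uint32_t)rand() << 12) ^ (uint32_t)rand()) & 0xFFFFF);
        int32_t exact  = (int32_t)(((int64_t)x * y) >> 16);
        int32_t approx = ((x >> 8) * (y >> 8))
                       + (((x & 255) * (y >> 8)) >> 8)
                       + (((x >> 8) * (y & 255)) >> 8);
        if (exact - approx > worst)
            worst = exact - approx;
    }
    /* the dropped (x&255)*(y&255) term plus the two truncating
       shifts leave the result short by a few ULP at most */
    printf("worst underestimate: %d ULP\n", worst);
    return 0;
}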
Though, this pattern isn't too far off from what my FPU uses (splitting it up among the DSP multipliers and a few smaller LUT-based multipliers to try to fudge the low-order bits).
Annoyingly, the exact pattern for a strict truncate would cost more to implement than these inexact constructions when dealing with hard-logic multipliers.
For the most part, application code doesn't care...
Though, if a program uses a Newton-Raphson loop that terminates only when the result converges exactly, it will tend to get stuck in an infinite loop, as exact convergence is never achieved. The typical workaround is to run a fixed number of iterations instead.
Granted, I will not claim this is a perfect solution, but it is mostly "good enough".
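For illustration, the fixed-count pattern looks something like the following (a generic C sketch using the well-known exponent-flip seed trick; not code from my actual FPU or runtime):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Newton-Raphson reciprocal, r' = r*(2 - d*r), run for a fixed
   number of steps rather than looping until the value stops
   changing (which may never happen with truncated rounding). */
static float nr_recip(float d)
{
    uint32_t u; float r;
    memcpy(&u, &d, sizeof u);
    u = 0x7EF311C3u - u;        /* crude seed; positive normals only */
    memcpy(&r, &u, sizeof r);
    for (int i = 0; i < 3; i++) /* fixed count, no exactness test */
        r = r * (2.0f - d * r);
    return r;
}

int main(void)
{
    printf("%.9g\n", nr_recip(3.0f));  /* ~0.333333343 */
    return 0;
}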
The results of strict truncation would be more obvious, in my experience: calculations that feed back into themselves (and assume round-to-nearest) will tend to drift.
But, for the most part, a "round the low 8 bits unless it would result in a carry" scheme is also "good enough"...
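As a rough sketch of what such carry-limited rounding could look like (illustrative C only; the function name and bit positions here are made up, and the real logic would live in the FPU datapath):

#include <stdint.h>

/* Round-to-nearest on the low 8 bits of a mantissa, except when
   the round-up increment would carry out past bit 15; in that
   case fall back to truncation, so the carry chain stays short. */
static uint64_t round_low8(uint64_t m)
{
    if ((m & 0x80) && ((m & 0xFF00) != 0xFF00))
        m += 0x100;              /* carry absorbed within bits 8..15 */
    return m & ~(uint64_t)0xFF;  /* low 8 bits dropped either way */
}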