Subject: Re: Continuations
From: tkoenig (at) *nospam* netcologne.de (Thomas Koenig)
Newsgroups: comp.arch
Date: 18 Jul 2024, 07:00:46
Organization: A noiseless patient Spider
Message-ID : <v7ab2e$29lv1$1@dont-email.me>
References : 1 2 3 4 5 6 7 8 9 10 11
User-Agent : slrn/1.0.3 (Linux)
MitchAlsup1 <mitchalsup@aol.com> wrote:
On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:
>
MitchAlsup1 wrote:
>
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
>
MitchAlsup1 <mitchalsup@aol.com> wrote:
>
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.
>
Maybe time for a little story.
>
Some unspecified time ago, a colleague did CFD calculations which
included fluid flow (including turbulence modelling and diffusion)
and quite a few chemical reactions together. So, he evaluated a
huge number of Arrhenius equations,
>
k = A * exp(-E_a/(R*T))
>
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculation spent most of the time evaluating
the Arrhenius equation above many, many, many, many times.
>
A single calculation took months, and he didn't use weak hardware.
>
A fully pipelined evaluation of, let's say, four parallel exp and
four parallel fdiv instructions would have reduced his calculation
time by orders of magnitude, and allowed him to explore the design
space instead of just scratching the surface.
>
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already :-)
>
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
>
Does not look "all that bad" to me.
>
So for your GBOoO CPU, how many of the various FP operations, and the
EXP instruction can be done in parallel?
>
FMUL is 4 cycles of latency fully pipelined
FDIV is ~20 cycles of latency not pipelined
EXP is ~16 cycles of latency not pipelined
Ah, OK.
>
They are all performed in the FMAC unit and here the instructions are
serially dependent.
A loop containing the calculation could be unrolled, but without
a large effect.
>
So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
see the same latency; that is, GBOoO is not a differentiator.
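
The 44 cycles are simply the sum of the four latencies in the dependent chain. The short C sketch below spells that out and adds a rough occupancy estimate for one FMAC unit. The occupancy model is an assumption, not something stated in the thread: it supposes that a non-pipelined FDIV or EXP blocks the unit for its full latency and that a pipelined FMUL takes one issue slot.

```c
/* Latencies quoted in the thread, in cycles. */
enum { FMUL_LATENCY = 4, FDIV_LATENCY = 20, EXP_LATENCY = 16 };

/* Latency of one serially dependent FMUL -> FDIV -> EXP -> FMUL chain. */
static int chain_latency(void)
{
    return FMUL_LATENCY + FDIV_LATENCY + EXP_LATENCY + FMUL_LATENCY;
}

/* ASSUMPTION (not from the thread): cycles for which a single FMAC unit
   is occupied per evaluation, if each pipelined FMUL takes one issue slot
   and each non-pipelined op blocks the unit for its full latency. */
static int unit_occupancy(void)
{
    return 1 + FDIV_LATENCY + EXP_LATENCY + 1;
}
```

Under that model, overlapping independent evaluations on one unit could at best bring the cost per result from 44 down to about 38 cycles, which fits the remark that unrolling would not have a large effect.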
What about the SIMD width underlying the VVM implementation?
All SIMD implementations I know of allow performing floating point
ops in parallel. Is it planned that My 66000 can also do that?
(If not, that would be a big disadvantage for scientific/technical
work).