Subject : Re: Continuations
From : mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups : comp.arch
Date : 18. Jul 2024, 17:40:11
Organisation : Rocksolid Light
Message-ID : <2bdee5008840a5584e9de557a5dfd88d@www.novabbs.org>
References : 1 2 3 4 5 6 7 8 9 10 11 12
User-Agent : Rocksolid Light
On Thu, 18 Jul 2024 12:10:45 +0000, EricP wrote:
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:
>
MitchAlsup1 wrote:
>
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
>
MitchAlsup1 <mitchalsup@aol.com> wrote:
>
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10× more.
>
Maybe time for a little story.
>
Some unspecified time ago, a colleague did CFD calculations which
included fluid flow (including turbulence modelling and diffusion)
and quite a few chemical reactions together. So, he evaluated a
huge number of Arrhenius equations,
>
k = A * exp(-E_a/(R*T))
>
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors (aka
small steps). His calculation spent most of the time evaluating
the Arrhenius equation above many, many, many, many times.
>
A single calculation took months, and he didn't use weak hardware.
>
A fully pipelined evaluation of, let's say, four parallel exp and
four parallel fdiv instructions would have reduced his calculation
time by orders of magnitude, and allowed him to explore the design
space instead of just scratching the surface.
>
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already :-)
>
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
>
Does not look "all that bad" to me.
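
For readers who want to map the four instructions back onto the formula, here is a minimal Python sketch that follows the same serially dependent steps. The function name arrhenius, the parameter names, and the constant R_GAS are illustrative assumptions and not part of any ISA or of the original post.

```python
import math

# Molar gas constant in J/(mol*K); the name R_GAS is an assumption for this sketch.
R_GAS = 8.314462618

def arrhenius(a, e_a, temp, r=R_GAS):
    """Evaluate k = A * exp(-E_a / (R * T)) as four dependent steps."""
    t = r * temp      # FMUL Rt,RR,RT   : R * T
    t = -e_a / t      # FDIV Rt,-RE,Rt  : -E_a / (R * T)
    t = math.exp(t)   # EXP  Rt,Rt      : exp(-E_a / (R * T))
    return a * t      # FMUL Rk,RA,Rt   : A * exp(...)
```

Each step consumes the result of the previous one, which is why the sequence forms a single dependency chain.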
>
So for your GBOoO CPU, how many of the various FP operations, and the
EXP instruction can be done in parallel?
>
FMUL is 4 cycles of latency fully pipelined
FDIV is ~20 cycles of latency not pipelined
EXP is ~16 cycles of latency not pipelined
>
They are all performed in the FMAC unit and here the instructions are
serially dependent.
>
So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
see the same latency; that is, GBOoO is not a differentiator.
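
The 44-cycle figure is just the sum of the latencies along the dependency chain. A small sketch, using the approximate latencies quoted above (the names LATENCY, CHAIN and chain_latency are made up for illustration):

```python
# Approximate latencies in cycles, taken from the figures quoted in this post.
LATENCY = {"FMUL": 4, "FDIV": 20, "EXP": 16}

# The Arrhenius sequence: each instruction depends on the one before it.
CHAIN = ["FMUL", "FDIV", "EXP", "FMUL"]

def chain_latency(chain, latency=LATENCY):
    """Latency of a serially dependent chain is the sum of its parts."""
    return sum(latency[op] for op in chain)

print(chain_latency(CHAIN))
```

Issue width does not appear anywhere in the calculation, which is the point: for one dependent chain, a wider machine cannot shorten the critical path.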
>
If the FP multiplier is a 4-stage pipeline, and FDIV is iterating using
the multiplier, can the pipeline get a mix of multiple operations going
at once? FDIV for both Newton–Raphson and Goldschmidt iterates serially
so each can only use one of the 4 pipeline slots.
Over the 20 cycles the multiplier is doing Goldschmidt iterations, there
are only 3 slots where a different instruction could sneak through.
Note: the multiplier used in Goldschmidt iterations is busy every cycle:
the first multiply drives the denominator towards 1.0, the second drives
the numerator towards the quotient.
That is, it's a 4-cycle pipeline unit from the outside, but a 2-cycle
pipeline unit from within the function unit.
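
To make the two multiplies per iteration concrete, here is a rough software sketch of Goldschmidt division. It assumes the denominator has already been scaled into [0.5, 1.0) and uses a plain 2 - d correction factor with no table-lookup seed, so it needs more iterations than a hardware unit would; the function name and the iteration count are illustrative assumptions, not a description of any particular FMAC design.

```python
def goldschmidt_divide(n, d, iterations=6):
    """Approximate n / d, assuming 0.5 <= d < 1.0 (pre-scaled denominator)."""
    if not 0.5 <= d < 1.0:
        raise ValueError("this sketch assumes d is scaled into [0.5, 1.0)")
    for _ in range(iterations):
        f = 2.0 - d   # correction factor derived from the current denominator
        d = d * f     # first multiply: denominator driven towards 1.0
        n = n * f     # second multiply: numerator driven towards the quotient
    return n

print(goldschmidt_divide(1.0, 0.75))
```

The two multiplies in each pass are independent of each other but both depend on the previous pass, so they can go down the multiplier pipeline back to back while the next pass has to wait for their results.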