Subject: Re: Continuations
From: SFuld (at) *nospam* alumni.cmu.edu.invalid (Stephen Fuld)
Newsgroups: comp.arch
Date: 17 Jul 2024, 21:56:06
Organization: A noiseless patient Spider
Message-ID: <v79b56$20oq8$1@dont-email.me>
References: 1 2 3 4 5 6 7 8 9 10 11
User-Agent: XanaNews/1.21-f3fb89f (x86; Portable ISpell)
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:
MitchAlsup1 wrote:
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
MitchAlsup1 <mitchalsup@aol.com> wrote:
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles as FDIV, not 10×
more.
Maybe time for a little story.
Some unspecified time ago, a colleague did CFD calculations
which included fluid flow (including turbulence modelling and
diffusion) and quite a few chemical reactions together. So, he
evaluated a huge number of Arrhenius equations,
k = A * exp(-E_a/(R*T))
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors
(aka small steps). His calculation spent most of the time
evaluating the Arrhenius equation above many, many, many, many
times.
A single calculation took months, and he didn't use weak
hardware.
A fully pipelined evaluation of, let's say, four parallel exp
and four parallel fdiv instructions would have reduced his
calculation time by orders of magnitude, and allowed him to
explore the design space instead of just scratching the surface.
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already
:-)
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
Does not look "all that bad" to me.
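For readers who do not read the assembly, here is a minimal Python sketch of the same four-step sequence, one operation per line. The function names and the sample values of A, E_a and T are illustrative assumptions, not values from this thread; R is the molar gas constant in J/(mol*K).

```python
import math

R_GAS = 8.314462618  # molar gas constant, J/(mol*K)

def arrhenius_steps(A, E_a, T, R=R_GAS):
    """Evaluate k = A * exp(-E_a/(R*T)) one operation at a time."""
    t = R * T        # FMUL Rt,RR,RT
    t = -E_a / t     # FDIV Rt,-RE,Rt
    t = math.exp(t)  # EXP  Rt,Rt
    return A * t     # FMUL Rk,RA,Rt

def arrhenius_direct(A, E_a, T, R=R_GAS):
    """Same formula written as a single expression, for comparison."""
    return A * math.exp(-E_a / (R * T))

if __name__ == "__main__":
    # Hypothetical inputs, chosen only to exercise the code.
    print(arrhenius_steps(1.0e13, 8.0e4, 1000.0))
```

Each step consumes the result of the one before it, which is why the sequence forms a single dependency chain.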
So for your GBOoO CPU, how many of the various FP operations and
EXP instructions can be done in parallel?
FMUL is 4 cycles of latency fully pipelined
FDIV is ~20 cycles of latency not pipelined
EXP is ~16 cycles of latency not pipelined
They are all performed in the FMAC unit and here the instructions are
serially dependent.
So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
see the same latency; that is, GBOoO is not a differentiator.
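The 44-cycle figure is just the sum along the dependency chain, 4 + 20 + 16 + 4. As a rough sketch, the toy model below lets an instruction start only when its predecessor has finished and an issue slot is free. It treats the approximate FDIV and EXP latencies as exact, and the simple in-order issue rule is an assumption made for illustration, not a description of the actual machine.

```python
# Latencies quoted above; the FDIV and EXP values are approximate ("~").
LATENCY = {"FMUL": 4, "FDIV": 20, "EXP": 16}
CHAIN = ["FMUL", "FDIV", "EXP", "FMUL"]  # each op depends on the previous one

def chain_finish_cycle(chain, latency, issue_width):
    """Cycle at which the last op of a serially dependent chain completes.

    Toy model (assumption): op i can issue no earlier than cycle
    i // issue_width and must also wait for its predecessor's result.
    """
    finish = 0
    for i, op in enumerate(chain):
        start = max(finish, i // issue_width)
        finish = start + latency[op]
    return finish

if __name__ == "__main__":
    for width in (1, 6):
        print(width, chain_finish_cycle(CHAIN, LATENCY, width))
```

In this model both widths give the same finish cycle, because every instruction is waiting on data rather than on an issue slot.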
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can proceed in parallel?
--
 - Stephen Fuld (e-mail address disguised to prevent spam)