Subject : Re: Continuations
From : mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups : comp.arch
Date : 18 Jul 2024, 00:17:37
Organisation : Rocksolid Light
Message-ID : <99f80e5c5452ec87cf6f5a70dcb33863@www.novabbs.org>
References : 1 2 3 4 5 6 7 8 9 10 11 12
User-Agent : Rocksolid Light
On Wed, 17 Jul 2024 20:56:06 +0000, Stephen Fuld wrote:
MitchAlsup1 wrote:
>
On Wed, 17 Jul 2024 18:30:47 +0000, Stephen Fuld wrote:
>
MitchAlsup1 wrote:
>
On Wed, 17 Jul 2024 16:50:27 +0000, Thomas Koenig wrote:
>
MitchAlsup1 <mitchalsup@aol.com> schrieb:
>
What I am talking about is to improve their performance until a
sin() takes about the same number of cycles of FDIV, not 10×
more.
>
Maybe time for a little story.
>
Some unspecified time ago, a colleague did CFD calculations
which included fluid flow (including turbulence modelling and
diffusion) and quite a few chemical reactions together. So, he
evaluated a huge number of Arrhenius equations,
>
k = A * exp(-E_a/(R*T))
>
and because some of the reactions he looked at were highly
exothermic or endothermic, he needed tiny relaxation factors
(aka small steps). His calculation spent most of the time
evaluating the Arrhenius equation above many, many, many, many
times.
>
A single calculation took months, and he didn't use weak
hardware.
>
A fully pipelined evaluation of, let's say, four parallel exp
and four parallel fdiv instructions would have reduced his
calculation time by orders of magnitude, and allowed him to
explore the design space instead of just scratching the surface.
>
(By the way, if I had found a reasonable way to incorporate the
Arrhenius equation into your ISA, I would have done so already
:-)
>
FMUL Rt,RR,RT
FDIV Rt,-RE,Rt
EXP Rt,Rt
FMUL Rk,RA,Rt
>
Does not look "all that bad" to me.
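
For readers who do not parse the assembly at a glance, here is a minimal Python sketch of what the four-instruction sequence above appears to compute, one statement per instruction. The function name `arrhenius`, the argument names, and the constant `R_GAS` are illustrative assumptions and not part of the original post; the register mapping in the comments is my reading of the sequence.

```python
import math

# Molar gas constant in J/(mol*K); assumed here for illustration only.
R_GAS = 8.314462618


def arrhenius(a, e_a, t, r=R_GAS):
    """Evaluate k = A * exp(-E_a / (R * T)) in the same four steps."""
    rt = r * t         # FMUL Rt,RR,RT   : R * T
    x = -e_a / rt      # FDIV Rt,-RE,Rt  : -E_a / (R * T)
    ex = math.exp(x)   # EXP  Rt,Rt      : exp(-E_a / (R * T))
    return a * ex      # FMUL Rk,RA,Rt   : A * exp(...)
```

Each statement consumes the result of the previous one, which is the serial dependence the rest of the thread turns on.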
>
So for your GbOoO CPU, how many of the various FP operations, and
the EXP instruction can be done in parallel?
>
FMUL is 4 cycles of latency fully pipelined
FDIV is ~20 cycles of latency not pipelined
EXP is ~16 cycles of latency not pipelined
>
They are all performed in the FMAC unit and here the instructions are
serially dependent.
>
So, 44 cycles of latency, a 1-wide machine and a 6-wide machine would
see the same latency; that is, GBOoO is not a differentiator.
>
>
Good, I get that. But Thomas' original discussion of the problem
indicated that it was very parallel, so the question is, in your
design, how many of those calculations can go on in parallel?

The FDIV and EXP instructions consume all the FMAC cycles, so even if
you completely unrolled the loop, you are not going to shave more than
6 cycles off each of the repeated calculations.

A really BIG implementation with 4 FMAC units per core could unroll
the loop (by reservation stations) such that each iteration would
still take 44 cycles, but you could run 4 in parallel and achieve 4
results every 44 cycles, which to most people smells like 1 result
every 11 cycles.

{Would be an interesting reservation station design, though}
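
The cycle arithmetic in this exchange can be checked with a small back-of-the-envelope model. This is only a sketch using the approximate latencies quoted above (FDIV ~20, EXP ~16). The occupancy rule (a pipelined FMUL blocks the FMAC unit for 1 cycle, a non-pipelined FDIV or EXP blocks it for its full latency) is my assumption about where the "6 cycles" figure comes from, and all function names are hypothetical.

```python
# Approximate latencies quoted in the thread, in cycles.
FMUL_LAT, FDIV_LAT, EXP_LAT = 4, 20, 16


def serial_latency():
    """Latency of the dependent chain FMUL -> FDIV -> EXP -> FMUL."""
    return FMUL_LAT + FDIV_LAT + EXP_LAT + FMUL_LAT


def unit_occupancy():
    """Cycles one iteration keeps a single FMAC unit busy.

    Assumption: each pipelined FMUL occupies the unit for 1 issue cycle,
    while the non-pipelined FDIV and EXP occupy it for their full latency.
    """
    return 1 + FDIV_LAT + EXP_LAT + 1


def cycles_per_result(n_units):
    """Average cycles per result with n_units FMAC units, each running
    its own iteration at the full serial latency."""
    return serial_latency() / n_units


print(serial_latency())                     # chain latency
print(serial_latency() - unit_occupancy())  # best-case saving from unrolling
print(cycles_per_result(4))                 # four FMAC units
```

Under these assumptions the model gives 44 cycles for the chain, a saving of at most 6 cycles per iteration on a single FMAC unit, and 11 cycles per result with four units, matching the figures in the post.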
>
>
>