BGB wrote:

On 5/19/2024 4:16 PM, MitchAlsup1 wrote:
> BGB wrote:
>> On 5/19/2024 11:37 AM, Terje Mathisen wrote:
>>> Thomas Koenig wrote:
>>>> So, I did some more measurements on the POWER9 machine, and it
>>>> came to around 18 cycles per FMA. Compared to the 13 cycles for
>>>> the FMA instruction, this actually sounds reasonable.
>>>>
>>>> The big problem appears to be that, in this particular
>>>> implementation, multiplication is not pipelined, but done
>>>> piecewise by addition. This can be explained by the fact that
>>>> this is mostly a decimal unit, with the 128-bit QP just added as
>>>> an afterthought, and decimal multiplication does not happen all
>>>> that often.
>>>>
>>>> A fully pipelined FMA unit capable of 128-bit arithmetic would be
>>>> an entirely different beast; I would expect a throughput of 1 per
>>>> cycle and a latency of (maybe) one cycle more than 64-bit FMA.
>>>
>>> The FMA normalizer has to handle a maximally bad cancellation, so
>>> it needs to be around 350 bits wide. Mitch knows of course, but
>>> I'm guessing that this could at least be close to needing an extra
>>> cycle on its own and/or heroic hardware?
>>
>> This sort of thing is part of what makes proper FMA hopelessly
>> expensive.
>
> Getting the LoB correctly rounded showed up the generation prior to
> FMAC showing up.

>> Well, in this case, I have neither in a proper sense. FMAC
>> operators were sorta faked, but mostly exist because they were
>> needed for RV64G, but double-rounded (and not able to expose
>> anything that exists below the ULP, unlike proper FMA).
>
> But FMAC can expose the bits below LoB.

OK. Granted, but exposing these bits is for proper FMA, and not really
what one gets if just gluing together a multiplier and an adder (and
effectively doubling the latency).
>> Granted, full FMA also allows faking higher precision using SIMD
>> vector operations, with math that does not work with double-rounded
>> FMA instructions.
>
> It also enables error-free floating-point calculations, but no
> existing FP implementation allows exact FP calculations that do not
> ALSO SET the inexact flag !?!? {Whereas My 66000 gets this right}
>> Dunno. It seems like the existence of anything below the ULP
>> justifies setting the inexact flag...
>
> You misunderstand !!
>
> When one computes on 2 operands that are single wide, and can
> deliver a single result twice as wide or a pair of results each
> single wide, you are delivering all the bits, so there is no
> inexact. However, if you use more than 1 instruction to perform the
> calculation, then you HAVE to set an inexact bit, even though the
> delivery of the second result makes the first setting of the inexact
> bit in error !!
>
> My ISA is expressive enough to do this, just like IEEE 754-2019
> requires for augmented addition and augmented subtraction.
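(To illustrate the pair-of-results case: the classic 2Sum sequence,
sketched in C with a hypothetical two_sum helper. The pair (hi, lo)
carries all the bits of a + b exactly, yet each individual operation
below can raise inexact -- which is exactly the complaint about doing
it in more than 1 instruction.)

    #include <stdio.h>

    /* 2Sum: hi is the rounded sum, lo is the exact rounding error,
       so hi + lo == a + b with no information lost. */
    static void two_sum(double a, double b, double *hi, double *lo)
    {
        double s  = a + b;          /* may raise inexact */
        double bb = s - a;          /* part of s contributed by b */
        double aa = s - bb;         /* part of s contributed by a */
        *hi = s;
        *lo = (a - aa) + (b - bb);  /* exact error term */
    }

    int main(void)
    {
        double hi, lo;
        two_sum(1.0, 0x1p-60, &hi, &lo);
        printf("hi=%g lo=%g\n", hi, lo);  /* hi=1, lo=2^-60 */
        return 0;
    }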
>> Well, and also an issue if one can "just barely" afford to have a
>> single double-precision unit. I have seen some amount of low-end
>> chips which make this choice.
>
> This is NOT an architectural issue, but an implementation choice
> issue.

>> Absent things like microcode or traps, architectural and
>> implementation choices are closely tied together. Can't have
>> instructions for things which one can't afford the hardware cost
>> to implement.
>
> I understand your limitations--the problem I have is that you
> express your limitations AS-IF others should make the same choices
> you had to make. And that is patently FALSE !!
>
> Defending an indefensible position under the illusion that "That's
> all I got to work with" is an insufficient defense against someone
> who has more.

There have been apparently more things killed off by slow performance
than by lack of FPU accuracy.

>> Well, and the usefulness of an FPU is dependent on performance. An
>> inaccurate FPU can still be useful, but a slow FPU is not.
>
> Kahan has several lectures about this....
I am not claiming any of this "doesn't kinda suck", so what is the
issue?...

>> Though, the trick of possibly having four 27-bit multiplies which
>> combine into a virtual 54-bit multiplier seems like an interesting
>> possibility, though not great, as DSPs don't natively handle this
>> size (and it would be too expensive to stretch it out with LUTs).
>> Likely, one would need to build it from 34*34->68 bit multipliers
>> (each costing 4 DSPs).
>
> This is your implementation choice coloring what you take as
> architectural decisions.

>> In terms of DSP cost, it would be higher than the current solution:
>> 16 vs 6+4 (10). But, possibly lower LUT cost (in both the Binary32
>> and Binary64 multipliers, the shortfall is made up using smaller
>> LUT-based multipliers).
>
> We can now fit (5nm) hundreds of GBOoO cores on a single die. The
> difference between a 53×53 tree and a 64×64 tree (which makes all
> the problems vanish) is not visible at this level (100+ cores on a
> die).
>
> This is your implementation choice coloring your thoughts.
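(As an aside, the splitting itself is easy to model; a rough C model
of the four-partial-product scheme, not the actual Verilog. The mul54
name and the use of unsigned __int128 -- a GCC/Clang extension, here
only to hold the 108-bit result -- are for illustration.)

    #include <stdint.h>
    #include <stdio.h>

    typedef unsigned __int128 u128;

    /* 54x54 -> 108-bit product from four 27x27 -> 54-bit partial
       products: the "virtual 54-bit multiplier" composition. */
    static u128 mul54(uint64_t a, uint64_t b)
    {
        uint64_t al = a & ((1ULL << 27) - 1), ah = a >> 27;
        uint64_t bl = b & ((1ULL << 27) - 1), bh = b >> 27;

        u128 p0 = (u128)(al * bl);        /* 27x27, no overflow */
        u128 p1 = (u128)(al * bh) << 27;
        u128 p2 = (u128)(ah * bl) << 27;
        u128 p3 = (u128)(ah * bh) << 54;

        return p0 + p1 + p2 + p3;
    }

    int main(void)
    {
        u128 p = mul54((1ULL << 53) | 1, (1ULL << 53) | 3);
        printf("%016llx%016llx\n",
               (unsigned long long)(p >> 64), (unsigned long long)p);
        return 0;
    }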
>> I can afford FPGAs...
>> I can't afford to get an ASIC made.
>
> I am not asking you to spend big money--I am merely asking you to
> quit defending "doing the wrong thing" when others have to follow
> standards. {{If you properly caveated all your defense statements--I
> would not complain.}}
So, implementation choices here are: FPGA; Nothing.

> I have been wondering for a while--are the DSP things you build your
> multiplier out of synthesized by Verilog compilation, or hard coded
> into the gates themselves ?? Because if they are synthesized, you
> could create Verilog that builds the multiplier tree of whatever
> size you need without all the DSP overhead.

They are hard-logic in the FPGAs (similar to the Block-RAM).
> What kind of car do you drive ?? I was going to ask if your car had
> hand rolled windows, a manual transmission, ... In the early 1980s
> all of us were similarly constrained; computer architecture grew out
> of the fast-and-dirty modus operandi and into the follow-standards
> modus operandi.

I don't drive a car...
I tend to fairly rapidly get tired out if trying to drive.

Apparently, more modern cars have mostly gone over to replacing all of
the manual controls with a touchscreen.