On Thu, 19 Sep 2024 15:07:11 +0000, EricP wrote:
Brett wrote:
EricP <ThatWouldBeTelling@thevillage.com> wrote:
They claim 5 cycles, should be six, five for the multiply and one more for
the second result, unless the next instruction does not need a write port,
and does not use the result. You can get a throughput of 5 cycles with
smart coding, but that rarely happens without effort.
That article is ignoring multiplier pipelining.
If the multiplier is pipelined with a latency of 5 and throughput of 1,
then MULL takes 5 cycles and MULL,MULH takes 6.
But those two multiplies still are tossing away 50% of their work.
And if it does fuse them then the internal uArch cost is the same as if
you had designed it optimally from the start, except now you have
to pay for a fuser.
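
The cycle counts quoted above can be checked with a little arithmetic. What follows is only a sketch under assumptions: the helper names (last_result_cycle, mull32, mulh32) are made up for illustration, the multiplies are taken to be independent and issued back to back, and 32-bit operands stand in for whatever width the real machine uses.

```c
#include <stdint.h>

/* Cycle at which the last of n independent multiplies delivers its result,
   when one multiply can be issued every issue_interval cycles and each
   takes latency cycles from issue to result. Hypothetical model. */
static unsigned last_result_cycle(unsigned latency, unsigned issue_interval,
                                  unsigned n)
{
    if (n == 0)
        return 0;
    return latency + (n - 1) * issue_interval;
}

/* MULL and MULH are the two halves of one double-width product.
   Computed separately, each one builds the full product and keeps half. */
static uint32_t mull32(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);          /* low half */
}

static uint32_t mulh32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);  /* high half */
}
```

With latency 5 and an issue interval of 1, one multiply finishes at cycle 5 and a MULL,MULH pair at cycle 6. With an unpipelined multiplier (issue interval 5) the pair would finish at cycle 10.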
You failed to recognize the critical part of my comment on this::
When the IMUL function unit sees MULL and MULH back to back AND
when both operands are the same for both instructions; it KNOWS
that the second multiply has the same result as the first and
thereby that the second multiply can be suppressed and the first
multiply used twice. {{In pure CMOS, if you drop the same operands
twice into the multiplier tree, the multiplier tree burns no power
in any event, just the operand delivery power.}}
You may call this fusion, but it is the very lowest level of it
and was not called such when first used.
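
A rough software model of that suppression, purely as a sketch: the type and function names (imul_unit, imul_issue, array_passes) are hypothetical, and the model compares operand values of the immediately preceding multiply, whereas real hardware would more likely compare register specifiers and opcodes at issue.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { OP_MULL, OP_MULH } mul_op;

typedef struct {
    bool     valid;         /* a previous multiply is latched */
    uint32_t a, b;          /* operands of that multiply */
    uint64_t product;       /* its full double-width product */
    unsigned array_passes;  /* times the multiplier array actually ran */
} imul_unit;

/* Issue one multiply. If the operands match the previous multiply,
   the second pass through the array is suppressed and the latched
   product is used again; only the requested half differs. */
static uint32_t imul_issue(imul_unit *u, mul_op op, uint32_t a, uint32_t b)
{
    if (!(u->valid && u->a == a && u->b == b)) {
        u->product = (uint64_t)a * b;
        u->a = a;
        u->b = b;
        u->valid = true;
        u->array_passes++;
    }
    return (op == OP_MULL) ? (uint32_t)u->product
                           : (uint32_t)(u->product >> 32);
}
```

Starting from a zero-initialized imul_unit, issuing MULL and then MULH on the same operands leaves array_passes at 1; a following multiply with different operands bumps it to 2.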
<sound of soap box being dragged out>
- register specifier fields are either source or dest, never both
I happen to be wishy-washy on this.