> Brett wrote:
> EricP <ThatWouldBeTelling@thevillage.com> wrote:
>
> They claim 5 cycles, should be six, five for the multiply and one more for
> the second result, unless the next instruction does not need a write port,
> and does not use the result. You can get a throughput of 5 cycles with
> smart coding, but that rarely happens without effort.
>
> That article is ignoring multiplier pipelining.
> If the multiplier is pipelined with a latency of 5 and throughput of 1,
> then MULL takes 5 cycles and MULL,MULH takes 6.
>
> But those two multiplies still are tossing away 50% of their work.
> And if it does fuse them then the internal uArch cost is the same as if
> you had designed it optimally from the start, except now you have
> to pay for a fuser.

You failed to recognize the critical part of my comment on this::

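For anyone checking the quoted figures (5 cycles for MULL, 6 for MULL,MULH), here is a minimal sketch of the arithmetic. It assumes a fully pipelined multiplier and two independent multiplies issued back to back. The function name `cycles_until_last_result` and its parameters are made up for illustration and do not come from the article or any ISA manual.

```python
def cycles_until_last_result(n_ops, latency=5, issue_interval=1):
    """Cycles from issuing the first op until the last of n_ops
    independent ops has produced its result.

    latency        -- cycles from issue to result for one op (assumed 5)
    issue_interval -- cycles between successive issues; 1 models a fully
                      pipelined unit, `latency` models an unpipelined one
    """
    if n_ops < 1:
        raise ValueError("need at least one operation")
    # The first op finishes after `latency` cycles; each further op
    # trails the previous one by `issue_interval` cycles.
    return latency + (n_ops - 1) * issue_interval


mull_only = cycles_until_last_result(1)                        # MULL alone
mull_mulh = cycles_until_last_result(2)                        # MULL then MULH, pipelined
unpipelined = cycles_until_last_result(2, issue_interval=5)    # no overlap at all

print(mull_only, mull_mulh, unpipelined)
```

Under these assumptions the pipelined pair costs one cycle more than a single multiply, which matches the "5 versus 6" figures quoted above. The unpipelined case is included only for contrast.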
> <sound of soap box being dragged out>
> This idea that macro-op fusion is some magic solution is bullshit.

Agreed

> 1) It's not free.

Far from it.

> 2) It only works where Decode can see *all* the required lookahead
>    instructions, which means you have to pay for an N-lane decoder
>    but only get 1 lane.

I think it is but a crutch for a misdesigned ISA

> 3) It's probabilistic as it depends on how the fetch buffers get loaded.
>    E.g. if the fetch buffer contains a valid instruction but does not have
>    a next instruction, do you stall Decode to see if a fuser might arrive
>    or dispatch it anyway?

It can be worse than that

> 4) It gets exponentially expensive if you start doing multiple instruction
>    lanes because decode has to deal with all the permutations of
>    fusion possibilities.

All the more reason to have a better ISA

> 5) Any fused instructions leave (multiple) bubbles that should be
>    compacted out or there wasn't much point to doing the fusion.

One of the interesting things I have noticed with my ISA is that

> In my opinion it is better to have an ISA that is optimal by design
> rather than being patched up by fusion later.

Indeed.

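Points 2 and 3 of the quoted list can be illustrated with a toy model. This is only a sketch, on the assumption that Decode can fuse a pair only when both instructions sit in the same fetch group. The opcode pair in `FUSIBLE_PAIRS`, the function names, and the group sizes are hypothetical and do not describe any real decoder.

```python
# Hypothetical fusible pair (first opcode, second opcode); illustrative only.
FUSIBLE_PAIRS = {("mull", "mulh")}


def split_into_fetch_groups(instrs, group_size, first_group_size=None):
    """Cut an instruction list into fetch groups.

    first_group_size models misalignment: the first group may hold fewer
    instructions than group_size (e.g. after a branch into mid-line).
    """
    first = group_size if first_group_size is None else first_group_size
    groups = [instrs[:first]] if first > 0 else []
    groups += [instrs[i:i + group_size]
               for i in range(first, len(instrs), group_size)]
    return groups


def decode_with_fusion(groups):
    """Decode each group on its own, fusing a pair only if both halves
    are visible in the same group (no stalling to wait for a partner)."""
    uops = []
    for group in groups:
        i = 0
        while i < len(group):
            pair = tuple(group[i:i + 2])
            if pair in FUSIBLE_PAIRS:
                uops.append("+".join(pair))  # one fused micro-op
                i += 2
            else:
                uops.append(group[i])        # dispatched unfused
                i += 1
    return uops


program = ["add", "mull", "mulh", "sub"]

# Same program, two different fetch alignments.
aligned = decode_with_fusion(split_into_fetch_groups(program, 4))
split = decode_with_fusion(
    split_into_fetch_groups(program, 4, first_group_size=2))

print(aligned)
print(split)
```

In this model the same four instructions fuse when they arrive in one group and do not fuse when the group boundary falls between the pair. That is the alignment dependence point 3 describes. A decoder that stalls to wait for a possible partner would trade that missed fusion for lost decode cycles instead.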
> Some of this inefficiency is caused by clinging to now 40-year-old
> RISC design *guidelines* (i.e. not even rules) that:
> - instructions have at most 1 dest and 2 source registers

Makes FMAC hard

> - register specifier fields are either source or dest, never both

I happen to be wishy-washy on this

> - instructions should take at most 1 clock (they never did)

This never worked for floating point anyway...and many consider

> These self-imposed design restrictions cause ISA designers to miss
> some possible more optimal solutions. The result is things like
> RISC-V's memory reference linkage structures taking 6 instructions
> to build a 64-bit PC-relative address. And I'm pretty sure we won't
> see any 6-instruction fusers for quite some time.

And it is just "so unnecessary".

>
> <sound of soap box being dragged back to cupboard>