MitchAlsup1 wrote:
On Wed, 18 Sep 2024 21:15:55 +0000, Brett wrote:
EricP <ThatWouldBeTelling@thevillage.com> wrote:
Terje Mathisen wrote:
EricP wrote:
I always assumed that MULH just grabbed the part that would have been
thrown away. And that is how at least one RISC-V core does it:
https://www.digikey.com/en/blog/how-the-risc-v-multiply-extension-adds-an-efficient-32-bit
They claim 5 cycles; it should be six: five for the multiply and one more
for the second result, unless the next instruction does not need a write
port and does not use the result. You can get a throughput of 5 cycles
with smart coding, but that rarely happens without effort.
It is easy enough in the decoder to recognize a MUL followed by MULH
(and vice versa) as using the multiplier tree once and delivering 2
results. So the first result is 6 cycles, the second result on the 6th
cycle. {you ALMOST have to do this to avoid large wastes in power.}
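
To make the "one multiplier tree, two results" point concrete, here is a minimal C sketch of what a fused MUL + MULH pair computes. It assumes 32-bit operands (RV32-style) so the full product fits in a standard 64-bit integer; the names mul_pair32 and mul_mulh32 are made up for illustration and are not from any core or toolchain.

```c
#include <stdint.h>

/* Hypothetical result pair: what a fused MUL + MULH would write back. */
typedef struct {
    uint32_t lo;  /* what MUL returns: low 32 bits of the product */
    int32_t  hi;  /* what MULH returns: high 32 bits of the signed product */
} mul_pair32;

/* One widening multiply, two results. */
static mul_pair32 mul_mulh32(int32_t a, int32_t b)
{
    /* A 32x32 signed product always fits in 64 bits, so no overflow. */
    int64_t p = (int64_t)a * (int64_t)b;
    uint64_t bits = (uint64_t)p;  /* same bit pattern, safe to shift */

    mul_pair32 r;
    r.lo = (uint32_t)bits;        /* the half MUL keeps */
    /* The half MUL would throw away. The final cast assumes the usual
       two's-complement wraparound on conversion to int32_t. */
    r.hi = (int32_t)(uint32_t)(bits >> 32);
    return r;
}
```

Calling mul_mulh32(0x40000000, 4) gives lo = 0 and hi = 1, since the product is exactly 2^32. A decoder that fuses the pair is in effect computing p once and routing each half to its own destination.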
Yes, but then you *require* a macro-op fuser to function efficiently.
Probably... assuming it works.
OR one can give up the cherished 1-dest, 2-source self-imposed ISA design
limitation and have a 32-bit instruction with four 5-bit register fields,
2 source, 2 dest, leaving 12 bits for opcode and function code.
Then you know it will calculate the multiply once, and it can write back
both results in 1 clock if it has two write ports (which it needs
anyway if it wants any hope of catching up after a stall bubble).
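
The bit budget above (4 x 5 register bits + 12 opcode/function bits = 32) can be sketched as a pack/unpack pair in C. The field order and all names here (mul2_fields, mul2_encode, mul2_decode) are assumptions for illustration only; the post does not specify a layout, and this is not an existing ISA encoding.

```c
#include <stdint.h>

/* Hypothetical 2-source, 2-dest instruction format.
   Assumed layout: [31:20] op12, [19:15] rs2, [14:10] rs1,
                   [9:5] rd_hi, [4:0] rd_lo. */
typedef struct {
    unsigned op12;   /* 12 bits of opcode + function code */
    unsigned rd_lo;  /* destination for the low half  (5 bits) */
    unsigned rd_hi;  /* destination for the high half (5 bits) */
    unsigned rs1;    /* first source register  (5 bits) */
    unsigned rs2;    /* second source register (5 bits) */
} mul2_fields;

/* Pack the five fields into one 32-bit word; out-of-range bits are masked. */
static uint32_t mul2_encode(mul2_fields f)
{
    return ((uint32_t)(f.op12  & 0xFFFu) << 20)
         | ((uint32_t)(f.rs2   & 0x1Fu)  << 15)
         | ((uint32_t)(f.rs1   & 0x1Fu)  << 10)
         | ((uint32_t)(f.rd_hi & 0x1Fu)  << 5)
         |  (uint32_t)(f.rd_lo & 0x1Fu);
}

/* Unpack a 32-bit word back into its fields. */
static mul2_fields mul2_decode(uint32_t insn)
{
    mul2_fields f;
    f.op12  = (insn >> 20) & 0xFFFu;
    f.rs2   = (insn >> 15) & 0x1Fu;
    f.rs1   = (insn >> 10) & 0x1Fu;
    f.rd_hi = (insn >> 5)  & 0x1Fu;
    f.rd_lo =  insn        & 0x1Fu;
    return f;
}
```

With every field at its maximum the word is all ones, which confirms the fields tile the 32 bits exactly with nothing left over.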
Also in the case of Alpha they only had unsigned MUL, MULH and
for signed multiply it had to use branchy code (pre-CMOV) to
do the signed correction subtracts, so fusion would be too complex.
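
For reference, the "signed correction subtracts" look roughly like this in C. This is a sketch scaled down to 32-bit operands so it can be checked against a plain 64-bit multiply; it is not Alpha code, and umulh32, mulh32_from_umulh and mulh32_ref are hypothetical names. The identity used: reinterpreting a negative a as unsigned adds 2^32 to it, which adds b to the high half of the product, so that amount is subtracted back out (and likewise for a negative b).

```c
#include <stdint.h>

/* Stand-in for an unsigned-only MULH: high 32 bits of the unsigned product. */
static uint32_t umulh32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Signed high half built from the unsigned one plus two conditional
   subtracts. All arithmetic on hi is modulo 2^32. */
static int32_t mulh32_from_umulh(int32_t a, int32_t b)
{
    uint32_t ua = (uint32_t)a;
    uint32_t ub = (uint32_t)b;
    uint32_t hi = umulh32(ua, ub);

    if (a < 0) hi -= ub;  /* undo the extra 2^32 * b */
    if (b < 0) hi -= ua;  /* undo the extra 2^32 * a */

    /* Assumes the usual two's-complement wraparound on this conversion. */
    return (int32_t)hi;
}

/* Reference: take the high half of a true signed widening multiply. */
static int32_t mulh32_ref(int32_t a, int32_t b)
{
    int64_t p = (int64_t)a * (int64_t)b;
    return (int32_t)(uint32_t)((uint64_t)p >> 32);
}
```

The two if statements are the branches in question; with a conditional move or a sign mask they become branch-free, but either way it is several dependent instructions after the multiply.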
That design decision is as baffling as HP-PA originally leaving
a MUL instruction out entirely because "it violated the 1-clock per
instruction design philosophy". (HP quickly fixed it, but still...)
The messages shown come from Usenet.