Subject: Re: Instruction Tracing
From: mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups: comp.arch
Date: 10 Aug 2024, 19:33:36
Organization: Rocksolid Light
Message-ID: <4982caec9dafb4d0dac0e86a85220e56@www.novabbs.org>
References: 1 2
User-Agent: Rocksolid Light
On Sat, 10 Aug 2024 10:18:02 +0000, Anton Ertl wrote:
> Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>> One thing these instruction traces would frequently report is that
>> integer multiply and divide instructions were not so common, and so
>> could be omitted and emulated in software, with minimal impact on
>> overall performance. We saw this design decision taken in the early
>> versions of Sun’s SPARC, for example, and also in IBM’s ROMP as used
>> in the RT PC.
>
> Alpha and IA-64 have no integer division. IIRC IA-64 has no FP
> division.
"Stupid is as stupid does." (Forrest Gump)
{and applicable, too}
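To make Lawrence's point about emulating multiply and divide in software concrete, here is a minimal Python sketch. It is not code from SPARC, ROMP, Alpha, or IA-64; the names `soft_mul` and `soft_divmod` are hypothetical, and it assumes unsigned operands of a fixed `width` (32 bits by default). It shows the basic shift-and-add multiply and restoring (shift-and-subtract) division loops. Each loop runs for up to `width` iterations, which is why the approach only costs little while these operations stay rare.

```python
def soft_mul(a: int, b: int, width: int = 32) -> int:
    """Shift-and-add multiply of unsigned `width`-bit integers (result truncated to `width` bits)."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    product = 0
    while b:
        if b & 1:                      # add the shifted multiplicand for each 1 bit of b
            product = (product + a) & mask
        a = (a << 1) & mask            # next partial product
        b >>= 1
    return product


def soft_divmod(n: int, d: int, width: int = 32) -> tuple[int, int]:
    """Restoring division of unsigned `width`-bit integers; returns (quotient, remainder)."""
    if d == 0:
        raise ZeroDivisionError("division by zero")
    mask = (1 << width) - 1
    n &= mask
    d &= mask
    quotient = 0
    remainder = 0
    for i in range(width - 1, -1, -1):
        remainder = (remainder << 1) | ((n >> i) & 1)  # bring down the next dividend bit
        if remainder >= d:                             # trial subtraction succeeds
            remainder -= d
            quotient |= 1 << i
    return quotient, remainder
```

For example, `soft_mul(6, 7)` returns `42` and `soft_divmod(100, 7)` returns `(14, 2)`.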
> One interesting aspect of RISC-V is that they put multiplication and
> division in the same extension (which is included in RV64G, i.e., the
> General version of RISC-V).
>
>> Later, it seems, the CPU designers realized that instruction traces
>> were not the final word on performance measurements, and started to
>> include hardware integer multiply and divide instructions.
>
> When you invest more hardware to increase performance per cycle, at
> some point the best return on investment is to have multiplication and
> division instructions. What is interesting is that the multipliers
> were then soon fully pipelined.
The MUL unit of the Mc88100 was fully pipelined (1985). Integer
multiply was 3 cycles, single precision 4 cycles, and double precision
7 cycles, IIRC.
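What full pipelining buys can be put in numbers. This is a minimal sketch, assuming a stream of independent multiplies with no other stalls and borrowing the 3-cycle integer multiply latency reported above for the Mc88100; the function names are illustrative only. A non-pipelined unit must finish one multiply before starting the next, while a fully pipelined unit accepts a new multiply every cycle, so the last result appears `latency` cycles after the last issue.

```python
def unpipelined_cycles(n_ops: int, latency: int) -> int:
    """Cycles for n independent ops on a unit that accepts a new op only when the previous one is done."""
    return n_ops * latency


def pipelined_cycles(n_ops: int, latency: int) -> int:
    """Cycles for n independent ops on a fully pipelined unit (one issue per cycle)."""
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1)   # last op issues at cycle n-1 and completes `latency` later
```

With 100 independent integer multiplies at a 3-cycle latency, `unpipelined_cycles(100, 3)` gives 300 cycles and `pipelined_cycles(100, 3)` gives 102, so throughput approaches one multiply per cycle.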
> Or, as Mitch Alsup reports, in cases where that was cheaper, have two
> half-pipelined multipliers.
When the multiplier tree delay is greater than one cycle, it becomes
cheaper to have 2×½ multipliers (two half-pipelined multipliers with no
pipeline stage inside) than to have one multiplier with 4096
flip-flops in the middle. Here "cheaper" means smaller and consuming
less power.
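Why two half-pipelined multipliers can stand in for one fully pipelined multiplier can be checked with a small issue-schedule model. This is a sketch under stated assumptions, not a model of any real design: at most one operation issues per cycle, operations are independent, each unit accepts a new operation every `initiation_interval` cycles, and the 2-cycle latency used in the example is an illustrative value. The function name `completion_cycle` is hypothetical.

```python
def completion_cycle(n_ops: int, latency: int, units: int, initiation_interval: int) -> int:
    """Cycle at which the last of `n_ops` independent operations produces its result.

    Assumptions: a single issue port (at most one op per cycle), and each unit can
    accept a new op every `initiation_interval` cycles (1 = fully pipelined,
    2 = half-pipelined, i.e. no pipeline register inside the unit).
    """
    next_free = [0] * units      # earliest cycle at which each unit can accept an op
    cycle = 0                    # earliest cycle the next op may issue
    last_done = 0
    for _ in range(n_ops):
        u = min(range(units), key=lambda i: next_free[i])  # unit that frees up first
        cycle = max(cycle, next_free[u])                   # wait for that unit if busy
        next_free[u] = cycle + initiation_interval
        last_done = cycle + latency
        cycle += 1               # single issue: next op no earlier than the next cycle
    return last_done
```

For ten independent multiplies with a 2-cycle latency, one fully pipelined unit (`units=1, initiation_interval=1`) and two half-pipelined units (`units=2, initiation_interval=2`) both finish at cycle 11, because alternating between the two units still issues one multiply per cycle. A single half-pipelined unit needs until cycle 20.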
> Apparently there are enough applications that require a huge number of
> multiplications; my guess is that the NSA won't tell us what they are.
AES is greatly sped up by carry-less multiplication, and all one has to
do to get it is deactivate the majority gate in the CAS cell (which adds
no gate delay or area).
>
> - anton
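Mitch's remark about the majority gate follows from how an adder cell works: a full adder's sum output is the XOR (parity) of its three inputs, and its carry output is their majority. If the carry is forced to zero, summing the shifted partial products degenerates into XOR-ing them, which is carry-less (GF(2) polynomial) multiplication. That operation is used, for example, in the GHASH step of AES-GCM. The sketch below is a simplified bit-level model. It assumes one ripple-carry row of full adders per partial product rather than the carry-save tree a fast multiplier would typically use. `carry_enabled` is a hypothetical switch standing in for the gate Mitch describes.

```python
def full_adder(a: int, b: int, c: int, carry_enabled: bool) -> tuple[int, int]:
    """One adder cell: sum = 3-input XOR (parity), carry = 3-input majority.

    With carry_enabled=False the majority output is forced to 0 (hypothetical switch).
    """
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) if carry_enabled else 0
    return s, carry


def add_rows(x: int, y: int, width: int, carry_enabled: bool) -> int:
    """Add two `width`-bit rows with a ripple chain of full-adder cells."""
    result = 0
    carry = 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry, carry_enabled)
        result |= s << i
    return result


def array_multiply(a: int, b: int, n: int, carry_enabled: bool = True) -> int:
    """Multiply two n-bit operands by accumulating shifted partial products (2n-bit result)."""
    width = 2 * n
    acc = 0
    for i in range(n):
        if (b >> i) & 1:                        # partial product a << i is present
            acc = add_rows(acc, a << i, width, carry_enabled)
    return acc
```

With the carry enabled, `array_multiply(3, 3, 4)` returns the ordinary product 9. With `carry_enabled=False`, it returns 5, i.e. `0b11 XOR 0b110 = 0b101`, the carry-less product. The same cells compute both results, and only the majority gate's output differs.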