Subject : Re: Microarch Club
From : mitchalsup (at) *nospam* aol.com (MitchAlsup1)
Newsgroups : comp.arch
Date : 28. Mar 2024, 00:11:34
Organization : Rocksolid Light
Message-ID : <4f63a339527a85e67bcd85c6f5388bfa@www.novabbs.org>
User-Agent : Rocksolid Light
Scott Lurndal wrote:
mitchalsup@aol.com (MitchAlsup1) writes:
BGB wrote:
>
On 3/26/2024 5:27 PM, Michael S wrote:
For slightly less than 20 years ARM managed OK without integer divide.
Then in 2004 they added an integer divide instruction in ARMv7 (including
ARMv7-M variant intended for small microcontroller cores like
Cortex-M3) and for the following 20 years instead of merely OK they are
doing great :-)
>
OK.
>
The point is they are doing better now after adding IDIV and FDIV.
>
I think both modern ARM and AMD Zen went over to "actually fast" integer divide.
>
I think for a long time, the de-facto integer divide was ~ 36-40 cycles for 32-bit, and 68-72 cycles for 64-bit. This is also on par with what I can get from a shift-add unit.
>
Those numbers are acceptable for shift-subtract division (including
SRT variants).
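For reference, a minimal sketch of plain restoring shift-subtract division, which retires one quotient bit per iteration. That is why a 32-bit divide lands in the 30-something cycle range once a little setup is added. The function name and the returned iteration count are illustrative only, and this is not a model of any particular core (SRT variants retire more than one bit per step).

```python
def shift_subtract_divide(numerator, denominator, width=32):
    """Restoring division sketch: one quotient bit per loop iteration.

    Assumes unsigned operands that fit in `width` bits.
    Returns (quotient, remainder, iterations).
    """
    mask = (1 << width) - 1
    numerator &= mask
    denominator &= mask
    if denominator == 0:
        raise ZeroDivisionError("division by zero")

    remainder = 0
    quotient = 0
    iterations = 0
    for bit in range(width - 1, -1, -1):
        # Shift the next numerator bit into the partial remainder.
        remainder = (remainder << 1) | ((numerator >> bit) & 1)
        quotient <<= 1
        # Trial subtraction: keep it only if the result stays non-negative.
        if remainder >= denominator:
            remainder -= denominator
            quotient |= 1
        iterations += 1
    return quotient, remainder, iterations


if __name__ == "__main__":
    print(shift_subtract_divide(100, 7))
```

The loop always runs `width` times regardless of operand values, which is the fixed cost that early-out and multiplier-based schemes try to avoid.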
>
What I don't get is the reluctance to use the FP multiplier as a fast
divider (as in the IBM 360/91). AMD Opteron used this means to achieve 17-cycle
FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20 cycles,
and with special casing of leading 1s and 0s average around 10 cycles?
Empirically, the ARM Cortex-M7 udiv instruction requires 3+[s/2] cycles
(where s is the number of significant digits in the quotient).
I submit that 5+2×ln8(s) cycles is faster still.
32-bits = 15 cycles <not so much faster>
64-bits = 17 cycles <A lot faster>
{Log base 8, where one uses Newton-Raphson or Goldschmidt to get 8 significant
digits (9.2 bits are correct) and double the significant bits each iteration (2-cycles). }
5 comes from looking at numerator and denominator to find the first bit of
significance, and then shifting numerator and denominator so that the FDIV
algorithm can work.
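The normalize-then-iterate idea can be sketched in software roughly as follows. This is only an illustration under stated assumptions, not the actual datapath of any core: the name `nr_divide`, the `seed_bits` parameter, and the fixed-point width `frac` are hypothetical, the seed is computed with a small division that stands in for a hardware lookup table, and a final correction step is added so the integer result is exact.

```python
def nr_divide(n, d, width=32, seed_bits=8):
    """Unsigned integer divide via a Newton-Raphson reciprocal (sketch).

    Assumes 0 <= n < 2**width and 0 < d < 2**width.
    Returns (quotient, remainder, iterations).
    """
    if d == 0:
        raise ZeroDivisionError("division by zero")
    if not (0 <= n < (1 << width) and 0 < d < (1 << width)):
        raise ValueError("operands must be unsigned and fit in `width` bits")

    frac = 2 * width                      # fractional bits (assumption)
    one = 1 << frac

    # Find the first significant bit and normalize d into [0.5, 1).
    shift = d.bit_length()
    d_fx = d << (frac - shift)

    # Seed reciprocal from the top `seed_bits` bits of the divisor.
    # The division here stands in for a small lookup table.
    top = d_fx >> (frac - seed_bits)
    x = (1 << (frac + seed_bits)) // top

    # Each Newton-Raphson step x <- x * (2 - d * x) roughly doubles
    # the number of correct bits in the reciprocal estimate.
    good_bits = seed_bits - 1
    iterations = 0
    while good_bits < width:
        t = (d_fx * x) >> frac
        x = (x * (2 * one - t)) >> frac
        good_bits *= 2
        iterations += 1

    # Multiply by the reciprocal, undo the normalization, then fix up
    # any small error so the result is the exact integer quotient.
    q = (n * x) >> (frac + shift)
    r = n - q * d
    while r >= d:
        q += 1
        r -= d
    while r < 0:
        q -= 1
        r += d
    return q, r, iterations


if __name__ == "__main__":
    print(nr_divide(100, 7))
```

With these particular assumptions the seed is good to about 7 bits, so the loop runs 3 times for a 32-bit divide and 4 times for a 64-bit one. The iteration count grows with the logarithm of the operand width rather than linearly, which is the whole attraction of reusing the multiplier.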
https://www.quinapalus.com/cm7cycles.html
>
I submit that at 10 cycles of average latency, the need to invent screwy
forms of even faster division falls by the wayside {accurate or not}.