Newsportal USENET - Re: Microarch Club

BGB wrote:

On 3/26/2024 5:27 PM, Michael S wrote:
For slightly less than 20 years ARM managed OK without integer divide.
Then in 2004 they added an integer divide instruction in ARMv7 (including
the ARMv7-M variant intended for small microcontroller cores like
Cortex-M3) and for the following 20 years instead of merely OK they are
doing great :-)

OK.

The point is they are doing better now after adding IDIV and FDIV.

I think both modern ARM and AMD Zen went over to "actually fast" integer divide.

I think for a long time, the de-facto integer divide was ~ 36-40 cycles for 32-bit, and 68-72 cycles for 64-bit. This is also on-par with what I can get from a shift-add unit.
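As a rough illustration of where cycle counts like these come from, here is a minimal sketch of a restoring shift-subtract divider that retires one quotient bit per iteration. The function name and the `steps` counter are hypothetical, added only for illustration; real hardware would add a few cycles of setup and sign fix-up on top of the one-iteration-per-bit loop, which is presumably how 32 and 64 bits end up as roughly 36 and 68 cycles.

```python
def shift_subtract_divide(dividend: int, divisor: int, width: int = 32):
    """Unsigned restoring division, one quotient bit per iteration.

    Returns (quotient, remainder, steps), where steps counts loop
    iterations and stands in for the per-bit cycles of the hardware.
    """
    mask = (1 << width) - 1
    dividend &= mask
    divisor &= mask
    if divisor == 0:
        raise ZeroDivisionError("division by zero")

    remainder = 0
    quotient = 0
    steps = 0
    for bit in range(width - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Trial subtract: commit only if the remainder stays non-negative.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
        steps += 1
    return quotient, remainder, steps


q, r, steps = shift_subtract_divide(100, 7)  # (14, 2, 32)
```

The iteration count depends only on the operand width, not on the operand values, which is why this style of divider has a flat latency unless leading zeros are special-cased.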

While those numbers are acceptable for shift-subtract division (including
SRT variants), what I don't get is the reluctance to use the FP multiplier
as a fast divider (IBM 360/91). AMD Opteron used this means to achieve
17-cycle FDIV and 22-cycle SQRT in 1998. Why should IDIV not be under 20 cycles ??
and with special casing of leading 1s and 0s average around 10 cycles ???
I submit that at 10 cycles for average latency, the need to invent screwy
forms of even faster division falls by the wayside {accurate or not}.
NOTE well: The size of the FMUL (or FMAC) unit does increase, but its
increase is less than that of an SRT divider unit.

On my BJX2 core, it is currently similar (36 and 68 cycle for divide).
This works out faster than a generic shift-subtract divider (or using a runtime call which then sorts out what to do).

This is because you are using a linear iteration; try using a quadratically
convergent iteration instead. OH but you CAN'T because your multiplier
tree does not give accurate lower order bits.
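To make the linear versus quadratic distinction concrete, here is a minimal sketch of one quadratically convergent scheme: Newton-Raphson refinement of the reciprocal, followed by a multiply and a small correction step. This is an illustration under assumptions, not a description of what any particular chip does. The names `newton_divide` and `FRAC`, the 48-bit fixed-point format, and the linear initial estimate 48/17 - 32/17 * D are all choices made for this example. It also leans on Python's exact integers, so it sidesteps the low-order-bit accuracy problem raised above.

```python
FRAC = 48  # fractional bits of the fixed-point format (an assumption)


def newton_divide(n: int, d: int, iterations: int = 4):
    """Unsigned 32-bit divide via Newton-Raphson reciprocal refinement.

    Returns (quotient, remainder). Each iteration roughly doubles the
    number of correct bits in the reciprocal estimate.
    """
    if not (0 <= n < 1 << 32 and 0 <= d < 1 << 32):
        raise ValueError("operands must be unsigned 32-bit values")
    if d == 0:
        raise ZeroDivisionError("division by zero")

    # Normalize the divisor so that D = dn / 2**32 lies in [0.5, 1).
    shift = 32 - d.bit_length()
    dn = d << shift
    d_fx = dn << (FRAC - 32)  # D scaled by 2**FRAC

    # Linear initial estimate of 1/D, good to about 4 bits.
    x = (48 << FRAC) // 17 - (32 * d_fx) // 17

    # X <- X * (2 - D * X); the relative error squares on every pass.
    for _ in range(iterations):
        t = (d_fx * x) >> FRAC
        x = (x * ((2 << FRAC) - t)) >> FRAC

    # n / d = n * 2**shift / dn = n * X * 2**shift / 2**32
    q = (n * x) >> (FRAC + 32 - shift)

    # Correction: truncation can leave the estimate slightly off.
    r = n - q * d
    while r < 0:
        q -= 1
        r += d
    while r >= d:
        q += 1
        r -= d
    return q, r
```

Starting from about 4 correct bits, the estimate goes to roughly 8, 16, 32 and then 64 bits over four passes (limited here by the 48-bit format), in contrast to the one bit per step of a shift-subtract loop. Each pass costs two dependent multiplies, which is the trade-off against the simpler linear iteration.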

A special case allows turning small divisors internally into divide-by-reciprocal, which allows for a 3-cycle divide special case. But, this is a LUT cost tradeoff.
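A minimal software sketch of that kind of small-divisor special case, assuming unsigned 32-bit dividends and a hypothetical 64-entry table (the actual BJX2 table size and encoding are not given here): each entry holds a rounded-up reciprocal and a shift, so the divide becomes one multiply and one shift. The names `RECIP_LUT` and `divide_small` are made up for this example.

```python
WIDTH = 32      # dividend width in bits
LUT_SIZE = 64   # hypothetical: divisors 1..63 take the fast path


def make_reciprocal_lut(size: int = LUT_SIZE, width: int = WIDTH):
    """Build (magic, shift) pairs so that (n * magic) >> shift == n // d."""
    lut = [None]  # index 0 is unused (division by zero)
    for d in range(1, size):
        # Enough extra bits that rounding the reciprocal up never
        # changes the truncated quotient for any n < 2**width.
        shift = width + d.bit_length()
        magic = -(-(1 << shift) // d)  # ceil(2**shift / d)
        lut.append((magic, shift))
    return lut


RECIP_LUT = make_reciprocal_lut()


def divide_small(n: int, d: int):
    """Return (quotient, path); path is 'lut' or 'slow'."""
    if 0 < d < len(RECIP_LUT):
        magic, shift = RECIP_LUT[d]
        return (n * magic) >> shift, "lut"
    return n // d, "slow"  # stand-in for the general multi-cycle divider


fast = divide_small(1000, 7)  # (142, 'lut')
```

The table grows with the range of divisors covered, and each entry is a little wider than the dividend, which is the LUT cost being traded against latency.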

It could be possible in theory to support a general 3-cycle integer divide, but only if one can accept inexact results (it would still be faster than the software-based lookup table strategy).

But, it is debatable. Pure minimalism would likely favor leaving out divide (and a bunch of other stuff). Usual rationale being, say, to try to fit the entire ISA listing on a single page of paper or similar (vs having a listing with several hundred defined encodings).

Never mind that the commonly used ISAs (x86 and 64-bit ARM) have ISA listings that are considerably larger (thousands of encodings).

....
