Sujet : Re: Fast division (was Re: Suggested method for returning a string from a C program?)
De : janis_papanagnou+ng (at) *nospam* hotmail.com (Janis Papanagnou)
Groupes : comp.lang.cDate : 26. Mar 2025, 14:42:44
Autres entêtes
Organisation : A noiseless patient Spider
Message-ID : <vs108m$1scgt$1@dont-email.me>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0
On 26.03.2025 03:37, Waldek Hebisch wrote:
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
On 22.03.2025 15:07, Waldek Hebisch wrote:
>
Actually, to do fast division of N-bit number by fixed N-bit number
one need 2N-bit multiplication.
>
I just stumbled across your post and above sentence. Do you mean *one*
multiplication of 2N bit numbers? - Could you please explain that (by
an example, or could you provide a reference)?
One multiplications with 2N bit result + few other operations
like shifts and additions. Consider for example:
unsigned int
divv(unsigned int a) {
return a/1234567;
}
Ah, I suppose you meant that (in case of constant divisors) the
compiler would just _pre-compute_ the reciprocal value (and can
then just multiply, of course). - Is that understanding correct?
Given that I don't know the semantics of the assembler below I'm
just speculating, but - given you used a constant "1234567" - it
seems the compiler used a different value ("3000869427") for the
multiplication, the reciprocal - right?
My gcc-12 at -O generates the following assembly code:
divv:
.LFB0:
.cfi_startproc
movl %edi, %eax
movl $3000869427, %edx
imulq %rdx, %rax
shrq $32, %rax
subl %eax, %edi
shrl %edi
addl %edi, %eax
shrl $20, %eax
ret
So 1 multiplication, 3 shifts, subtraction and addtion. When
more bits of multiplication are available one can use smaller
number of extra operations. For example, when I change
function above so that argument is 16-bit and divisor is
12345, the code is:
divv:
.LFB0:
.cfi_startproc
movzwl %di, %eax
imull $43489, %eax, %eax
shrl $29, %eax
ret
So just 1 multiplication and 1 shift.
The idea beside this is quite simple: instead of division we
multiply by reciprocal.
First I thought that your formulation implies that there's a trick
where _any_ division a/b could be expressed by a multiplication
(and some primitive binary operations). - Which would have been cool.
Reciprocal is represented in fixed
point aritihemtic (so normally there is at least one shift to get
binary point in right place). Since divisor is fixed reciprocal can
be precomputed. This works best when there is enough accuracy,
otherwise one needs to add extra steps. Working out how
many bits of accuracy are needed is a bit tricky, in particular
by adding extra operations one can lower needed accuracy.
(The reason for my question is that for integer divisions of length N
an old DSP I used required besides shifts effectively N subtractions
to create the result and modulus; it didn't use any multiplications.)
Method above is good when you have fast and wide multiplier (compared
to your numbers). Also there is cost of precomputation,
to gain divisor must be fixed or at least change slowly.
If you have varying divisor then one can use Newton method
(IIUC this is what modern CPU-s use). Shifts and subtractions
are good when you do not have fast multiplier.
Actually the (meanwhile very old) DSP I used (in the mid 1980's) had
a fast multiplier; + and - were done in one cycle, * (asymptotically)
also one cycle. The interesting part was that the [general] non-const
division was "constructed" (as opposed to pre-existing as an opcode).
All operations (roughly) required just one cycle but the division; it
was constructed by a 'repeat' opcode for the next opcode. So if you
had written something like LACC addr1 ; RPTK 16 ; SUBC addr2 that
would have divided the data *addr1 / *addr2 (informally written)
providing both results, the division and the modulus.
Out of interest I've implemented the logic in "C" to demonstrate it
(the variable names reflect the DSP terminology and the sizes of the
data words (16) and accumulator (32), only the '<< 15' is originally
part of the 'subc' opcode/"C" function (but shifts did not cost any
extra cycles, they've been done in passing, sort of):
void subc (int32_t * acc, int32_t * dma)
{
int32_t acc_out;
if ((acc_out = *acc - *dma) >= 0)
*acc = acc_out << 1 | 1;
else
*acc <<= 1;
}
...
int16_t z = 47;
int16_t n = 6;
// calculate "z div n" and "z mod n"
int32_t acc = z;
int32_t dma = n << 15;
for (int i=1; i<=16; i++)
subc (&acc, &dma);
printf ("%d div %d = %d\n", z, n, acc&0xffff);
printf ("%d mod %d = %d\n", z, n, acc>>16);
I considered that DSP feature quite interesting. That was from a time
where the usual processors required something like 4 cycles for + and
-, around 16 cycles for *, and 80 cycles for the division. (Numbers
are just faint memories from an 68k CPU, so correct me if I'm wrong.)
Janis