Newsportal USENET - Re: OT: Re: Sieve of Erastosthenes optimized to the max

Vir Campestris <vir.campestris@invalid.invalid> writes:

I'm going to come back to this slightly earlier version, and add
comments inline.

Okay good deal. I'm taking out some parts that look like they
are no longer relevant.

On 10/08/2024 15:07, Tim Rentsch wrote:
>
Vir Campestris <vir.campestris@invalid.invalid> writes:
>
On 20/07/2024 15:41, Tim Rentsch wrote:
>
Vir Campestris <vir.campestris@invalid.invalid> writes:
>
On 02/07/2024 07:20, Tim Rentsch wrote:
>
[...]
>
Both divide and mod can be simplified if we use a different
representation for numbers. The idea is to separate the
number into a mod 30 part and divided by 30 part. An
arbitrary number N can be split into an ordered pair
>
   N/30 , N%30
>
stored in, let's say, an 8 byte quantity with seven bytes
for N/30 and one byte for N%30. Let me call these two
parts i and a for one number N, and j and b for a second
number M. Then two form N*M we need to find
>
   (i*30+a) * (j*30+b)
>
which is
>
   (i*30*j*30) + (i*30*b) + (j*30*a) + a*b
>
Of course we want to take the product and express it in
the form (product/30, product%30), which can be done by
>
   30*i*j + i*b + j*a + (a*b)/30,   (a*b)%30
>
The two numbers a and b are both under 30, so a*b/30 and
a*b%30 can be done by lookup in a small array.
>
   30*i*j + i*b + j*a + divide_30[a][b],   modulo_30[a][b]
>
Now all operations can be done using just multiplies and
table lookups. Dividing by 30 is just taking the first
component; remainder 30 is just taking the second component.
>
One further refinement: for numbers relatively prime to 2,
3, and 5, there are only 8 possibles, so the modulo 30
component can be reduced to just three bits. That makes the
tables smaller, at a cost of needing a table lookup to find
and 'a' or 'b' part.
>
Does this all make sense?
>
Yes, it does. It's not a direction I was thinking of at all,
although I too had table lookups in mind.
>
What I had considered is an extrapolation from my original code.
>
I read the comments below in conjunction with getting a better
understanding of your original code, which is part of what took
me so long before responding. Before I get to the stuff below
(which I think I will save for a second reply), here is some code
expanding on what I said above. It should compile cleanly as is.
However, its purpose is to illustrate what I said above.
>
   typedef unsigned long NN, Index, Count, Mask;
   typedef unsigned char Byte;
>
I've been using uint64_t, not unsigned long. When I started C
unsigned long was 32 bits; I'm never quite sure what you'll get.

I'm coming around to the point of view that size_t is the best
choice for these types. In any case though the point of using a
typedef is so the more meaningful names can be used but still
changed easily where they are defined.

   static void remove_multiples( Index, Count, Index );
>
   static Byte bytes[1000000000];
   static NN mods[] = { 1, 7, 11, 13, 17, 19, 23, 29, };
>
   Count
   sieve( NN limit ){
   Count r = 0;
   Index i;
   Count shift;
   Index i_limit = (limit + 29)/30;
>
   bytes[0] = 1;
   for( i = 0; i < i_limit; i++ ){
   Byte bits = bytes[i];
   for( shift = 0; shift < 8; shift++ ){
   NN p = i*30 + mods[shift];
   if( bits & 1<<shift ) continue;
   if( p >= limit ) continue;
   r++; // count how many primes found
   // could print the prime 'p' here if desired
   if( p*p < limit ) remove_multiples( i, shift, i_limit );
   }
   }
>
   return r;
   }
>
   static Byte shift_for[30] = {
   9,0,9,9,9,9, 9,1,9,9,9,2, 9,3,9,9,9,4, 9,5,9,9,9,6, 9,9,9,9,9,7,
   };
>
static Byte ab_div_30[8][8] = {
{ 1*1/30, 1*7/30, 1*11/30, 1*13/30, 1*17/30, 1*19/30, 1*23/30, 1*29/30 },
{ 7*1/30, 7*7/30, 7*11/30, 7*13/30, 7*17/30, 7*19/30, 7*23/30, 7*29/30 },
{ 11*1/30,11*7/30,11*11/30,11*13/30,11*17/30,11*19/30,11*23/30,11*29/30 },
{ 13*1/30,13*7/30,13*11/30,13*13/30,13*17/30,13*19/30,13*23/30,13*29/30 },
{ 17*1/30,17*7/30,17*11/30,17*13/30,17*17/30,17*19/30,17*23/30,17*29/30 },
{ 19*1/30,19*7/30,19*11/30,19*13/30,19*17/30,19*19/30,19*23/30,19*29/30 },
{ 23*1/30,23*7/30,23*11/30,23*13/30,23*17/30,23*19/30,23*23/30,23*29/30 },
{ 29*1/30,29*7/30,29*11/30,29*13/30,29*17/30,29*19/30,29*23/30,29*29/30 },
};
>
static Byte ab_mod_30[8][8] = {
{ 1*1%30, 1*7%30, 1*11%30, 1*13%30, 1*17%30, 1*19%30, 1*23%30, 1*29%30 },
{ 7*1%30, 7*7%30, 7*11%30, 7*13%30, 7*17%30, 7*19%30, 7*23%30, 7*29%30 },
{ 11*1%30,11*7%30,11*11%30,11*13%30,11*17%30,11*19%30,11*23%30,11*29%30 },
{ 13*1%30,13*7%30,13*11%30,13*13%30,13*17%30,13*19%30,13*23%30,13*29%30 },
{ 17*1%30,17*7%30,17*11%30,17*13%30,17*17%30,17*19%30,17*23%30,17*29%30 },
{ 19*1%30,19*7%30,19*11%30,19*13%30,19*17%30,19*19%30,19*23%30,19*29%30 },
{ 23*1%30,23*7%30,23*11%30,23*13%30,23*17%30,23*19%30,23*23%30,23*29%30 },
{ 29*1%30,29*7%30,29*11%30,29*13%30,29*17%30,29*19%30,29*23%30,29*29%30 },
};
>
constexpr on those tables might give the optimiser something else to
play with.

All of these tables are going to disappear in the next revision.

void
remove_multiples( Index i, Count shift_a, Index i_limit ){
>
So i is (prime/30) and shift_a is the bit in the mask byte -
corresponding to (prime%30)?

Right. The value 'a' is 'mods[shift_a]', and the value 'shift_a' is
'shift_for[a]'. Again though we will be dispensing with shift_for[]
and put in code that can be compiled to give constant values.

   Index j, k;
   NN a = mods[shift_a], b, c;
   Count shift;
>
   for( shift = shift_a; shift < 8; shift++ ){
   j = i;
   b = mods[shift];
   k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
   c = ab_mod_30[ shift_a ][ shift ];
   if( k >= i_limit ) continue;
   bytes[k] |= 1 << shift_for[c];
   }
>
   for( j = i+1; (30UL*i+a)*(30UL*j+1) < i_limit * 30; j++ ){
   for( shift = 0; shift < 8; shift++ ){
   b = mods[shift];
   k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
   c = ab_mod_30[ shift_a ][ shift ];
   if( k >= i_limit ) continue;
   bytes[k] |= 1 << shift_for[c];
   }
   }
   }
>
In both sieve() and remove_multiples(), there is an inner loop
that ranges over the possible shift values. (There is also an
unnested loop in remove_multiples() that ranges over a subset of
shift values, but for now please ignore that.)
>
Since these loops always perform exactly eight iterations, they
can be completely unrolled without expanding the code too much.
Here is a revised version of the outer loop in sieve() (actually
the code below is not yet ready for compilation, because some
things have changed):
>
   for( i = 0; i < i_limit; i++ ){
   Byte bits = bytes[i] ^ 0xFF;
   if( bits & 0001 ) remove_multiples( i_limit, i, 1 ), r++;
   if( bits & 0002 ) remove_multiples( i_limit, i, 7 ), r++;
   if( bits & 0004 ) remove_multiples( i_limit, i, 11 ), r++;
   if( bits & 0010 ) remove_multiples( i_limit, i, 13 ), r++;
   if( bits & 0020 ) remove_multiples( i_limit, i, 17 ), r++;
   if( bits & 0040 ) remove_multiples( i_limit, i, 19 ), r++;
   if( bits & 0100 ) remove_multiples( i_limit, i, 23 ), r++;
   if( bits & 0200 ) remove_multiples( i_limit, i, 29 ), r++;
   }
>
Before unrolling always benchmark. (Modern compilers are pretty smart,
and sometimes will unroll for you. Modern processors with branch
prediction are pretty good about working out when to jump and when not
to. And with speculative execution they sometimes run both paths from
a branch... I could go on!)

I understand the advice. Keep in mind though that the code I'm
showing (and also true in the next version) is presented as a way of
explaining how I got to where I got. I didn't start with the loop
version and unroll it; everything was unrolled right from the get
go. And there are more revisions to come, even after the next one
(to be shown below).

We changed the interface to remove_multiples() to pass along the
index value 'i' and also the value for 'a' rather than the value
calculated for the prime 'p'. Since the loop has been unrolled,
the values for the 'a' arguments are all constants (and of course
all the mask values are constants).
>
After making a similar change to the remove_multiples() loop, the
values for the 'b's are likewise all constants. If the optimizer
is smart enough, all the 'a*b/30' values and 'a*b%30' values turn
into constants. Also the expressions 'i*b' and 'j*a' all turn
into multiplication by a constant. Furthermore, to determine the
appropriate bit value to 'or' in, an eight-choice conditional
expression that tests each of the eight possible residues can
similarly be optimized away so the masks to 'or' in are all
constants. Do you see how that all works?
>
The point of all this unrolling is to get rid of all the shifting
and table lookups. The only variables involved are the loop
variables 'i' and 'j'; everything else has been turned into a
constant. Granted, the generated code has been expanded a great
deal (basically a factor of 8*8 == 64), but the pieces are now
much simpler so the result is still manageable.
>
And now a few minutes later as I read that I understand the reason
for the unrolling better. But my point still stands - benchmark
it.
>
Of course the decision on whether to unroll or not might depend on
the target CPU!

For this approach to work it's crucial to unroll completely to make
sure all the divides and remainders, and also the mask choices, get
optimized out. I suppose it's possible that some architecture out
there could be faster without all constant folding, but it seems
unlikely, and in any case I made an early decision to write the code
using this approach, which has worked out well.

I wanted to be sure you understand how the original idea is
transformed into a better performing version before posting
something more refined. I get that your time is at a premium,
still you might trying going through the exercise of unrolling
the loops and writing a set_product() routine (or macro) to get a
sense of where I'm going here.
>
Hope to send response to send part sometime soon.
>
You seem to have spotted a pattern in the steps that I didn't - I
looked at the mask values, and decided it wasn't worth working out
which of the 8 possible sequences to use - remembering which was too
expensive. But it's also (I think, I haven't coded it) to work out
whether you need to step 2, 4, 8 times the prime. And that would save
more memory.

The possible steps for mod30 are 2 or 4 or 6 times the prime.

Another thought BTW - whether it's worth storing with a larger modulo
- say 2*3*5*7 - not just 2*3*5 is a different decision from deciding
whether the steps between primes should take that into consideration.

My guess is that the possible steps are still just 2 or 4 or 6
(times the prime), but I haven't checked that.

Now here is revised code where the unrolling has been done, and also
uses optimizable mask production:

void
mod30_sieve_x( Index N, NN limit, Index i_limit ){
Index i;

bytes[0] = 1;
for( i = 0; i < i_limit; i++ ){
Bits cbits = bytes[i];
Bits pbits = cbits ^ -1;
if( pbits & 0001 ) remove_multiples_x( N, i, 1, pbits );
if( pbits & 0002 ) remove_multiples_x( N, i, 7, pbits );
if( pbits & 0004 ) remove_multiples_x( N, i, 11, pbits );
if( pbits & 0010 ) remove_multiples_x( N, i, 13, pbits );
if( pbits & 0020 ) remove_multiples_x( N, i, 17, pbits );
if( pbits & 0040 ) remove_multiples_x( N, i, 19, pbits );
if( pbits & 0100 ) remove_multiples_x( N, i, 23, pbits );
if( pbits & 0200 ) remove_multiples_x( N, i, 29, pbits );
}
}

#define set_product_x( i, a, j, b ) ( \
bytes[ i*j*30 + i*b + j*a + a*b/30 ] |= BIT_FOR( (a*b%30) ) \
)

#define BIT_FOR( m ) ( \
m == 1 ? 0001 : m == 7 ? 0002 : m == 11 ? 0004 : m == 13 ? 0010 : \
m == 17 ? 0020 : m == 19 ? 0040 : m == 23 ? 0100 : m == 29 ? 0200 : \
0 \
)

void
remove_multiples_x( Index N, Index i, NN a, Bits ibits ){
NN p = 30*i + a;
Index j;
Index jn = (N*30/p + 29)/30;

switch( a ){
case 1: set_product_x( i, a, i, 1 );
case 7: set_product_x( i, a, i, 7 );
case 11: set_product_x( i, a, i, 11 );
case 13: set_product_x( i, a, i, 13 );
case 17: set_product_x( i, a, i, 17 );
case 19: set_product_x( i, a, i, 19 );
case 23: set_product_x( i, a, i, 23 );
case 29: set_product_x( i, a, i, 29 );
}

for( j = i+1; j < jn; j++ ){
set_product_x( i, a, j, 1 );
set_product_x( i, a, j, 7 );
set_product_x( i, a, j, 11 );
set_product_x( i, a, j, 13 );
set_product_x( i, a, j, 17 );
set_product_x( i, a, j, 19 );
set_product_x( i, a, j, 23 );
set_product_x( i, a, j, 29 );
}
}

Some comments about the upper bounds of the loops:

First, assume that 'i_limit' has been computed in conjunction with
the determination of the parameter N so that it is exactly right.

Second, the initializing expression for 'jn' is safe but may miss
part of one more value of j. There are different ways to take care
of that, but it's a minor detail so I'm skipping over it for now.
Of course feel free to think about how that could be done if you
want to, and if not then no worries, I expect to say more about that
in a later followup.

Date	Sujet	#	Auteur
3 Jul 24	Re: Sieve of Erastosthenes optimized to the max	51	Vir Campestris
15 Jul 24	Re: Sieve of Erastosthenes optimized to the max	1	Tim Rentsch
20 Jul 24	Re: Sieve of Erastosthenes optimized to the max	48	Tim Rentsch
25 Jul 24	OT: Re: Sieve of Erastosthenes optimized to the max	47	Vir Campestris
10 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	44	Tim Rentsch
12 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	2	Vir Campestris
16 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Tim Rentsch
15 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	41	Vir Campestris
16 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	40	Tim Rentsch
16 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	38	Bonita Montero
16 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Bonita Montero
19 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	19	Vir Campestris
20 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	18	Bonita Montero
20 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	17	Bonita Montero
20 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Bonita Montero
20 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	15	Vir Campestris
20 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Bonita Montero
26 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	13	Tim Rentsch
27 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	12	Bonita Montero
1 Sep 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	11	Vir Campestris
2 Sep 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	10	Tim Rentsch
2 Sep 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Bonita Montero
3 Sep 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	8	Vir Campestris
28 Sep 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	7	Tim Rentsch
28 Sep 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	6	Keith Thompson
2 Oct 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	5	Vir Campestris
2 Oct 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Bonita Montero
7 Oct 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	3	Tim Rentsch
20 Oct 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	2	Vir Campestris
4 Nov 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Tim Rentsch
19 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	12	Vir Campestris
20 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	4	red floyd
20 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	2	Vir Campestris
26 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Tim Rentsch
26 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Tim Rentsch
20 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	3	Bonita Montero
20 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Bonita Montero
20 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Bonita Montero
20 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	4	Bonita Montero
22 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	3	Vir Campestris
22 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	2	Bonita Montero
22 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Vir Campestris
22 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	5	Vir Campestris
23 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	3	red floyd
26 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Tim Rentsch
28 Sep 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Andrey Tarasevich
26 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Tim Rentsch
19 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Tim Rentsch
11 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Tim Rentsch
11 Aug 24	Re: OT: Re: Sieve of Erastosthenes optimized to the max	1	Tim Rentsch
23 Jul 24	Re: Sieve of Erastosthenes optimized to the max	1	Tim Rentsch