Vir Campestris <
vir.campestris@invalid.invalid> writes:
On 23/07/2024 15:34, Tim Rentsch wrote:
>
Vir Campestris <vir.campestris@invalid.invalid> writes:
>
[...]
>
I was hoping for some feedback, even if very brief,
before I continue.
>
Sorry, I was slow. It's really odd, I'm retired, and I ought to
have lots of time for things like this... and yet I seem to be
busy all the time.
No worries, I understand.
I'm sorry to be so long in responding. Among other things
there was some sort of snafu with my news server, resulting
in hundreds of lost and/or duplicated messages, and that
took a while to sort out.
On 20/07/2024 15:41, Tim Rentsch wrote:
>
Vir Campestris <vir.campestris@invalid.invalid> writes:
>
On 02/07/2024 07:20, Tim Rentsch wrote:
[...]
Both divide and mod can be simplified if we use a different
representation for numbers. The idea is to separate the
number into a mod 30 part and divided by 30 part. An
arbitrary number N can be split into an ordered pair
>
N/30 , N%30
>
stored in, let's say, an 8 byte quantity with seven bytes
for N/30 and one byte for N%30. Let me call these two
parts i and a for one number N, and j and b for a second
number M. Then two form N*M we need to find
>
I suspect the cost of extracting the divided value from the 7
bytes will be prohibitive. You may find it's best to have one
table for the N/30 parts (big entries, word aligned) and another
for the N%30 part (small entries, byte aligned). BICBW.
Two tables, absolutely. What does BICBW stand for?
(i*30+a) * (j*30+b)
>
which is
>
(i*30*j*30) + (i*30*b) + (j*30*a) + a*b
>
Of course we want to take the product and express it in
the form (product/30, product%30), which can be done by
>
30*i*j + i*b + j*a + (a*b)/30, (a*b)%30
>
The two numbers a and b are both under 30, so a*b/30 and
a*b%30 can be done by lookup in a small array.
>
30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]
>
Now all operations can be done using just multiplies and
table lookups. Dividing by 30 is just taking the first
component; remainder 30 is just taking the second component.
>
One further refinement: for numbers relatively prime to 2,
3, and 5, there are only 8 possibles, so the modulo 30
component can be reduced to just three bits. That makes the
tables smaller, at a cost of needing a table lookup to find
and 'a' or 'b' part.
>
Does this all make sense?
>
Yes, it does. It's not a direction I was thinking of at all,
although I too had table lookups in mind.
>
What I had considered is an extrapolation from my original code.
I read the comments below in conjunction with getting a better
understanding of your original code, which is part of what took
me so long before responding. Before I get to the stuff below
(which I think I will save for a second reply), here is some code
expanding on what I said above. It should compile cleanly as is.
However, its purpose is to illustrate what I said above.
typedef unsigned long NN, Index, Count, Mask;
typedef unsigned char Byte;
static void remove_multiples( Index, Count, Index );
static Byte bytes[1000000000];
static NN mods[] = { 1, 7, 11, 13, 17, 19, 23, 29, };
Count
sieve( NN limit ){
Count r = 0;
Index i;
Count shift;
Index i_limit = (limit + 29)/30;
bytes[0] = 1;
for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i];
for( shift = 0; shift < 8; shift++ ){
NN p = i*30 + mods[shift];
if( bits & 1<<shift ) continue;
if( p >= limit ) continue;
r++; // count how many primes found
// could print the prime 'p' here if desired
if( p*p < limit ) remove_multiples( i, shift, i_limit );
}
}
return r;
}
static Byte shift_for[30] = {
9,0,9,9,9,9, 9,1,9,9,9,2, 9,3,9,9,9,4, 9,5,9,9,9,6, 9,9,9,9,9,7,
};
static Byte ab_div_30[8][8] = {
{ 1*1/30, 1*7/30, 1*11/30, 1*13/30, 1*17/30, 1*19/30, 1*23/30, 1*29/30 },
{ 7*1/30, 7*7/30, 7*11/30, 7*13/30, 7*17/30, 7*19/30, 7*23/30, 7*29/30 },
{ 11*1/30,11*7/30,11*11/30,11*13/30,11*17/30,11*19/30,11*23/30,11*29/30 },
{ 13*1/30,13*7/30,13*11/30,13*13/30,13*17/30,13*19/30,13*23/30,13*29/30 },
{ 17*1/30,17*7/30,17*11/30,17*13/30,17*17/30,17*19/30,17*23/30,17*29/30 },
{ 19*1/30,19*7/30,19*11/30,19*13/30,19*17/30,19*19/30,19*23/30,19*29/30 },
{ 23*1/30,23*7/30,23*11/30,23*13/30,23*17/30,23*19/30,23*23/30,23*29/30 },
{ 29*1/30,29*7/30,29*11/30,29*13/30,29*17/30,29*19/30,29*23/30,29*29/30 },
};
static Byte ab_mod_30[8][8] = {
{ 1*1%30, 1*7%30, 1*11%30, 1*13%30, 1*17%30, 1*19%30, 1*23%30, 1*29%30 },
{ 7*1%30, 7*7%30, 7*11%30, 7*13%30, 7*17%30, 7*19%30, 7*23%30, 7*29%30 },
{ 11*1%30,11*7%30,11*11%30,11*13%30,11*17%30,11*19%30,11*23%30,11*29%30 },
{ 13*1%30,13*7%30,13*11%30,13*13%30,13*17%30,13*19%30,13*23%30,13*29%30 },
{ 17*1%30,17*7%30,17*11%30,17*13%30,17*17%30,17*19%30,17*23%30,17*29%30 },
{ 19*1%30,19*7%30,19*11%30,19*13%30,19*17%30,19*19%30,19*23%30,19*29%30 },
{ 23*1%30,23*7%30,23*11%30,23*13%30,23*17%30,23*19%30,23*23%30,23*29%30 },
{ 29*1%30,29*7%30,29*11%30,29*13%30,29*17%30,29*19%30,29*23%30,29*29%30 },
};
void
remove_multiples( Index i, Count shift_a, Index i_limit ){
Index j, k;
NN a = mods[shift_a], b, c;
Count shift;
for( shift = shift_a; shift < 8; shift++ ){
j = i;
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}
for( j = i+1; (30UL*i+a)*(30UL*j+1) < i_limit * 30; j++ ){
for( shift = 0; shift < 8; shift++ ){
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}
}
}
In both sieve() and remove_multiples(), there is an inner loop
that ranges over the possible shift values. (There is also an
unnested loop in remove_multiples() that ranges over a subset of
shift values, but for now please ignore that.)
Since these loops always perform exactly eight iterations, they
can be completely unrolled without expanding the code too much.
Here is a revised version of the outer loop in sieve() (actually
the code below is not yet ready for compilation, because some
things have changed):
for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i] ^ 0xFF;
if( bits & 0001 ) remove_multiples( i_limit, i, 1 ), r++;
if( bits & 0002 ) remove_multiples( i_limit, i, 7 ), r++;
if( bits & 0004 ) remove_multiples( i_limit, i, 11 ), r++;
if( bits & 0010 ) remove_multiples( i_limit, i, 13 ), r++;
if( bits & 0020 ) remove_multiples( i_limit, i, 17 ), r++;
if( bits & 0040 ) remove_multiples( i_limit, i, 19 ), r++;
if( bits & 0100 ) remove_multiples( i_limit, i, 23 ), r++;
if( bits & 0200 ) remove_multiples( i_limit, i, 29 ), r++;
}
We changed the interface to remove_multiples() to pass along the
index value 'i' and also the value for 'a' rather than the value
calculated for the prime 'p'. Since the loop has been unrolled,
the values for the 'a' arguments are all constants (and of course
all the mask values are constants).
After making a similar change to the remove_multiples() loop, the
values for the 'b's are likewise all constants. If the optimizer
is smart enough, all the 'a*b/30' values and 'a*b%30' values turn
into constants. Also the expressions 'i*b' and 'j*a' all turn
into multiplication by a constant. Furthermore, to determine the
appropriate bit value to 'or' in, an eight-choice conditional
expression that tests each of the eight possible residues can
similarly be optimized away so the masks to 'or' in are all
constants. Do you see how that all works?
The point of all this unrolling is to get rid of all the shifting
and table lookups. The only variables involved are the loop
variables 'i' and 'j'; everything else has been turned into a
constant. Granted, the generated code has been expanded a great
deal (basically a factor of 8*8 == 64), but the pieces are now
much simpler so the result is still manageable.
I wanted to be sure you understand how the original idea is
transformed into a better performing version before posting
something more refined. I get that your time is at a premium,
still you might trying going through the exercise of unrolling
the loops and writing a set_product() routine (or macro) to get a
sense of where I'm going here.
Hope to send response to send part sometime soon.