Liste des Groupes | Revenir à cl c++ |
I'm going to come back to this slightly earlier version, and add
comments inline.
On 10/08/2024 15:07, Tim Rentsch wrote:
>Vir Campestris <vir.campestris@invalid.invalid> writes:>
>On 20/07/2024 15:41, Tim Rentsch wrote:>
>Vir Campestris <vir.campestris@invalid.invalid> writes:
>On 02/07/2024 07:20, Tim Rentsch wrote:
[...]
>>Both divide and mod can be simplified if we use a different>
representation for numbers. The idea is to separate the
number into a mod 30 part and divided by 30 part. An
arbitrary number N can be split into an ordered pair
>
N/30 , N%30
>
stored in, let's say, an 8 byte quantity with seven bytes
for N/30 and one byte for N%30. Let me call these two
parts i and a for one number N, and j and b for a second
number M. Then two form N*M we need to find
>
(i*30+a) * (j*30+b)
>
which is
>
(i*30*j*30) + (i*30*b) + (j*30*a) + a*b
>
Of course we want to take the product and express it in
the form (product/30, product%30), which can be done by
>
30*i*j + i*b + j*a + (a*b)/30, (a*b)%30
>
The two numbers a and b are both under 30, so a*b/30 and
a*b%30 can be done by lookup in a small array.
>
30*i*j + i*b + j*a + divide_30[a][b], modulo_30[a][b]
>
Now all operations can be done using just multiplies and
table lookups. Dividing by 30 is just taking the first
component; remainder 30 is just taking the second component.
>
One further refinement: for numbers relatively prime to 2,
3, and 5, there are only 8 possibles, so the modulo 30
component can be reduced to just three bits. That makes the
tables smaller, at a cost of needing a table lookup to find
and 'a' or 'b' part.
>
Does this all make sense?
Yes, it does. It's not a direction I was thinking of at all,
although I too had table lookups in mind.
>
What I had considered is an extrapolation from my original code.
I read the comments below in conjunction with getting a better
understanding of your original code, which is part of what took
me so long before responding. Before I get to the stuff below
(which I think I will save for a second reply), here is some code
expanding on what I said above. It should compile cleanly as is.
However, its purpose is to illustrate what I said above.
>
typedef unsigned long NN, Index, Count, Mask;
typedef unsigned char Byte;
I've been using uint64_t, not unsigned long. When I started C
unsigned long was 32 bits; I'm never quite sure what you'll get.
static void remove_multiples( Index, Count, Index );>
>
static Byte bytes[1000000000];
static NN mods[] = { 1, 7, 11, 13, 17, 19, 23, 29, };
>
Count
sieve( NN limit ){
Count r = 0;
Index i;
Count shift;
Index i_limit = (limit + 29)/30;
>
bytes[0] = 1;
for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i];
for( shift = 0; shift < 8; shift++ ){
NN p = i*30 + mods[shift];
if( bits & 1<<shift ) continue;
if( p >= limit ) continue;
r++; // count how many primes found
// could print the prime 'p' here if desired
if( p*p < limit ) remove_multiples( i, shift, i_limit );
}
}
>
return r;
}
>
static Byte shift_for[30] = {
9,0,9,9,9,9, 9,1,9,9,9,2, 9,3,9,9,9,4, 9,5,9,9,9,6, 9,9,9,9,9,7,
};
>
static Byte ab_div_30[8][8] = {
{ 1*1/30, 1*7/30, 1*11/30, 1*13/30, 1*17/30, 1*19/30, 1*23/30, 1*29/30 },
{ 7*1/30, 7*7/30, 7*11/30, 7*13/30, 7*17/30, 7*19/30, 7*23/30, 7*29/30 },
{ 11*1/30,11*7/30,11*11/30,11*13/30,11*17/30,11*19/30,11*23/30,11*29/30 },
{ 13*1/30,13*7/30,13*11/30,13*13/30,13*17/30,13*19/30,13*23/30,13*29/30 },
{ 17*1/30,17*7/30,17*11/30,17*13/30,17*17/30,17*19/30,17*23/30,17*29/30 },
{ 19*1/30,19*7/30,19*11/30,19*13/30,19*17/30,19*19/30,19*23/30,19*29/30 },
{ 23*1/30,23*7/30,23*11/30,23*13/30,23*17/30,23*19/30,23*23/30,23*29/30 },
{ 29*1/30,29*7/30,29*11/30,29*13/30,29*17/30,29*19/30,29*23/30,29*29/30 },
};
>
static Byte ab_mod_30[8][8] = {
{ 1*1%30, 1*7%30, 1*11%30, 1*13%30, 1*17%30, 1*19%30, 1*23%30, 1*29%30 },
{ 7*1%30, 7*7%30, 7*11%30, 7*13%30, 7*17%30, 7*19%30, 7*23%30, 7*29%30 },
{ 11*1%30,11*7%30,11*11%30,11*13%30,11*17%30,11*19%30,11*23%30,11*29%30 },
{ 13*1%30,13*7%30,13*11%30,13*13%30,13*17%30,13*19%30,13*23%30,13*29%30 },
{ 17*1%30,17*7%30,17*11%30,17*13%30,17*17%30,17*19%30,17*23%30,17*29%30 },
{ 19*1%30,19*7%30,19*11%30,19*13%30,19*17%30,19*19%30,19*23%30,19*29%30 },
{ 23*1%30,23*7%30,23*11%30,23*13%30,23*17%30,23*19%30,23*23%30,23*29%30 },
{ 29*1%30,29*7%30,29*11%30,29*13%30,29*17%30,29*19%30,29*23%30,29*29%30 },
};
constexpr on those tables might give the optimiser something else to
play with.
void>
remove_multiples( Index i, Count shift_a, Index i_limit ){
So i is (prime/30) and shift_a is the bit in the mask byte -
corresponding to (prime%30)?
Index j, k;>
NN a = mods[shift_a], b, c;
Count shift;
>
for( shift = shift_a; shift < 8; shift++ ){
j = i;
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}
>
for( j = i+1; (30UL*i+a)*(30UL*j+1) < i_limit * 30; j++ ){
for( shift = 0; shift < 8; shift++ ){
b = mods[shift];
k = i*j*30 + i*b + j*a + ab_div_30[ shift_a ][ shift ];
c = ab_mod_30[ shift_a ][ shift ];
if( k >= i_limit ) continue;
bytes[k] |= 1 << shift_for[c];
}
}
}
>
In both sieve() and remove_multiples(), there is an inner loop
that ranges over the possible shift values. (There is also an
unnested loop in remove_multiples() that ranges over a subset of
shift values, but for now please ignore that.)
>
Since these loops always perform exactly eight iterations, they
can be completely unrolled without expanding the code too much.
Here is a revised version of the outer loop in sieve() (actually
the code below is not yet ready for compilation, because some
things have changed):
>
for( i = 0; i < i_limit; i++ ){
Byte bits = bytes[i] ^ 0xFF;
if( bits & 0001 ) remove_multiples( i_limit, i, 1 ), r++;
if( bits & 0002 ) remove_multiples( i_limit, i, 7 ), r++;
if( bits & 0004 ) remove_multiples( i_limit, i, 11 ), r++;
if( bits & 0010 ) remove_multiples( i_limit, i, 13 ), r++;
if( bits & 0020 ) remove_multiples( i_limit, i, 17 ), r++;
if( bits & 0040 ) remove_multiples( i_limit, i, 19 ), r++;
if( bits & 0100 ) remove_multiples( i_limit, i, 23 ), r++;
if( bits & 0200 ) remove_multiples( i_limit, i, 29 ), r++;
}
Before unrolling always benchmark. (Modern compilers are pretty smart,
and sometimes will unroll for you. Modern processors with branch
prediction are pretty good about working out when to jump and when not
to. And with speculative execution they sometimes run both paths from
a branch... I could go on!)
We changed the interface to remove_multiples() to pass along the>
index value 'i' and also the value for 'a' rather than the value
calculated for the prime 'p'. Since the loop has been unrolled,
the values for the 'a' arguments are all constants (and of course
all the mask values are constants).
>
After making a similar change to the remove_multiples() loop, the
values for the 'b's are likewise all constants. If the optimizer
is smart enough, all the 'a*b/30' values and 'a*b%30' values turn
into constants. Also the expressions 'i*b' and 'j*a' all turn
into multiplication by a constant. Furthermore, to determine the
appropriate bit value to 'or' in, an eight-choice conditional
expression that tests each of the eight possible residues can
similarly be optimized away so the masks to 'or' in are all
constants. Do you see how that all works?
>
The point of all this unrolling is to get rid of all the shifting
and table lookups. The only variables involved are the loop
variables 'i' and 'j'; everything else has been turned into a
constant. Granted, the generated code has been expanded a great
deal (basically a factor of 8*8 == 64), but the pieces are now
much simpler so the result is still manageable.
And now a few minutes later as I read that I understand the reason
for the unrolling better. But my point still stands - benchmark
it.
>
Of course the decision on whether to unroll or not might depend on
the target CPU!
I wanted to be sure you understand how the original idea is>
transformed into a better performing version before posting
something more refined. I get that your time is at a premium,
still you might trying going through the exercise of unrolling
the loops and writing a set_product() routine (or macro) to get a
sense of where I'm going here.
>
Hope to send response to send part sometime soon.
You seem to have spotted a pattern in the steps that I didn't - I
looked at the mask values, and decided it wasn't worth working out
which of the 8 possible sequences to use - remembering which was too
expensive. But it's also (I think, I haven't coded it) to work out
whether you need to step 2, 4, 8 times the prime. And that would save
more memory.
Another thought BTW - whether it's worth storing with a larger modulo
- say 2*3*5*7 - not just 2*3*5 is a different decision from deciding
whether the steps between primes should take that into consideration.
Les messages affichés proviennent d'usenet.