Vir Campestris <
vir.campestris@invalid.invalid> writes:
This posting is my third (and I think last) response to your
last set of comments.
On 20/07/2024 15:41, Tim Rentsch wrote:
>
Vir Campestris <vir.campestris@invalid.invalid> writes:
>
On 02/07/2024 07:20, Tim Rentsch wrote:
[...]
What I had considered is an extrapolation from my original code.
About the original code.. One part of it took me a while to
figure out. I would summarize it as follows. Rather than
striking all multiples of a given prime before going on to the
next one, you put a limit on the memory of where the strikes
can occur to keep those bytes in the L1 cache. So several or
many primes are processed at once, as long as the products are
inside the target area (which I think is on the order of half a
megabyte). I haven't done any careful measurements, but it
does appear as though this approach provides a significant
performance gain. (I don't have any estimates for how much.)
A cost of using this method is it takes some memory to keep
track of where any non-finished primes are in their multiple
striking. That cost can be relevant if amount of memory needed
starts to be a significant portion of the L1 cache size.
For each prime my original code starts with p^2, and sets the bit.
It than adds p*2, and sets the bit for that. Since I wasn't
storing even numbers every prime is odd, which means p^2 is odd.
p^2+p would be even, so I skip that and go on to p^2+p*2, and so
on.
>
Given the mod30 storage I should be able to make a table with 8
entries, each containing a step and a mask. The step is the
number of bytes in the big table to step to for the next bit to
set; the mask is the bit to set.
>
For 37, for example, the first bits we need to set are these:
>
Mul'ple value /30 Step %30
37 1369 45 19
41 1517 50 5 17
47 1739 57 7 29
49 1813 60 3 13
53 1961 65 5 11
59 2183 72 7 23
61 2257 75 3 7
67 2479 82 7 19
71 2627 87 5 17
77 2849 94 7 29
79 2923 97 3 13
83 3071 102 5 11
89 3293 109 7 23
91 3367 112 3 7
>
>
You'll see the pattern; there is a repeat 8 lines long of both
the Step (which is the delta in the word we need to access) and
the %30 value.
Right. There is a regular pattern of deltas, corresponding to
a kind of strength reduction of doing the multiplies.
So what the code should do is start at p^2, and build this table.
It'll be different for every prime, and will be quite expensive to
build. But once the table is built it can add the step onto the
current table index, then set the bit corresponding to the %30
value. The table should of course contain only the step and the
bitmask corresponding to the %30 value.
Note that the %30 tables can be shared. There are 8 of them, one
for each of the eight residues mod 30 of the original prime.
(I used 37 in this example BTW, because I wanted a value more than
30, and with 31 the /30 column increments by 1 each time I added a
prime)
I get a different result for starting with 31 but that's not
important, I expect one of us made a simple calculation mistake.
If we are looking at a prime p = 30*i + a, then the first
delta for the /30 column is
(30*i+a) * (30*i+b) / 30 - (30*i+a) * (30*i+a) / 30
which is
(30*i*i + i*b + i*a + a*b/30) -
(30*i*i + i*a + i*a + a*a/30)
which is
i*(b-a) + (a*b/30 - a*a/30)
when we get to 8 elements later, the /30 column is
(30*i+a) * (30*(i+1)+b) / 30 - (30*i+a) * (30*(i+1)+a) / 30
which is
(30*i*i + i + i*b + i*a + a + a*b/30) -
(30*i*i + i + i*a + i*a + a + a*a/30)
which is again
i*(b-a) + (a*b/30 - a*a/30)
that is, the same delta as the one eight steps before. This
formula shows us how to calculate the deltas, with the help of one
table for the (a*b'/30 - a*b/30) values, and one table for the
(b'-b) values, which need to be multiplied by the appropriate i
value (namely, p/30).
So I think the delta calculation can be written
i*delta_b[x] + delta_bb[x]
where x goes in a circular loop over 8 values [ x0 ... x7 ].
All the above assuming I haven't made an algebraic mistakes, which
certainly is not guaranteed.
Besides showing how to calculate the deltas, the last formula shows
how to reduce the amount of memory needed to save the state of
processing a given prime. Namely, all that is needed is the
running value of what byte holds the product bit, the index i of
the original prime, and the value for x (three bits) along with
something that shows what range it cycles through (another three
bits). It should be easy to hold all that state in two 64-bit
words.
If instead we tried to store a table with 8 entries for each prime
that is still being processed, that would use a lot more memory and
eat into our L1 cache memory budget. At some point we have to
wonder if it would be cheaper to do more recalculation but need
less memory to keep track of where we are. Not a simple problem.
Forgive me if the above seems a bit scattered. I started with a
rough roadmap and just dived in and began writing. Feel free to
ask about anything that isn't clear or doesn't make sense.