Re: Meta: a usenet server just for sci.math

Liste des GroupesRevenir à cp threads 
Sujet : Re: Meta: a usenet server just for sci.math
De : ross.a.finlayson (at) *nospam* gmail.com (Ross Finlayson)
Groupes : sci.math news.software.nntp
Date : 28. Mar 2024, 05:05:44
Autres entêtes
Message-ID : <-bOdnWSSIMUKcZn7nZ2dnZfqnPednZ2d@giganews.com>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
User-Agent : Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0
On 03/26/2024 06:04 PM, Ross Finlayson wrote:
arithmetic hash searches
>
take a hashcode, split it up
>
invert each arithmetically, find intersection in 64 bits
>
fill in those
>
detect misses when the bits don't intersect the search
>
when all hits, then "refine", next double range,
>
compose those naturally by union
>
when definite misses excluded then go find matching partition
>
arithmetic partition hash
>
So, the idea is, that, each message ID, has applied a uniform
hash, then that it fills a range, of so many bits.
>
Then, its hash is split into smaller chunks the same 1/2/3/4
of the paths, then those are considered a fixed-point fraction,
of the bits set of the word width, plus one.
>
Then, sort of pyramidally, is that in increasing words, or doubling,
is that a bunch of those together, mark those words,
uniformly in the range.
>
For example 0b00001111, would mark 0b00001000, then
0b0000000010000000, and so on, for detecting whether
the hash code's integer value, is in the range 15/16 - 16/16.
>
The idea is that the ranges this way compose with binary OR,
then that a given integer, then that the integer, can be
detected to be out of the range, if its bit is zero, and then
otherwise that it may or may not be in the range.
>
0b00001111 number N1
0b00001000 range R1
0b00000111 number N2
0b00000100 range R2
>
0b00001100 union range UR = R1 | R2 | ....
>
>
missing(N) {
return (UR & RN == 0);
}
>
>
This sort of helps where, in a usual hash map, determining
that an item doesn't exist, is worst case, while the usual
finding the item that exists is log 2, then that usually its value
is associated with that, besides.
>
Then, when there are lots of partitions, and they're about
uniform, it's expected the message ID to be found in only
one of the partitions, is that the partitions can be organized
according to their axes of partitions, composing the ranges
together, then that search walks down those, until it's either
a definite miss, or an ambiguous hit, then to search among
those.
>
It seems then for each partition (group x date), then those
can be composed together (group x month, group x year,
groups x year, all), so that looking to find the group x date
where a message ID is, results that it's a constant-time
operation to check each of those, and the data structure
is not very large, with regards to computing the integers'
offset in each larger range, either giving up when it's
an unambiguous miss or fully searching when it's an
ambiguous hit.
>
This is where, the binary-tree that searches in log 2 n,
worst-case, where it's balanced and uniform, though
it's not to be excluded that a usual hashmap implementation
is linear in hash collisions, is for excluding partitions,
in about constant time and space given that it's just a
function of the number of partitions and the eventual
size of the pyramidal range, that instead of having a
binary tree with space n^2, the front of it has size L r
for L the levels of the partition pyramid and r the size
of the range stamp.
>
Then, searching in the partitions, seems it essentially
results, that there's an ordering of the message IDs,
so there's the "message IDs" file, either fixed-length-records
or with an index file with fixed-length-records or otherwise
for reading out the groups' messages, then another one
with the message ID's sorted, figuring there's a natural
enough binary search of those with value identity, or bsearch
after qsort, as it were.
>
So, the idea is that there's a big grid of group X date archives,
each one of those a zip file, with being sort of contrived the
zip files, so that each entry is self-contained, and it sort of
results that concatenating them results another. So
anyways, the idea then is for each of those, for each of
their message IDs, to compute its four integers, W_i,
then allocate a range, and zero it, then saturate each
bit, in each range for each integer. So, that's like, say,
for fitting the range into 4K, for each partition, with
there being 2^8 of those in a megabyte, or that many
partitions (512), or about a megabyte in space for each
partition, but really where these are just variables,
because it's opportunistic, and the ranges can start
with just 32 or 64 bits figuring that most partitions
are sparse, also, in this case, though usually it would
be expected they are half-full.
>
There are as many of these ranges as the hash is split
into numbers, is the idea.
>
Then the idea is that these ranges are pyramidal in the
sense, that when doing lookup for the ID, is starting
from the top of the pyramid, projecting the hash number
into the range bit string, with one bit for each sub-range,
so it's branchless, and'ing the number bits and the partition
range together, and if any of the hash splits isn't in the
range, a branch, dropping the partition pyramid, else,
descending into the partition pyramid.
>
(Code without branches can go a lot faster than
code with lots of branches, if/then.)
>
At each level of the pyramid, it's figured that only one
of the partitions will not be excluded, except for hash
collisions, then if it's a base level to commence bsearch,
else to drop the other partition pyramids, and continue
with the reduced set of ranges in RAM, and the projected
bits of the ID's hash integer.
>
The ranges don't even really have to be constant if it's
so that there's a limit so they're under a constant, then
according to uniformity they only have so many, eg,
just projecting out their 1's, so the partition pyramid
digging sort of always finds one or more partitions
with possible matches, those being hash collisions or
messages duplicated across groups, and mostly finds
those with exclusions, so that it results reducing, for
example that empty groups are dropped right off
though not being skipped, while full groups then
get into needing more than constant space and
constant time to search.
>
Of course if all the partitions miss then it's
also a fast exit that none have the ID.
>
So, this, "partition pyramid hash filter", with basically,
"constant and configurable space and time", basically
has that because Message Id's will only exist in one or
a few partitions, and for a single group and not across
about all groups, exactly one, and the hash is uniform, so
that hash collisions are low, and the partitions aren't
overfilled, so that hash collisions are low, then it sort
of results all the un-used partitions at rest, don't fill
up in n^2 space the log 2 n hash-map search. Then,
they could, if there was spare space, and it made sense
that in the write-once-read-many world it was somehow
many instead of never, a usual case, or, just using a
list of sorted message Id's in the partition and bsearch,
this can map the file without loading its contents in
space, except as ephemerally, or the usual disk controller's
mmap space, or "ready-time" and "ephemeral-space".
>
In this sort of way there's no resident RAM for the partitions
except each one with a fixed-size arithmetic hash stamp,
while lookups have a fixed or constant cost, plus then
also a much smaller usual log 2 time / n^2 space trade-off,
while memory-mapping active files automatically caches.
>
>
So, the idea is to combine the BFF backing file format
and LFF library file format ideas, with that the group x date
partitions make the for archive and active partitions,
then to have constant-time/constant-space partition
pyramid arithmetic hash range for lookup, then
ready-time/ephemeral-space lookup in partitions,
then that the maintenance of the pyramid tree,
happens with dropping partitions, while just
accumulating with adding partitions.
>
Yeah, I know that a usual idea is just to make a hash map
after an associative array with log 2 n lookup in n^2 space,
that maintenance is in adding and removing items,
here the idea is to have partitions above items,
and sort of naturally to result "on startup, find
the current partitions, compose their partition pyramid,
then run usually constant-time/constant-space in that
then ready-time/ephemeral-space under that,
maintenance free", then that as active partitions
being written roll over to archive partitions being
finished, then they just get added to the pyramid
and their ranges or'ed up into the pyramid.
>
Hmm... 32K or 2^15 groups, 16K or 2^14 days, or
about 40 years of Usenet in partitions, 2^29,
about 2^8 per megabyte or about 2^20 or one
gigabyte RAM, or, just a file, then memory-mapping
the partition pyramid file, figuring again that
most partitions are not resident in RAM,
this seems a sort of good simple idea to
implement lookup by Message ID over 2^30 many.
>
I mean if "text Usenet for all time is about a billion messages",
it seems around that size.
>
>
So, trying to figure out if this "arithmetic hash range
pyramidal partition" data structure is actually sort of
reasonable, gets into that it involves finding a balance
in what's otherwise a very well-understood trade-off,
in terms of the cost of a lookup, over time, and then
especially as whether an algorithm is "scale-able",
that even a slightly lesser algorithm might be better
if it results "scale-able", especially if it breaks down
to a very, very minimal set of resources, in time,
and in various organizations of space, or distance,
which everybody knows as CPU, RAM, and DISK,
in terms of time, those of lookups per second,
and particularly where parallelizable as with
regards to both linear speed-up and also immutable
data structures, or, clustering.  ("Scale.")
Then it's probably so that the ranges are pretty small,
because they double, and whether it's best just to
have an overall single range, or, refinements of it,
according to a "factor", a "factor" that represents
how likely it is that hashes don't collide in the range,
or that they do.
This is a different way of looking at hash collisions,
besides that two objects have the same hash,
just that they're in the same partition of the range
their integer value, for fixed-length uniform hashes.
I.e., a hash collision proper would always be a
redundant or order-dependent dig-up, of a sort,
where the idea is that the lookup first results
searching the pyramid plan for possibles, then
digging up each of those and checking for match.
The idea that group x date sort of has that those
are about on the same order is a thing, then about
the idea that "category" and "year" are similarly
about so,
Big8 x year
group x date
it's very contrived to have those be on the same
order, in terms of otherwise partitioning, or about
what it results that "partitions are organized so that
their partitions are tuples and the tuples are about
on the same order, so it goes, thus that uniformity
of hashes, results being equi-distributed in those,
so that it results the factor is good and that arithmetic
hash ranges filter out most of the partitions, and,
especially that there aren't many false-positive dig-up
partitions.
It's sort of contrived, but then it does sort of make
it so that also other search concerns like "only these
groups or only these years anyways", naturally get
dropped out at the partition layer, and, right in the
front of the lookup algorithm.
It's pretty much expected though that there would
be non-zero false-positive dig-ups, where here a dig-up
is that the arithmetic hash range matched, but it's
actually a different Message ID's hash in the range,
and not the lookup value(s).
Right, so just re-capping here a bit, the idea is that
there are groups, and dates, and for each is a zip file,
which is a collection of files in a file-system entry file
with about random access on the zip file each entry,
and compressed, and the entries include Messages,
by their Message ID's, then that the entries are
maybe in sub-directories, that reflect components
of the Message ID's hash, where a hash, is a fixed-length
value, like 64 bytes or 128 bytes, or a power of two
and usually an even power of two thus a multiple of four,
thus that a 64 byte hash has 2^64 * 2^8 many possible
values, then that a range, of length R bits, has R many
partitions, in terms of the hash size and the range size,
whether the factor is low enough, that most partitions
will naturally be absent most ranges, because hashes
can only be computed from Message ID's, not by their
partitions or other information like the group or date.
So, if there are 2^30 or a billion messages, then a
32 bit hash, would have a fair expectation that
unused values would be not dense, then for
what gets into "birthday problem" or otherwise
how "Dirichlet principle" makes for how often
are hash collisions, for how often are range collisions,
either making redundant dig-ups, in the way this
sort of algorithm services look-ups.
The 32 bits is quite a bit less than 64 * 8, though,
about whether it would also result, that, splitting
that into subdirectories, results different organizations
here about "tuned to Usenet-scale and organization",
vis-a-vis, "everybody's email" or something like that.
That said, it shouldn't just fall apart if the size or
count blows up, though it might be expect then
a various sorts of partitioning, to keep the partition
tuple orders square, or on the same orders.
The md5 is widely available, "md5sum", it's 128 bits,
its output is hexadecimal characters, 32-many.
https://en.wikipedia.org/wiki/MD5
https://en.wikipedia.org/wiki/Partition_(database)
https://en.wikipedia.org/wiki/Hash_function#Uniformity
Otherwise the only goal of the hash is to be uniform,
and also to have "avalanche criterion", so that near Message-Id's
will still be expected to have different hashes, as it's not
necessarily expected that they're the same group and
date, though that would be a thing, yet Message ID's
should be considered opaque and not seated together.
Then MD5 is about the most usual hash utility laying
around, if not SHA-1, or SHA-256.  Hmm..., in the
interests of digital preservation is "the tools for
any algorithms should also be around forever",
one of those things.
So anyways, then each group x date has its Message ID's,
each of those has its hash, each of those fits in a range,
indicating one bit in the range where it is, then those are
OR'd together to result a bit-mask of the range, then
that a lookup can check its hash's bit against the range,
and dig-up the partition if it's in, or, skip the partition
if it's not, with the idea that the range is big enough
and the resulting group x date is small enough, that
the "pyramidal partition", is mostly sparse, at the lower
levels, that it's mostly "look-arounds" until finally the
"dig-ups", in the leaf nodes of the pyramidal partitions.
I.e., the dig-ups will eventually include spurious or
redundant false-positives, that the algorithm will
access the leaf partitions at uniform random.
The "pyramidal" then also get into both the empties,
like rec.calm with zero posts ten years running,
or alt.spew which any given day exceeds zip files
or results a lot of "zip format, but the variously
packaged, not-recompressed binaries", the various
other use cases than mostly at-rest and never-read
archival purposes.  The idea of the "arithmetic hash
range pyramidal partition" is that mostly the
leaf partitions are quite small and sparse, and
mostly the leveling of the pyramid into year/month/date
and big8/middle/group, as it were, winnows those
down in what's a constant-rate constant-space scan
on the immutable data structure of the partition pyramid.
Yeah, I know, "numbers", here though the idea is
that about 30K groups at around 18K days = 50 years
makes about 30 * 20 * million or less than a billion
files the zip files, which would all fit on a volume
that supports up to four billion-many files, or an
object-store, then with regards to that most of
those would be quite small or even empty,
then with regards to "building the pyramid",
the levels big8/middle/group X year/month/date,
the data structure of the hashes marking the ranges,
then those themselves resulting a file, which are
basically the entire contents of allocated RAM,
or for that matter a memory-mapped file, with
the idea that everything else is ephemeral RAM.

Date Sujet#  Auteur
3 Jan 25 o 

Haut de la page

Les messages affichés proviennent d'usenet.

NewsPortal