Michael S <already5chosen@yahoo.com> writes:

> On Thu, 30 May 2024 19:27:48 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>
>> Bonita Montero <Bonita.Montero@gmail.com> writes:
>>
>>> Am 26.05.2024 um 19:20 schrieb Tim Rentsch:
>>>
>>>> I say the output quality is poor because I have run tests that
>>>> show the poor output quality.
>>>
>>> If you chose a prime whose double is beyond 64 bit there's an
>>> equal distribution among the 64 bit modulos.
>>
>> I've done that with a prime of my own choosing and also with
>> 18446744073709551557, the value you suggested.
>>
>>> Show me your code.
>>
>> Oh get real.  It's not my job to make up for your laziness.
>
> So, what were your conclusions?
> Ignoring the speed of computation, would something like
> cryptographic hash scaled to bucket size be a best hash for
> this type of application?  Or some sort of less thorough
> grinding of the bits is better?

Factors that matter:
will hashing be done using rehashing or aggregates (lists
or trees, etc) grouped under the first probe location?
(or both?)
can keys be compared for ordering or just for equality?
how expensive is it to test for key equality?
(a) when the keys are equal
(b) when the keys are not equal
32-bit hash values suffice for (a) re-hashing schemes with
tables up to perhaps 2**30 entries, or (b) schemes that
use direct hashing and bucket aggregation with tables up
to 2**32 slots. Anything bigger should use 64-bit hash
values.
hash functions should always produce a full-size hash value
(either 32 bits or 64 bits). Fitting the hash value to
the size of the table is the responsibility of the hash
table management code, not the hash function.
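
To illustrate that last point, here is a minimal sketch (purely
illustrative, not code from my tests) of the table-management side
doing the fitting, for a 64-bit hash value:

    #include <stdint.h>
    #include <stddef.h>

    /* power-of-two table: just keep the low-order bits */
    static inline size_t
    index_pow2 (uint64_t hash, size_t table_size)   /* table_size == 2**k */
    {
        return (size_t) (hash & (table_size - 1));
    }

    /* arbitrary table size: multiply-shift reduction (Lemire's
       "fastrange"), which uses the high-order bits and avoids a
       division; relies on the unsigned __int128 extension found
       in gcc and clang */
    static inline size_t
    index_any (uint64_t hash, size_t table_size)
    {
        return (size_t) (((unsigned __int128) hash * table_size) >> 64);
    }
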
The goal:
An ideal hash function has the property that changing any
single bit of the input key has a 50% chance of flipping
any output bit, with no correlation between output bit
m and output bit n (m != n) for changing a given input
bit, and no correlation between changes to output bit n
for changing input bit i and changing input bit j (i != j).
Evaluating hash functions:
One way to evaluate a hash function is to measure changes
in output bits as a function of changing various input
bits, either exhaustively or statistically, looking for an
ideal distribution as described above. Usually doing this
isn't very helpful, because most hash functions don't come
close to the ideal, and it's hard to interpret results that
are not close to the ideal, even though those hashes may be
perfectly fine in practice.
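
Still, the measurement itself is easy to set up.  A rough sketch
(hash64() is just a stand-in name for whatever candidate is being
tested): flip each input bit in turn over a sample of keys and tally
how often each output bit flips; an ideal hash tallies close to 50%
everywhere.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    extern uint64_t hash64 (uint64_t key);   /* candidate under test */

    void
    avalanche_report (unsigned trials)
    {
        double flip[64][64] = {{0}};   /* flip[i][j]: how often output
                                          bit j flipped when input bit
                                          i was flipped */
        for (unsigned t = 0;  t < trials;  t++) {
            /* crude key source; keys drawn from the intended
               workload would give a more meaningful picture */
            uint64_t key = ((uint64_t) rand () << 32) ^ (uint64_t) rand ();
            uint64_t h = hash64 (key);
            for (int i = 0;  i < 64;  i++) {
                uint64_t d = h ^ hash64 (key ^ ((uint64_t) 1 << i));
                for (int j = 0;  j < 64;  j++)
                    flip[i][j] += (d >> j) & 1;
            }
        }
        /* report only the bit pairs that land far from 50% */
        for (int i = 0;  i < 64;  i++)
            for (int j = 0;  j < 64;  j++) {
                double p = flip[i][j] / trials;
                if (p < 0.45 || p > 0.55)
                    printf ("input %2d -> output %2d: %.3f\n", i, j, p);
            }
    }
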
So basically we are left with comparing a candidate hash
function, performance-wise, against reference points that
are known to give good results.  Three obvious
choices for those: a slow but high quality hash such as a
cryptographic hash; using a good random number generator
to produce keys where the hash value is equal to the key
value; comparing against theoretical ideals. Looking at
each in turn:
High-quality-but-slow-hash: it isn't hard to make a hash
function that approximates an ideal hash, as long as we
are willing to have it run like a turtle. Or simply use
one of the many known cryptographic hashes and fold the
output down to a reasonably sized hash value (most of
these cryptographic hash functions produce large hash
values, 128 bits and up, which isn't a great fit for
ordinary hash table use). In my recent testing I simply
rolled my own high quality hash, which ran about 10 times
slower than the slowest of the other hashes I looked at.
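
Incidentally the folding step is nothing fancy.  Assuming the
digest is handed to us as a byte array (from whatever cryptographic
hash was used), a sketch like this does the job:

    #include <stdint.h>
    #include <stddef.h>

    /* fold an n-byte digest (for example 32 bytes from SHA-256)
       down to 64 bits by XORing successive bytes into rotating
       positions of a 64-bit accumulator */
    uint64_t
    fold_digest (const unsigned char *digest, size_t n)
    {
        uint64_t h = 0;
        for (size_t i = 0;  i < n;  i++)
            h ^= (uint64_t) digest[i] << (8 * (i % 8));
        return h;
    }
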
Using a good RNG to produce keys the same as the hash
value: I expect this method to be self-explanatory.
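
(To spell it out anyway: generate the keys with a decent 64-bit
generator and let the hash be the identity function, so by
construction the hash values are as good as the generator.  A
splitmix64-style generator, sketched below, is more than adequate
for this purpose.)

    #include <stdint.h>

    /* splitmix64-style generator: each call produces the next key;
       using the key itself as its hash value gives an effectively
       ideal-hash baseline */
    uint64_t
    next_key (uint64_t *state)
    {
        uint64_t z = (*state += 0x9E3779B97F4A7C15u);
        z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9u;
        z = (z ^ (z >> 27)) * 0x94D049BB133111EBu;
        return z ^ (z >> 31);
    }
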
Comparing against theoretical ideals: it isn't hard to
do a calculation that computes a statistical average for
the two kinds of hashing methods. More specifically,
(a) for re-hashing, compute the average number of
probes taken for a hash table with k out of n
slots already filled, and use that to compute
the average number of probes needed to fill the
hash table up to, say, 85%
(b) for direct hashing, compute the average number
of slots occupied after inserting some number
of values representing a fixed fraction of the
table size. For example, if we have a table
with 100,000 slots, if we insert 100,000 values
then on the average about 63% of the slots
should be occupied (very close to 1 - 1/e).
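
Both reference numbers fall out of a few lines of arithmetic.
For re-hashing with a good hash, an insert into a table at load
factor x takes about 1/(1-x) probes, so averaging 1/(1-x) over a
fill to 85% gives ln(1/0.15)/0.85, roughly 2.2 probes per insert.
For direct hashing, inserting m values into n slots leaves about
n * (1 - (1 - 1/n)**m) slots occupied, which for m == n is about
63%.  A few-line program (a sketch, not the one I used) produces
the comparison values:

    #include <stdio.h>
    #include <math.h>

    int
    main (void)
    {
        double alpha = 0.85;             /* target load, re-hashing case */
        double avg_probes = log (1 / (1 - alpha)) / alpha;

        double n = 100000, m = 100000;   /* slots and inserts, direct case */
        double occupied = n * (1 - pow (1 - 1/n, m));

        printf ("average probes filling to %.0f%%: %.3f\n",
                100 * alpha, avg_probes);
        printf ("slots occupied after %.0f inserts: %.0f (%.1f%%)\n",
                m, occupied, 100 * occupied / n);
        return 0;
    }
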
Use one or more of the above comparison values (they
should all be very close) to compare against the
performance of the candidate hash function for each
of the possible workloads. My workload set included:
(a) results of malloc() with fixed sized blocks,
with a size going from 1 to some modest value
(50 or 100)
(b) results of malloc() with randomly chosen sizes
based on some distribution
(c) sequential addresses in an array, varying the
element size between 1 and some modest value
(maybe 50)
(d) results of malloc() with a deterministic set
of sizes. I tried various ones of these, such
as size(i) == i/17 + 1 (filling a table of
65536 entries to varying percentages full).
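
For what it's worth the workloads are easy to reproduce.  A rough
sketch of the key generators (only illustrative; the programs I ran
differed in their details, and the frees are left out):

    #include <stdlib.h>

    enum { N = 100000 };            /* number of keys per workload */
    static void *key[N];            /* the pointers to be hashed   */

    void gen_fixed (size_t size)                          /* workload (a) */
    {
        for (int i = 0;  i < N;  i++)  key[i] = malloc (size);
    }

    void gen_random (size_t max_size)                     /* workload (b) */
    {
        for (int i = 0;  i < N;  i++)  key[i] = malloc (1 + rand () % max_size);
    }

    void gen_sequential (char *base, size_t elem_size)    /* workload (c) */
    {
        for (int i = 0;  i < N;  i++)  key[i] = base + i * elem_size;
    }

    void gen_deterministic (void)                         /* workload (d) */
    {
        for (int i = 0;  i < N;  i++)  key[i] = malloc (i/17 + 1);
    }
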
What we would like to see:
for every workload tested, results within 10 or 20%
of the comparison metric used
What we don't want to see (this is important):
any result far away from the comparison metric on any
workload. Importantly this includes results that are
significantly _better_ than the comparison metric.
The reason for this is that doing a lot better on one
test inevitably means doing worse on a different test.
The same is true of random number generators: the
best ones look "averagely random" all the time, never
either "sub random" or "super random".
Of course we always want high quality, but the standards
are higher for (a) direct hashing, and/or (b) cases where
comparing keys is expensive. Note that if comparing keys
is expensive we might choose to store the full hash value
with each table entry, to make "trivial reject" on key
comparison be faster.
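
A sketch of what I mean by that last remark (the type names are
made up for the example):

    #include <stdint.h>
    #include <string.h>

    struct entry {
        uint64_t    hash;    /* full hash value, stored with the entry */
        const char *key;     /* a key that is expensive to compare     */
        void       *value;
    };

    /* trivial reject: compare the stored hash values first, and do
       the expensive key comparison only when the hashes agree */
    static int
    entry_matches (const struct entry *e, uint64_t hash, const char *key)
    {
        return e->hash == hash  &&  strcmp (e->key, key) == 0;
    }
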
Probably much of the above was obvious. I make no apologies for
that. Also it may seem disjointed or unorganized. If so then
sorry about that chief, it's a big topic and there's a lot of
ground to cover, and I tried to hit the most important
highlights.