Michael S <already5chosen@yahoo.com> writes:

> On Sun, 26 May 2024 10:20:55 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>
> [..comments on a hash function that simply multiplies the key
> by a large prime (18446744073709551557) to give the hash
> value, and noting that tests have shown poor performance..]
>
> 18446744073709551557 is indeed very bad (too close to 2**64).
> But I tried another prime and at least in my simple tests it
> performs rather well.
> Maybe my tests are too simplistic; I don't pretend to understand
> much about it.
>
> [lots of C code]
>
> Compiled as:
> gcc -s -Wall -O2 -march=native ptr_hash.c hash_bonita.c hash_aes4.c

I couldn't get the AES-based hash function to compile. There
was a complaint about not being able to inline an 'always-inline'
function. I played around with it for a while but didn't get
anywhere.
I did get your own hash function out and put it into my little
test rig. There may be summary comments below.

> Run as:
> ./a 100000 75000 80 1000
>
> Result:
> ref: 75000 keys, 100000 slots, 52799 occupied 17431 collisions. [...]
> uut: 75000 keys, 100000 slots, 53201 occupied 17139 collisions. [...]

Both of these occupancy rates are close to the expected theoretical
average, which I calculated to be 52764. If I had been more
ambitious I might have done some statistical sampling to get some
idea of the standard deviation, but I wasn't that energetic
today. :)
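
For anyone following along, that 52764 presumably comes from the
standard balls-in-bins calculation: with k keys hashed uniformly at
random into n slots, each slot is empty with probability (1 - 1/n)^k,
so the expected number of occupied slots is n*(1 - (1 - 1/n)^k). A
minimal C sketch of the arithmetic (mine, not part of any posted
code):

#include <math.h>
#include <stdio.h>

/* Expected number of occupied slots when k keys are hashed uniformly
   at random into n slots: each slot stays empty with probability
   (1 - 1/n)^k, so E[occupied] = n * (1 - (1 - 1/n)^k).              */
static double expected_occupied(double n, double k)
{
    return n * (1.0 - pow(1.0 - 1.0 / n, k));
}

int main(void)
{
    /* The quoted run above: 75000 keys into 100000 slots. */
    printf("%.0f\n", expected_occupied(100000.0, 75000.0));  /* 52764 */
    return 0;
}
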
If we're trying to write a general hash function, it should work
well across a wide range of input key sets, and also for different
kinds of hash table management. Let's look at the output side
first. There are two independent axes for hash tables: table
size, and direct mapping versus using rehashing. For table size,
the two prominent choices are some prime in the appropriate range,
or else a power of two. Clearly any size could be chosen, but
primes have the nice property that they divide up the space of hash
values very evenly, whereas powers of two allow a cheaper way of
reducing a hash value to the range of table indices (anding versus a
modulo operation). A good hash function should work well with both.
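
To make that cost difference concrete, here is a minimal sketch of
the two reduction steps (the function names and example sizes are my
own, not taken from any posted code):

#include <stddef.h>
#include <stdint.h>

/* Prime-sized table: reduce the hash with a modulo, which uses all of
   the hash bits but costs an integer division.                       */
static inline size_t slot_mod_prime(uint64_t hash, size_t prime_size)
{
    return (size_t)(hash % prime_size);       /* e.g. prime_size 99991 */
}

/* Power-of-two table: reduce by masking, which is cheap but keeps only
   the low-order bits, so weak low bits in the hash show up at once.   */
static inline size_t slot_mask_pow2(uint64_t hash, size_t pow2_size)
{
    return (size_t)(hash & (pow2_size - 1));  /* e.g. pow2_size 65536  */
}
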
The choice of direct mapping versus rehashing depends on expected
occupancy rates (which I am used to calling "load factor"). For a
load factor of 75 to 80 percent or less, generally rehashing is
better. Direct mapping has the advantage that it can handle load
factors above 100%, at a cost of dealing with whatever subsidiary
data structure (usually linked lists) is used when two or more keys
map to the same index in the hash table. (I expect you know all
that, I wrote it mostly just to clarify my own thinking.)
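
For concreteness, a bare-bones sketch of the two organizations as I'm
using the terms here (declarations only, with names of my own
invention):

#include <stddef.h>
#include <stdint.h>

/* "Direct mapping": every slot heads a chain, so the table keeps
   working, with longer chains, at load factors above 100%.          */
struct chain_node  { uint64_t key; struct chain_node *next; };
struct chained_tab { size_t nslots; struct chain_node **slots; };

/* "Rehashing" (open addressing): keys live in the slot array itself,
   so the load factor must stay below 100%, and in practice well
   below it, to keep probe sequences short.                          */
struct open_tab    { size_t nslots; uint64_t *slots; /* 0 == empty */ };
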
For the recent testing I did, I chose a power of two (65536) for the
table size, and rehashing rather than direct mapping. Both of these
choices make it easier to evaluate how well hash functions perform.
Having a power-of-two table size reveals weaknesses in the hash
function more readily than taking the hash value modulo a prime;
using rehashing allows an easy computation of how many probes are
needed on average to do an insertion, which is an easier metric to
compare than occupancy rates (which typically need some statistical
work before differences between them can be interpreted).
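
As an illustration of that metric (not the actual test rig), an
insertion routine for a power-of-two table can simply return how many
probes it needed; summing those over all insertions and dividing by
the number of keys gives the average. Linear probing is used below
purely for brevity; any rehashing scheme yields the same kind of
count.

#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 65536u   /* power of two, as in the test described */

/* Insert 'key' (assumed nonzero and not already present) and return
   the number of probes needed to find an empty slot; slots[] uses 0
   to mean "empty", and the table is assumed not to be full.          */
static unsigned insert_count_probes(uint64_t slots[TABLE_SIZE],
                                    uint64_t key, uint64_t hash)
{
    unsigned probes = 1;
    size_t i = (size_t)(hash & (TABLE_SIZE - 1));
    while (slots[i] != 0) {                  /* occupied: rehash      */
        i = (i + 1) & (TABLE_SIZE - 1);      /* linear probe step     */
        probes++;
    }
    slots[i] = key;
    return probes;
}
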
Let me emphasize that I don't mean to advocate any of the different
hash table models over the others. The key point is that a good
hash function should work with all of them.

On the input side, it's important to have a wide range of input
key sets, with different sorts of distributions. The input key
sets might include: linear addresses out of a single array, that
is, a+i*m with i in the range [0..k], for different values of m
(to be thorough, all m in the range [1..64], all multiples of 2
from 64 to 128, all multiples of 4 from 128 to 256, all multiples
of 8 from 256 to 512, and so forth up to all multiples of 64 from
2048 to 4096); results of malloc() using fixed sizes (all of the
m values from the last example); results of malloc() using a
range of sizes, for several kinds of random distributions;
results of malloc() using linearly increasing sizes; and other
kinds of variations.

In my testing I used several fixed sizes for malloc(), all less
than 60; rand()%997 + 43 for malloc() sizes; malloc( i/7 + 1 )
for each key index i; malloc( i/17 + 1 ) for each key index i;
linear addressing in an array, with various multipliers (ad hoc
rather than systematic choices); and quadratic addressing in an
array (which means array + i*i*m, for index i and multiplier
choice m) -- here again the choices for m were ad hoc rather
than systematic.
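
To make those descriptions concrete, here is a rough sketch of how a
few of the key-set generators might look. The function names and
parameters are my own choices, not the actual test code, and since
only the numeric key values matter to the hash being tested, the
"addresses" are formed arithmetically rather than by indexing a real
array of that size.

#include <stdint.h>
#include <stdlib.h>

/* Linear addressing in an array: key[i] = base + i*m, i in [0..n-1]. */
static void keys_linear(uintptr_t base, size_t m, size_t n,
                        uintptr_t out[])
{
    for (size_t i = 0; i < n; i++)
        out[i] = base + (uintptr_t)i * m;
}

/* Quadratic addressing in an array: key[i] = base + i*i*m. */
static void keys_quadratic(uintptr_t base, size_t m, size_t n,
                           uintptr_t out[])
{
    for (size_t i = 0; i < n; i++)
        out[i] = base + (uintptr_t)i * i * m;
}

/* Keys from malloc() with a fixed small size. */
static void keys_malloc_fixed(size_t size, size_t n, uintptr_t out[])
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uintptr_t)malloc(size);
}

/* Keys from malloc() with random sizes (rand()%997 + 43, as above). */
static void keys_malloc_random(size_t n, uintptr_t out[])
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uintptr_t)malloc((size_t)(rand() % 997 + 43));
}

/* Keys from malloc() with slowly growing sizes (i/7 + 1, as above). */
static void keys_malloc_growing(size_t n, uintptr_t out[])
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uintptr_t)malloc(i / 7 + 1);
}
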
An interesting result with the "in array" choices is that the numbers
sometimes changed from run to run, presumably because the array was
placed at a different starting address each time. Sometimes the
differences were surprisingly large.

Like I said before, a good hash function should produce average
or near-average values for all of these scenarios. It might be
nice to see some above-average results under some workloads, but
they don't make up for well below-average results under other
workloads. The situation is pretty much the opposite of what
happens with quicksort: the worst case behavior of quicksort is
quadratic, but we know that in practice that almost never
happens, so quicksort is used because its typical or average
performance is better than that of other choices. With hashing,
better-than-average performance is rare, and doesn't give much
benefit, but worse-than-average performance is more likely, and
can be disastrous. Obviously there can be other factors that
might motivate other choices in some situations (e.g., perfect
hashing), but for a general-use hash function the first design
criterion should be close-to-average behavior under all scenarios
tested.