Subject : Re: Good hash for pointers
From : already5chosen (at) *nospam* yahoo.com (Michael S)
Groups : comp.lang.c
Date : 06 Jun 2024, 11:35:34
Organisation : A noiseless patient Spider
Message-ID : <20240606133534.00005c5d@yahoo.com>
References : 1 2 3 4 5 6 7 8 9 10 11 12 13
User-Agent : Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
On Wed, 05 Jun 2024 08:58:46 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
> If we're trying to write a general hash function, it should work
> well across a wide range of input key sets, and also for different
> kinds of hash table management. Let's look at the output side
> first. There are two independent axes for hash tables: table
> size, and direct mapping versus using rehashing. For table size,
> the two prominent choices are some prime in the appropriate range,
> or else a power of two. Clearly any size could be chosen, but
> primes have the nice property that they divide up the space of hash
> values very evenly, whereas powers of two allow a cheaper way of
> reducing a hash value to the range of table indices (ANDing versus
> a modulo operation). A good hash function should work well with
> both.
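
(To make the two reduction methods concrete, here is a minimal
sketch; the function names are mine, not Tim's:)

    #include <stddef.h>
    #include <stdint.h>

    /* power-of-two table: keep the low bits with a mask */
    size_t reduce_pow2(uint64_t h, size_t tsize)
    {
        return (size_t)(h & (tsize - 1));  /* tsize must be 2^k */
    }

    /* prime-sized table: divide and keep the remainder */
    size_t reduce_prime(uint64_t h, size_t tsize)
    {
        return (size_t)(h % tsize);
    }
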
> The choice of direct mapping versus rehashing depends on expected
> occupancy rates (which I am used to calling "load factor"). For a
> load factor of 75 to 80 percent or less, generally rehashing is
> better. Direct mapping has the advantage that it can handle load
> factors above 100%, at a cost of dealing with whatever subsidiary
> data structure (usually linked lists) is used when two or more keys
> map to the same index in the hash table. (I expect you know all
> that, I wrote it mostly just to clarify my own thinking.)
>
I never had much interest in the implementation of hash methods; I
typically just took whatever the language or library provided and
used it.
All my knowledge of the internals is 50-year-old news (TAoCP, and
no, I did not read it 50 years ago, but I did read it for the first
time slightly less than 40 years ago, and I am not sure I have
re-read this particular topic since then).
So I am not sure that I understood everything above.
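If I read "direct mapping" right, it is what I know as chaining:
every table slot heads a linked list of the entries that hash to
it, which is how the load factor can go above 100%. Roughly like
this (my sketch, my names, not necessarily your layout):

    struct entry {
        const void   *key;
        void         *value;
        struct entry *next;     /* chain of colliding entries */
    };

    struct table {
        struct entry **slots;   /* nslots chain heads, NULL if empty */
        size_t         nslots;
    };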
In particular, w.r.t. rehashing, IIRC Knuth came to the conclusion
that a simple circular increment of the index is superior to more
elaborate methods. I don't know whether that conclusion was revised
in the years since, and I am not even sure that I recollect it
correctly.
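By "circular increment" I mean plain linear probing, something like
the sketch below; the names, the NULL-means-empty convention, and
the assumption that the table is never completely full are mine:

    size_t probe_linear(const void **keys, size_t nslots,
                        size_t start, const void *key)
    {
        size_t i = start;
        while (keys[i] != NULL && keys[i] != key)
            i = (i + 1) % nslots;   /* step circularly to next slot */
        return i;   /* slot holding key, or the first empty slot */
    }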
Despite (or due to?) my relative ignorance of the topic, I'd dare
to disagree with a couple of your points:
1. I don't think that a hash function should be evaluated in
isolation from the rest of the algorithm. IMHO, the two should be
co-developed. There is nothing wrong with a hash function that
works well with one particular "back end" and poorly with all
others, as long as its limitations are clearly stated.
2. I don't like the "modulo" method of reducing a hash value to the
table range. I don't like it for any chosen bucket count, not just
for a power of two but even for a prime. Of course, for a
power-of-two size my dislike of it runs deeper.
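
For completeness, one well-known reduction that avoids the division
altogether is the multiply-shift method: for a 32-bit hash h and any
table size n, ((uint64_t)h * n) >> 32 lands in [0, n). A sketch; the
choice of a 32-bit hash width here is mine:

    #include <stdint.h>

    uint32_t reduce_mul(uint32_t h, uint32_t n)
    {
        /* take the high 32 bits of the 64-bit product; this
           uses the upper bits of the hash, not the lower ones */
        return (uint32_t)(((uint64_t)h * n) >> 32);
    }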