Michael S <
already5chosen@yahoo.com> writes:
On Sun, 02 Jun 2024 16:02:05 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>
Michael S <already5chosen@yahoo.com> writes:
>
On Thu, 30 May 2024 19:27:48 -0700
Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
[... concerning quality of hash functions ...]
So, what were your conclusions?
Ignoring the speed of computation, would something like
cryptographic hash scaled to bucket size be a best hash for
this type of application? Or some sort of less thorough
grinding of the bits is better?
>
Factors that matter:
>
<snip>
>
Probably much of the above was obvious. I make no apologies for
that. Also it may seem disjointed or unorganized. If so then
sorry about that chief, it's a big topic and there's a lot of
ground to cover, and I tried to hit the most important
highlights.
>
It sounds like you *postulate* that crypto is an ideal.
I think it's more accurate to say that I _expect_ an established
crypto-quality hash function such as MD5 to be good approximation
to an ideal hash function. I expect that because (a) the design
criteria for such hash functions resembles or includes the design
criteria for a near-ideal hash function, (b) the people who have
done these are smart people with (usually) a fair amount of
experience doing such things, and (c) the established ones have
gone through a lot of use and testing by the user community, who
likely would have found any serious weaknesses.
I am less in axioms and more interested in your experimental
findings.
I'm not sure what you're looking for here. The testing I did for
the various recently posted hash functions was simply running a
few ad hoc test scenarios and comparing the results against each
other and also against the results of hash function of my own
devising likely to produce high quality results. One piece of
empirical evidence: in all of the tests I ran, that hash function
never gave results that varied by more than 10% or so, not matter
what test was done. By contrast, most or all of the other hash
functions sometimes gave results that varied by a factor of 4 or
more, depending on what scenario was being tested. Let me say
again that the test scenarios were ad hoc, and I was not being
systematic but just eyeballing the numbers.
I have a fair amount of experience (at least hundreds of hours)
experimenting with various random number generators, including
several of my own devising. The experience includes systematic
testing using extensive test suites (done by people who really
know what they are doing, which in this arena does not include
myself). Random number generators and hash functions have a lot
of principles in common. Suppose we have a hash function that
takes a 64-bit key and produces a 32-bit hash value. If we use
that hash function with simply successive 64-bit integer values,
and the outputs pass the statistical tests of a good random
number generator test suite, you can be sure that hash function
has very high quality. A key observation about random number
generators is this: if the output values are N bits, generally
the amount of internal state should be at least 2N bits. The
same thing is true of hash functions. If a hash function
produces, say, 32-bit values, internally there should be at least
64 bits of state that is mushed together in various ways, finally
being combined in the final step to produce the 32-bit result.
Another design principle is to replicate information. For
example, if we are hashing a character string to produce a 32-bit
hash value, we might combine each character in the string with
two different 32-bit "accumulators", perhaps adding into one and
xoring into the other, and rotating them by different amounts
after each character scanned. Composing code on the fly (not
compiled, and certainly not tested):
unsigned
string_hash( const char *s ){
unsigned a = 12345, b = 0xabcdef;
char c;
while( c = *s++ ){
a += c, b ^= c;
a = a << 23 | a >> 32-23;
b = b << 10 | b >> 32-10;
a -= b; // why not? ;)
}
return a ^ b;
}
This code has a fair chance of producing acceptable hash values
for strings that have at least three characters. (Let me say
again that I have done no testing at all.)
So I hope this posting has some of what you're looking for. And
if it doesn't, well, then I am as much in the dark as ever. :)