Deutsch English Français Italiano |
<86r0ddmsf6.fsf@linuxsc.com> View for Bookmarking (what is this?) Look up another Usenet article |
Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail From: Tim Rentsch <tr.17687@z991.linuxsc.com> Newsgroups: comp.lang.c Subject: Re: Good hash for pointers Date: Mon, 03 Jun 2024 18:02:21 -0700 Organization: A noiseless patient Spider Lines: 101 Message-ID: <86r0ddmsf6.fsf@linuxsc.com> References: <v2n88p$1nlcc$1@dont-email.me> <v2qm8m$2el55$1@raubtier-asyl.eternal-september.org> <v2qnue$2evlu$1@dont-email.me> <v2r9br$2hva2$1@dont-email.me> <86fru6gsqr.fsf@linuxsc.com> <v2sudq$2trh1$1@raubtier-asyl.eternal-september.org> <8634q5hjsp.fsf@linuxsc.com> <v2vmhr$3ffjk$1@raubtier-asyl.eternal-september.org> <86le3wfsmd.fsf@linuxsc.com> <v2voe7$3fr50$1@raubtier-asyl.eternal-september.org> <86ed9ofq14.fsf@linuxsc.com> <v2vs40$3gflh$1@raubtier-asyl.eternal-september.org> <86sexypvff.fsf@linuxsc.com> <20240602104506.000072e4@yahoo.com> <86le3nne36.fsf@linuxsc.com> <20240603105005.0000091f@yahoo.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Injection-Date: Tue, 04 Jun 2024 03:02:23 +0200 (CEST) Injection-Info: dont-email.me; posting-host="2e3473276f70e973f0e20911b7f6b99d"; logging-data="136753"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18K3YfQ994DrKXk9nyjq/yrR7V86UI4IeE=" User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux) Cancel-Lock: sha1:T0NMVEU/0WkrtKdbN657sxGlb7k= sha1:ITmaDoXZw2c5jOU8hMrwVVS3vpY= Bytes: 6116 Michael S <already5chosen@yahoo.com> writes: > On Sun, 02 Jun 2024 16:02:05 -0700 > Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: > >> Michael S <already5chosen@yahoo.com> writes: >> >>> On Thu, 30 May 2024 19:27:48 -0700 >>> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote: [... concerning quality of hash functions ...] >>> So, what were your conclusions? >>> Ignoring the speed of computation, would something like >>> cryptographic hash scaled to bucket size be a best hash for >>> this type of application? Or some sort of less thorough >>> grinding of the bits is better? >> >> Factors that matter: > > <snip> > >> Probably much of the above was obvious. I make no apologies for >> that. Also it may seem disjointed or unorganized. If so then >> sorry about that chief, it's a big topic and there's a lot of >> ground to cover, and I tried to hit the most important >> highlights. > > It sounds like you *postulate* that crypto is an ideal. I think it's more accurate to say that I _expect_ an established crypto-quality hash function such as MD5 to be good approximation to an ideal hash function. I expect that because (a) the design criteria for such hash functions resembles or includes the design criteria for a near-ideal hash function, (b) the people who have done these are smart people with (usually) a fair amount of experience doing such things, and (c) the established ones have gone through a lot of use and testing by the user community, who likely would have found any serious weaknesses. > I am less in axioms and more interested in your experimental > findings. I'm not sure what you're looking for here. The testing I did for the various recently posted hash functions was simply running a few ad hoc test scenarios and comparing the results against each other and also against the results of hash function of my own devising likely to produce high quality results. One piece of empirical evidence: in all of the tests I ran, that hash function never gave results that varied by more than 10% or so, not matter what test was done. By contrast, most or all of the other hash functions sometimes gave results that varied by a factor of 4 or more, depending on what scenario was being tested. Let me say again that the test scenarios were ad hoc, and I was not being systematic but just eyeballing the numbers. I have a fair amount of experience (at least hundreds of hours) experimenting with various random number generators, including several of my own devising. The experience includes systematic testing using extensive test suites (done by people who really know what they are doing, which in this arena does not include myself). Random number generators and hash functions have a lot of principles in common. Suppose we have a hash function that takes a 64-bit key and produces a 32-bit hash value. If we use that hash function with simply successive 64-bit integer values, and the outputs pass the statistical tests of a good random number generator test suite, you can be sure that hash function has very high quality. A key observation about random number generators is this: if the output values are N bits, generally the amount of internal state should be at least 2N bits. The same thing is true of hash functions. If a hash function produces, say, 32-bit values, internally there should be at least 64 bits of state that is mushed together in various ways, finally being combined in the final step to produce the 32-bit result. Another design principle is to replicate information. For example, if we are hashing a character string to produce a 32-bit hash value, we might combine each character in the string with two different 32-bit "accumulators", perhaps adding into one and xoring into the other, and rotating them by different amounts after each character scanned. Composing code on the fly (not compiled, and certainly not tested): unsigned string_hash( const char *s ){ unsigned a = 12345, b = 0xabcdef; char c; while( c = *s++ ){ a += c, b ^= c; a = a << 23 | a >> 32-23; b = b << 10 | b >> 32-10; a -= b; // why not? ;) } return a ^ b; } This code has a fair chance of producing acceptable hash values for strings that have at least three characters. (Let me say again that I have done no testing at all.) So I hope this posting has some of what you're looking for. And if it doesn't, well, then I am as much in the dark as ever. :)