From: Tim Rentsch <tr.17687@z991.linuxsc.com>
Newsgroups: comp.lang.c
Subject: Re: Good hash for pointers
Date: Sun, 02 Jun 2024 16:02:05 -0700
Message-ID: <86le3nne36.fsf@linuxsc.com>

Michael S <already5chosen@yahoo.com> writes:

> On Thu, 30 May 2024 19:27:48 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>
>> Bonita Montero <Bonita.Montero@gmail.com> writes:
>>
>>> On 26.05.2024 at 19:20, Tim Rentsch wrote:
>>>
>>>> I say the output quality is poor because I have run tests that
>>>> show the poor output quality.
>>>
>>> If you chose a prime whose double is beyond 64 bit there's an
>>> equal distribution among the 64 bit modulos.
>>>
>>>> I've done that with a prime of my own choosing and also with
>>>> 18446744073709551557, the value you suggested.
>>>
>>> Show me your code.
>>
>> Oh get real.  It's not my job to make up for your laziness.
>
> So, what were your conclusions?
> Ignoring the speed of computation, would something like
> cryptographic hash scaled to bucket size be a best hash for
> this type of application?  Or some sort of less thorough
> grinding of the bits is better?

Factors that matter:

  will hashing be done using rehashing or aggregates (lists or
  trees, etc) grouped under the first probe location?  (or both?)

  can keys be compared for ordering or just for equality?

  how expensive is it to test for key equality?
    (a) when the keys are equal
    (b) when the keys are not equal

32-bit hash values suffice for (a) re-hashing schemes with tables
up to perhaps 2**30 entries, or (b) schemes that use direct hashing
and bucket aggregation with tables up to 2**32 slots.  Anything
bigger should use 64-bit hash values.

Hash functions should always produce a full-size hash value (either
32 bits or 64 bits).  Fitting the hash value to the size of the
table is the responsibility of the hash table management code, not
the hash function.
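To illustrate that division of labor, here is a minimal sketch of
table-management code reducing a full 64-bit hash to a slot index.
The function names and the power-of-two variant are just for
illustration, not code from my tests:

  #include <stdint.h>
  #include <stddef.h>

  /* Table management reduces the full-width hash to an index;
     the hash function itself never sees the table size. */
  static size_t
  slot_for_hash (uint64_t hash, size_t nslots)
  {
      return (size_t) (hash % nslots);    /* general case */
  }

  /* Variant for power-of-two table sizes:  keep the high bits,
     which are often better mixed than the low bits.
     Assumes 1 <= log2_nslots <= 63. */
  static size_t
  slot_for_hash_pow2 (uint64_t hash, unsigned log2_nslots)
  {
      return (size_t) (hash >> (64 - log2_nslots));
  }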
The goal:

An ideal hash function has the property that changing any single
bit of the input key has a 50% chance of flipping any output bit,
with no correlation between output bit m and output bit n (m != n)
for changing a given input bit, and no correlation between changes
to output bit n for changing input bit i and changing input bit j
(i != j).

Evaluating hash functions:

One way to evaluate a hash function is to measure changes in output
bits as a function of changing various input bits, either
exhaustively or statistically, looking for an ideal distribution as
described above.  (A statistical version of this measurement is
sketched at the end of this post.)  Usually doing this isn't very
helpful, because most hash functions don't come close to the ideal,
and it's hard to interpret results that are not close to the ideal,
even though those hashes may be perfectly fine in practice.

So basically we are left with comparing a candidate hash function,
performance-wise, to other hash functions that are going to have
good performance results.  Three obvious choices for those:  a slow
but high quality hash such as a cryptographic hash;  using a good
random number generator to produce keys where the hash value is
equal to the key value;  comparing against theoretical ideals.
Looking at each in turn:

High-quality-but-slow hash:  it isn't hard to make a hash function
that approximates an ideal hash, as long as we are willing to have
it run like a turtle.  Or simply use one of the many known
cryptographic hashes and fold the output down to a reasonably
sized hash value (most of these cryptographic hash functions
produce large hash values, 128 bits and up, which isn't a great
fit for ordinary hash table use).  In my recent testing I simply
rolled my own high quality hash, which ran about 10 times slower
than the slowest of the other hashes I looked at.

Using a good RNG to produce keys the same as the hash value:  I
expect this method to be self-explanatory.

Comparing against theoretical ideals:  it isn't hard to do a
calculation that computes a statistical average for the two kinds
of hashing methods.  More specifically, (a) for re-hashing, compute
the average number of probes taken for a hash table with k out of
n slots already filled, and use that to compute the average number
of probes needed to fill the hash table up to, say, 85%;  (b) for
direct hashing, compute the average number of slots occupied after
inserting some number of values representing a fixed fraction of
the table size.  For example, if we have a table with 100,000
slots and we insert 100,000 values, then on average about 63% of
the slots should be occupied (very close to 1 - 1/e).  (A short
program that prints these comparison values appears at the end of
this post.)

Use one or more of the above comparison values (they should all be
very close) to compare against the performance of the candidate
hash function for each of the possible workloads.  My workload set
included:

  (a) results of malloc() with fixed sized blocks, with the size
      going from 1 to some modest value (50 or 100)  (a sketch of
      generating this workload appears at the end of this post)

  (b) results of malloc() with randomly chosen sizes based on some
      distribution

  (c) sequential addresses in an array, varying the element size
      between 1 and some modest value (maybe 50)

  (d) results of malloc() with a deterministic set of sizes.  I
      tried various ones of these, such as size(i) == i/17 + 1
      (filling a table of 65536 entries to varying percentages
      full).

What we would like to see:  for every workload tested, results
within 10 or 20% of the comparison metric used.

What we don't want to see (this is important):  any result far
away from the comparison metric on any workload.
Importantly this includes results that are significantly _better_
than the comparison metric.  The reason for this is that doing a
lot better on one test inevitably means doing worse on a different
test.  The same is true of random number generators:  the best
ones look "averagely random" all the time, never either "sub
random" or "super random".

Of course we always want high quality, but the standards are
higher for (a) direct hashing, and/or (b) cases where comparing
keys is expensive.  Note that if comparing keys is expensive we
might choose to store the full hash value with each table entry,
to make a "trivial reject" on key comparison faster.  (A small
sketch of this appears below.)

Probably much of the above was obvious.  I make no apologies for
that.  Also it may seem disjointed or unorganized.  If so then
sorry about that, chief;  it's a big topic and there's a lot of
ground to cover, and I tried to hit the most important highlights.
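For concreteness, a few small sketches follow.  None of this is
the test code I actually ran;  function names, constants, and
parameters are just illustrative.

First, the statistical bit-flip measurement mentioned under
"Evaluating hash functions" above:  flip each input bit of a
random key, hash both keys, and tally how often each output bit
changes.  Ideally every tally comes out near 50%.  (This checks
only the marginal flip probabilities, not the between-bit
correlations.)  The candidate hash shown is a stand-in, a
splitmix64-style finalizer;  any hash under test could be dropped
in its place:

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* stand-in candidate hash (splitmix64-style finalizer) */
  static uint64_t
  hash64 (uint64_t x)
  {
      x ^= x >> 30;  x *= 0xbf58476d1ce4e5b9u;
      x ^= x >> 27;  x *= 0x94d049bb133111ebu;
      x ^= x >> 31;
      return x;
  }

  #define TRIALS 10000

  int
  main (void)
  {
      static unsigned long flips[64][64];  /* [input bit][output bit] */

      srand (12345);
      for (long t = 0;  t < TRIALS;  t++) {
          /* rand() is crude but adequate for a sketch */
          uint64_t key = ((uint64_t) rand () << 32) ^ (unsigned) rand ();
          uint64_t h = hash64 (key);
          for (int i = 0;  i < 64;  i++) {
              uint64_t d = h ^ hash64 (key ^ ((uint64_t) 1 << i));
              for (int j = 0;  j < 64;  j++)
                  if ((d >> j) & 1)  flips[i][j]++;
          }
      }

      /* report the worst deviations from the ideal 0.5 */
      double lo = 1, hi = 0;
      for (int i = 0;  i < 64;  i++) {
          for (int j = 0;  j < 64;  j++) {
              double p = (double) flips[i][j] / TRIALS;
              if (p < lo)  lo = p;
              if (p > hi)  hi = p;
          }
      }
      printf ("flip probability range: %.4f .. %.4f (ideal 0.5)\n",
              lo, hi);
      return 0;
  }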
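Second, the "comparing against theoretical ideals" numbers.  For
direct hashing, the expected fraction of occupied slots after
inserting m uniformly hashed values into n slots is
1 - (1 - 1/n)**m, which is about 1 - 1/e when m == n.  For
re-hashing I show a standard uniform-hashing approximation,
(1/a) * ln(1/(1-a)) probes per insertion averaged over filling the
table to load factor a;  not necessarily the exact calculation I
did, but in the same spirit:

  #include <math.h>
  #include <stdio.h>

  /* link with -lm */
  int
  main (void)
  {
      /* direct hashing: expected occupancy after m inserts, n slots */
      double n = 100000, m = 100000;
      double occ = 1 - pow (1 - 1/n, m);
      printf ("expected occupancy: %.4f  (1 - 1/e = %.4f)\n",
              occ, 1 - exp (-1.0));

      /* re-hashing (uniform hashing approximation): average probes
         per insertion while filling the table to load factor a */
      double a = 0.85;
      double probes = (1/a) * log (1/(1 - a));
      printf ("average probes filling to %.0f%%: %.3f\n",
              100*a, probes);
      return 0;
  }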
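Third, generating one of the workloads, namely workload (a) from
the list above:  the keys are the pointers returned by malloc() of
fixed-size blocks, hashed directly into a table, with the occupancy
tallied against the expected ~63%.  The table size, key count, and
block size are arbitrary placeholders:

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* same stand-in hash as before */
  static uint64_t
  hash64 (uint64_t x)
  {
      x ^= x >> 30;  x *= 0xbf58476d1ce4e5b9u;
      x ^= x >> 27;  x *= 0x94d049bb133111ebu;
      x ^= x >> 31;
      return x;
  }

  #define NSLOTS 100000
  #define NKEYS  100000

  int
  main (void)
  {
      static unsigned char occupied[NSLOTS];
      size_t block_size = 48;   /* one fixed size; vary 1..100 in a real run */
      long used = 0;

      for (long i = 0;  i < NKEYS;  i++) {
          /* the key is the pointer; blocks are deliberately not
             freed so the addresses stay distinct */
          void *p = malloc (block_size);
          size_t slot = hash64 ((uint64_t)(uintptr_t) p) % NSLOTS;
          if (! occupied[slot]) {  occupied[slot] = 1;  used++;  }
      }

      printf ("occupied %.1f%% of slots (ideal about 63.2%%)\n",
              100.0 * used / NSLOTS);
      return 0;
  }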
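Finally, the "store the full hash value with each entry" idea
mentioned just above, for when key comparison is expensive.  The
field and function names are made up for the example;  the point
is just that a mismatch on the stored hash skips the expensive
comparison entirely:

  #include <stdint.h>
  #include <string.h>

  struct entry {
      uint64_t    hash;    /* full-width hash of the key, saved at insert */
      const char *key;     /* stand-in for a key that's costly to compare */
      void       *value;
  };

  static int
  entry_matches (const struct entry *e, uint64_t h, const char *key)
  {
      if (e->hash != h)
          return 0;                       /* trivial reject: no key compare */
      return strcmp (e->key, key) == 0;   /* expensive compare only when
                                             the hashes agree */
  }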