From: Tim Rentsch <tr.17687@z991.linuxsc.com>
Newsgroups: comp.lang.c
Subject: Re: Good hash for pointers
Date: Wed, 05 Jun 2024 08:58:46 -0700
Organization: A noiseless patient Spider
Message-ID: <86a5jzmle1.fsf@linuxsc.com>
References: <v2n88p$1nlcc$1@dont-email.me>
 <v2qm8m$2el55$1@raubtier-asyl.eternal-september.org>
 <v2qnue$2evlu$1@dont-email.me> <v2r9br$2hva2$1@dont-email.me>
 <86fru6gsqr.fsf@linuxsc.com>
 <v2sudq$2trh1$1@raubtier-asyl.eternal-september.org>
 <8634q5hjsp.fsf@linuxsc.com>
 <v2vmhr$3ffjk$1@raubtier-asyl.eternal-september.org>
 <86le3wfsmd.fsf@linuxsc.com>
 <v2voe7$3fr50$1@raubtier-asyl.eternal-september.org>
 <86ed9ofq14.fsf@linuxsc.com> <20240605005916.00001b33@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)

Michael S <already5chosen@yahoo.com> writes:

> On Sun, 26 May 2024 10:20:55 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:

[.. comments on a hash function that simply multiplies the key by a
large prime (18446744073709551557) to give the hash value, noting
that tests have shown poor performance ..]

> 18446744073709551557 is indeed very bad (too close to 2**64).
> But I tried another prime, and at least in my simple tests it
> performs rather well.  Maybe my tests are too simplistic;  I don't
> pretend to understand much about it.
>
> [lots of C code]
>
> Compiled as:
> gcc -s -Wall -O2 -march=native ptr_hash.c hash_bonita.c hash_aes4.c

I couldn't get the AES-based hash function to compile;  gcc
complained about not being able to inline an 'always-inline'
function, and playing around with it for a while didn't get me
anywhere.  I did, however, extract your own hash function and put it
into my little test rig.  Summary comments appear below.

> Run as:
> ./a 100000 75000 80 1000
>
> Result:
> ref: 75000 keys, 100000 slots, 52799 occupied 17431 collisions. [...]
> uut: 75000 keys, 100000 slots, 53201 occupied 17139 collisions. [...]

Both of these occupancy figures are close to the expected theoretical
average, which I calculated to be 52764 (a short sketch of that
calculation appears below).  If I had been more ambitious I might
have done some statistical sampling to get an idea of the standard
deviation, but I wasn't that energetic today. :)

If we're trying to write a general hash function, it should work well
across a wide range of input key sets, and also for different kinds
of hash table management.

Let's look at the output side first.  There are two independent axes
for hash tables:  table size, and direct mapping versus rehashing.

For table size, the two prominent choices are some prime in the
appropriate range or else a power of two.  Clearly any size could be
chosen, but primes have the nice property that they divide up the
space of hash values very evenly, whereas powers of two allow a
cheaper way of reducing a hash value to the range of table indices
(anding versus a modulo operation).  A good hash function should
work well with both.
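
To make the two reduction styles concrete, here is a minimal sketch
(my illustration, not code from the thread;  the particular prime
100003 is just an example of a prime near 100000):

    #include <stdint.h>

    #define PRIME_SIZE  100003u       /* some prime near 100000  */
    #define POW2_SIZE   (1u << 16)    /* 65536, a power of two   */

    /* Prime-sized table:  the modulo folds in all the bits of
       the hash value and divides the hash space very evenly.    */
    static unsigned
    index_prime( uint64_t h ){
        return (unsigned)( h % PRIME_SIZE );
    }

    /* Power-of-two table:  the AND is cheaper than a divide but
       keeps only the low-order bits, so any weakness there
       shows up immediately.                                     */
    static unsigned
    index_pow2( uint64_t h ){
        return (unsigned)( h & (POW2_SIZE - 1) );
    }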
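
As for the 52764 figure mentioned earlier:  with a uniform hash each
slot stays empty with probability (1 - 1/slots)**keys, so the
expected number of occupied slots is slots*(1 - (1 - 1/slots)**keys).
A quick back-of-the-envelope program (again my own sketch;  compile
with -lm):

    #include <stdio.h>
    #include <math.h>

    int
    main( void ){
        double slots = 100000, keys = 75000;
        double expected = slots * ( 1. - pow( 1. - 1./slots, keys ) );
        printf( "expected occupied: %.1f\n", expected );  /* ~52763.5 */
        return 0;
    }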
The choice of direct mapping versus rehashing depends on expected
occupancy rates (which I am used to calling "load factor").  For a
load factor of 75 to 80 percent or less, rehashing is generally
better.  Direct mapping has the advantage that it can handle load
factors above 100%, at the cost of dealing with whatever subsidiary
data structure (usually linked lists) is used when two or more keys
map to the same index in the hash table.  (I expect you know all
that;  I wrote it mostly just to clarify my own thinking.)

For the recent testing I did, I chose a power of two (65536) for the
table size, and rehashing rather than direct mapping.  Both of these
choices make it easier to evaluate how well hash functions perform.
A power-of-two table size reveals weaknesses in the hash function
more readily than taking the hash value modulo a prime;  and
rehashing allows an easy computation of how many probes on average
are needed to do an insertion, which is an easier metric to use for
comparisons than occupancy rates (which typically need some
statistical calculation before the meaning of different resulting
figures can be appreciated).  A sketch of such a probe-counting rig
appears below.

Let me emphasize that I don't mean to advocate any of the different
hash table models over the others.  The key point is that a good
hash function should work well with all of them.

On the input side, it's important to have a wide range of input key
sets, with different sorts of distributions.  The input key sets
might include:  linear addresses out of a single array, that is,
a + i*m with i in the range [0..k], for different values of m (to be
thorough, all m in the range [1..64], all multiples of 2 from 64 to
128, all multiples of 4 from 128 to 256, all multiples of 8 from 256
to 512, and so forth up to all multiples of 64 from 2048 to 4096);
results of malloc() using fixed sizes (all of the m values from the
last example);  results of malloc() using a range of sizes, for
several kinds of random distributions;  results of malloc() using
linearly increasing sizes;  and other kinds of variations.

In my testing I used:  several fixed sizes to malloc(), all less
than 60;  rand()%997 + 43 for malloc() sizes;  malloc( i/7 + 1 ) for
each key number i;  malloc( i/17 + 1 ) for each key number i;
linear addressing in an array, with various multipliers (ad hoc
rather than systematic choices);  and quadratic addressing in an
array (meaning array + i*i*m, for index i and multiplier choice m),
where again the choices for m were ad hoc rather than systematic.
(A sketch of a few of these key-set generators also appears below.)

An interesting result under the "in array" choices is that the
results sometimes changed from run to run, presumably because the
array was placed at a different starting address each time.
Sometimes the differences were surprisingly large.

Like I said before, a good hash function should produce average or
near-average values for all of these scenarios.  It might be nice to
see some above-average results under some workloads, but they don't
make up for well below-average results under other workloads.  The
situation is pretty much the opposite of what happens with
quicksort:  the worst-case behavior of quicksort is quadratic, but
we know that in practice it almost never happens, so quicksort is
used because its typical or average performance is better than that
of other choices.  With hashing, better-than-average performance is
rare and doesn't give much benefit, but worse-than-average
performance is more likely, and can be disastrous.
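
Here is roughly what the probe-counting side of such a rig might
look like.  This is a minimal sketch, not my actual test program:
linear probing stands in for whatever rehash sequence one prefers,
hash() is the (external) function under test, and the figure of
merit is total_probes divided by the number of insertions.

    #include <stddef.h>
    #include <stdint.h>

    #define TABLE_SIZE  (1u << 16)      /* 65536, a power of two */

    static void *table[ TABLE_SIZE ];   /* slots start out NULL  */
    static unsigned long total_probes;

    extern uint64_t hash( void *key );  /* function under test   */

    /* Insert a key, counting how many probes were needed.
       (Assumes the table never becomes completely full.)        */
    static unsigned
    insert( void *key ){
        unsigned i = (unsigned)( hash( key ) & (TABLE_SIZE - 1) );
        unsigned probes = 1;
        while( table[i] != NULL ){
            i = (i + 1) & (TABLE_SIZE - 1);   /* rehash:  next probe */
            probes++;
        }
        table[i] = key;
        total_probes += probes;
        return probes;
    }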
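
And a few of the key-set generators described above, sketched in the
same spirit (C99;  the count and the multipliers are illustrative
rather than the exact values I used):

    #include <stdlib.h>

    #define NKEYS  75000

    /* Linear addressing in one array:  key i is a + i*m.
       The backing array must be at least NKEYS*m bytes.         */
    static void
    linear_keys( void *keys[], char *a, size_t m ){
        for( size_t i = 0;  i < NKEYS;  i++ )  keys[i] = a + i*m;
    }

    /* Quadratic addressing:  key i is a + i*i*m (which needs a
       correspondingly larger backing array).                    */
    static void
    quadratic_keys( void *keys[], char *a, size_t m ){
        for( size_t i = 0;  i < NKEYS;  i++ )  keys[i] = a + i*i*m;
    }

    /* malloc() with a random size distribution.                 */
    static void
    random_malloc_keys( void *keys[] ){
        for( size_t i = 0;  i < NKEYS;  i++ )
            keys[i] = malloc( rand()%997 + 43 );
    }

    /* malloc() with linearly increasing sizes.                  */
    static void
    growing_malloc_keys( void *keys[] ){
        for( size_t i = 0;  i < NKEYS;  i++ )
            keys[i] = malloc( i/7 + 1 );
    }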
Obviously there can be other factors that might motivate other choices in some situations (eg, perfect hashing), but for a general-use hash function the first design criterion should be close-to-average behavior under all scenarios tested.