From: Tim Rentsch <tr.17687@z991.linuxsc.com>
Newsgroups: comp.lang.c
Subject: Re: Good hash for pointers
Date: Sun, 02 Jun 2024 16:02:05 -0700
Message-ID: <86le3nne36.fsf@linuxsc.com>

Michael S <already5chosen@yahoo.com> writes:

> On Thu, 30 May 2024 19:27:48 -0700
> Tim Rentsch <tr.17687@z991.linuxsc.com> wrote:
>
>> Bonita Montero <Bonita.Montero@gmail.com> writes:
>>
>>> On 26.05.2024 at 19:20, Tim Rentsch wrote:
>>>
>>>> I say the output quality is poor because I have run tests that
>>>> show the poor output quality.
>>>
>>> If you chose a prime whose double is beyond 64 bit there's an
>>> equal distribution among the 64 bit modulos.
>>>
>>>> I've done that with a prime of my own choosing and also with
>>>> 18446744073709551557, the value you suggested.
>>>
>>> Show me your code.
>>
>> Oh get real.  It's not my job to make up for your laziness.
>
> So, what were your conclusions?
> Ignoring the speed of computation, would something like
> cryptographic hash scaled to bucket size be a best hash for
> this type of application?  Or some sort of less thorough
> grinding of the bits is better?

Factors that matter:

  will hashing be done using rehashing or aggregates (lists or
  trees, etc) grouped under the first probe location?  (or both?)

  can keys be compared for ordering or just for equality?

  how expensive is it to test for key equality?
    (a) when the keys are equal
    (b) when the keys are not equal

32-bit hash values suffice for (a) re-hashing schemes with tables
up to perhaps 2**30 entries, or (b) schemes that use direct hashing
and bucket aggregation with tables up to 2**32 slots.  Anything
bigger should use 64-bit hash values.

Hash functions should always produce a full-size hash value (either
32 bits or 64 bits).  Fitting the hash value to the size of the
table is the responsibility of the hash table management code, not
the hash function.
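To illustrate that division of labor, here is a minimal sketch of
table-management code reducing a full 64-bit hash to a slot index.
The function names and the power-of-two variant are just for
illustration, not code from my tests:

  #include <stdint.h>
  #include <stddef.h>

  /* Table management reduces the full-width hash to an index;
     the hash function itself never sees the table size. */
  static size_t
  slot_for_hash (uint64_t hash, size_t nslots)
  {
      return (size_t) (hash % nslots);    /* general case */
  }

  /* Variant for power-of-two table sizes:  keep the high bits,
     which are often better mixed than the low bits.
     Assumes 1 <= log2_nslots <= 63. */
  static size_t
  slot_for_hash_pow2 (uint64_t hash, unsigned log2_nslots)
  {
      return (size_t) (hash >> (64 - log2_nslots));
  }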
The goal:

An ideal hash function has the property that changing any single
bit of the input key has a 50% chance of flipping any output bit,
with no correlation between output bit m and output bit n (m != n)
for changing a given input bit, and no correlation between changes
to output bit n for changing input bit i and changing input bit j
(i != j).

Evaluating hash functions:

One way to evaluate a hash function is to measure changes in output
bits as a function of changing various input bits, either
exhaustively or statistically, looking for an ideal distribution as
described above.  (A statistical version of this measurement is
sketched at the end of this post.)  Usually doing this isn't very
helpful, because most hash functions don't come close to the ideal,
and it's hard to interpret results that are not close to the ideal,
even though those hashes may be perfectly fine in practice.

So basically we are left with comparing a candidate hash function,
performance-wise, to other hash functions that are going to have
good performance results.  Three obvious choices for those:  a slow
but high quality hash such as a cryptographic hash;  using a good
random number generator to produce keys where the hash value is
equal to the key value;  comparing against theoretical ideals.
Looking at each in turn:

High-quality-but-slow hash:  it isn't hard to make a hash function
that approximates an ideal hash, as long as we are willing to have
it run like a turtle.  Or simply use one of the many known
cryptographic hashes and fold the output down to a reasonably
sized hash value (most of these cryptographic hash functions
produce large hash values, 128 bits and up, which isn't a great
fit for ordinary hash table use).  In my recent testing I simply
rolled my own high quality hash, which ran about 10 times slower
than the slowest of the other hashes I looked at.

Using a good RNG to produce keys the same as the hash value:  I
expect this method to be self-explanatory.

Comparing against theoretical ideals:  it isn't hard to do a
calculation that computes a statistical average for the two kinds
of hashing methods.  More specifically, (a) for re-hashing, compute
the average number of probes taken for a hash table with k out of
n slots already filled, and use that to compute the average number
of probes needed to fill the hash table up to, say, 85%;  (b) for
direct hashing, compute the average number of slots occupied after
inserting some number of values representing a fixed fraction of
the table size.  For example, if we have a table with 100,000
slots and we insert 100,000 values, then on average about 63% of
the slots should be occupied (very close to 1 - 1/e).  (A short
program that prints these comparison values appears at the end of
this post.)

Use one or more of the above comparison values (they should all be
very close) to compare against the performance of the candidate
hash function for each of the possible workloads.  My workload set
included:

  (a) results of malloc() with fixed sized blocks, with the size
      going from 1 to some modest value (50 or 100)  (a sketch of
      generating this workload appears at the end of this post)

  (b) results of malloc() with randomly chosen sizes based on some
      distribution

  (c) sequential addresses in an array, varying the element size
      between 1 and some modest value (maybe 50)

  (d) results of malloc() with a deterministic set of sizes.  I
      tried various ones of these, such as size(i) == i/17 + 1
      (filling a table of 65536 entries to varying percentages
      full).

What we would like to see:  for every workload tested, results
within 10 or 20% of the comparison metric used.

What we don't want to see (this is important):  any result far
away from the comparison metric on any workload.
Importantly this includes results that are significantly _better_
than the comparison metric.  The reason for this is that doing a
lot better on one test inevitably means doing worse on a different
test.  The same is true of random number generators:  the best
ones look "averagely random" all the time, never either "sub
random" or "super random".

Of course we always want high quality, but the standards are
higher for (a) direct hashing, and/or (b) cases where comparing
keys is expensive.  Note that if comparing keys is expensive we
might choose to store the full hash value with each table entry,
to make a "trivial reject" on key comparison faster.  (A small
sketch of this appears below.)

Probably much of the above was obvious.  I make no apologies for
that.  Also it may seem disjointed or unorganized.  If so then
sorry about that, chief;  it's a big topic and there's a lot of
ground to cover, and I tried to hit the most important highlights.
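For concreteness, a few small sketches follow.  None of this is
the test code I actually ran;  function names, constants, and
parameters are just illustrative.

First, the statistical bit-flip measurement mentioned under
"Evaluating hash functions" above:  flip each input bit of a
random key, hash both keys, and tally how often each output bit
changes.  Ideally every tally comes out near 50%.  (This checks
only the marginal flip probabilities, not the between-bit
correlations.)  The candidate hash shown is a stand-in, a
splitmix64-style finalizer;  any hash under test could be dropped
in its place:

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* stand-in candidate hash (splitmix64-style finalizer) */
  static uint64_t
  hash64 (uint64_t x)
  {
      x ^= x >> 30;  x *= 0xbf58476d1ce4e5b9u;
      x ^= x >> 27;  x *= 0x94d049bb133111ebu;
      x ^= x >> 31;
      return x;
  }

  #define TRIALS 10000

  int
  main (void)
  {
      static unsigned long flips[64][64];  /* [input bit][output bit] */

      srand (12345);
      for (long t = 0;  t < TRIALS;  t++) {
          /* rand() is crude but adequate for a sketch */
          uint64_t key = ((uint64_t) rand () << 32) ^ (unsigned) rand ();
          uint64_t h = hash64 (key);
          for (int i = 0;  i < 64;  i++) {
              uint64_t d = h ^ hash64 (key ^ ((uint64_t) 1 << i));
              for (int j = 0;  j < 64;  j++)
                  if ((d >> j) & 1)  flips[i][j]++;
          }
      }

      /* report the worst deviations from the ideal 0.5 */
      double lo = 1, hi = 0;
      for (int i = 0;  i < 64;  i++) {
          for (int j = 0;  j < 64;  j++) {
              double p = (double) flips[i][j] / TRIALS;
              if (p < lo)  lo = p;
              if (p > hi)  hi = p;
          }
      }
      printf ("flip probability range: %.4f .. %.4f (ideal 0.5)\n",
              lo, hi);
      return 0;
  }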
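Second, the "comparing against theoretical ideals" numbers.  For
direct hashing, the expected fraction of occupied slots after
inserting m uniformly hashed values into n slots is
1 - (1 - 1/n)**m, which is about 1 - 1/e when m == n.  For
re-hashing I show a standard uniform-hashing approximation,
(1/a) * ln(1/(1-a)) probes per insertion averaged over filling the
table to load factor a;  not necessarily the exact calculation I
did, but in the same spirit:

  #include <math.h>
  #include <stdio.h>

  /* link with -lm */
  int
  main (void)
  {
      /* direct hashing: expected occupancy after m inserts, n slots */
      double n = 100000, m = 100000;
      double occ = 1 - pow (1 - 1/n, m);
      printf ("expected occupancy: %.4f  (1 - 1/e = %.4f)\n",
              occ, 1 - exp (-1.0));

      /* re-hashing (uniform hashing approximation): average probes
         per insertion while filling the table to load factor a */
      double a = 0.85;
      double probes = (1/a) * log (1/(1 - a));
      printf ("average probes filling to %.0f%%: %.3f\n",
              100*a, probes);
      return 0;
  }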
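Third, generating one of the workloads, namely workload (a) from
the list above:  the keys are the pointers returned by malloc() of
fixed-size blocks, hashed directly into a table, with the occupancy
tallied against the expected ~63%.  The table size, key count, and
block size are arbitrary placeholders:

  #include <stdint.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* same stand-in hash as before */
  static uint64_t
  hash64 (uint64_t x)
  {
      x ^= x >> 30;  x *= 0xbf58476d1ce4e5b9u;
      x ^= x >> 27;  x *= 0x94d049bb133111ebu;
      x ^= x >> 31;
      return x;
  }

  #define NSLOTS 100000
  #define NKEYS  100000

  int
  main (void)
  {
      static unsigned char occupied[NSLOTS];
      size_t block_size = 48;   /* one fixed size; vary 1..100 in a real run */
      long used = 0;

      for (long i = 0;  i < NKEYS;  i++) {
          /* the key is the pointer; blocks are deliberately not
             freed so the addresses stay distinct */
          void *p = malloc (block_size);
          size_t slot = hash64 ((uint64_t)(uintptr_t) p) % NSLOTS;
          if (! occupied[slot]) {  occupied[slot] = 1;  used++;  }
      }

      printf ("occupied %.1f%% of slots (ideal about 63.2%%)\n",
              100.0 * used / NSLOTS);
      return 0;
  }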
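Finally, the "store the full hash value with each entry" idea
mentioned just above, for when key comparison is expensive.  The
field and function names are made up for the example;  the point
is just that a mismatch on the stored hash skips the expensive
comparison entirely:

  #include <stdint.h>
  #include <string.h>

  struct entry {
      uint64_t    hash;    /* full-width hash of the key, saved at insert */
      const char *key;     /* stand-in for a key that's costly to compare */
      void       *value;
  };

  static int
  entry_matches (const struct entry *e, uint64_t h, const char *key)
  {
      if (e->hash != h)
          return 0;                       /* trivial reject: no key compare */
      return strcmp (e->key, key) == 0;   /* expensive compare only when
                                             the hashes agree */
  }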