From: David Brown
Newsgroups: comp.lang.c
Subject: Re: realloc() - frequency, conditions, or experiences about relocation?
Date: Mon, 17 Jun 2024 20:11:48 +0200
Organization: A noiseless patient Spider
Message-ID: 
References: <875xu8vsen.fsf@bsb.me.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Mozilla Thunderbird
In-Reply-To: 
Content-Language: en-GB

On 17/06/2024 11:31, Malcolm McLean wrote:
> On 17/06/2024 10:18, Ben Bacarisse wrote:
>> Janis Papanagnou writes:
>>
>>> In a recent thread realloc() was a substantial part of the discussion.
>>> "Occasionally" the increased data storage will be relocated along
>>> with the previously stored data. On huge data sets that might be a
>>> performance factor. Is there any experience or are there any concrete
>>> factors about the conditions when this relocation happens? - I could
>>> imagine that it's no issue as long as you're in some kB buffer range,
>>> but if, say, we're using realloc() to substantially increase buffers
>>> often it might be an issue to consider. It would be good to get some
>>> feeling about that internal.
>>
>> There is obviously a cost, but there is (usually) no alternative if
>> contiguous storage is required.
>> In practice, the cost is usually moderate and can be very effectively
>> managed by using an exponential allocation scheme: at every
>> reallocation multiply the storage space by some factor greater than 1
>> (I often use 3/2, but doubling is often used as well).  This results
>> in O(log(N)) rather than O(N) allocations as in your code that added
>> a constant to the size.  Of course, some storage is wasted (that
>> /might/ be retrieved by a final realloc down to the final size) but
>> that's rarely significant.
>>
> So can we work it out?
>
> Let's assume for the moment that the allocations have a semi-normal
> distribution, with negative values disallowed. Now ignoring the first
> few values, if we have allocated, say, 1K, we ought to be able to
> predict the value by integrating the distribution from 1k to infinity
> and taking the mean.
>
First, there is no reason to assume such a distribution, other than 
saying "lots of things are roughly normal".

Secondly, knowing the distribution gives you /no/ information about any 
particular case. You know the distribution of the results of rolling 
two dice - does that mean you can predict the next roll?

Thirdly, not all distributions have a mean (look up the Cauchy 
distribution if you like).

Fourthly, even if you know the mean, it tells you nothing of use.

Knowing a bit about the distribution of file sizes can be useful, but 
not in the way you describe here. If you know that the files are rarely 
or never bigger than 10 MB, malloc 10 MB and forget the realloc. If you 
know they are often bigger than that, mmap the file and forget the 
realloc.