From: Richard Harnden
Newsgroups: comp.lang.c
Subject: Re: realloc() - frequency, conditions, or experiences about relocation?
Date: Mon, 17 Jun 2024 16:15:27 +0100
Organization: A noiseless patient Spider
References: <875xu8vsen.fsf@bsb.me.uk> <87zfrjvqp6.fsf@bsb.me.uk> <87tthrvdtg.fsf@bsb.me.uk>
Reply-To: richard.harnden@invalid.com
In-Reply-To: <87tthrvdtg.fsf@bsb.me.uk>

On 17/06/2024 15:33, Ben Bacarisse wrote:
> Malcolm McLean writes:
>
>> On 17/06/2024 10:55, Ben Bacarisse wrote:
>>> Malcolm McLean writes:
>>>
>>>> On 17/06/2024 10:18, Ben Bacarisse wrote:
>>>>> Janis Papanagnou writes:
>>>>>
>>>>>> In a recent thread realloc() was a substantial part of the discussion.
>>>>>> "Occasionally" the increased data storage will be relocated along
>>>>>> with the previously stored data. On huge data sets that might be a
>>>>>> performance factor. Is there any experience or are there any concrete
>>>>>> factors about the conditions when this relocation happens? - I could
>>>>>> imagine that it's no issue as long as you're in some kB buffer range,
>>>>>> but if, say, we're using realloc() to substantially increase buffers
>>>>>> often it might be an issue to consider. It would be good to get some
>>>>>> feeling about that internal.
>>>>> There is obviously a cost, but there is (usually) no alternative if
>>>>> contiguous storage is required.  In practice, the cost is usually
>>>>> moderate and can be very effectively managed by using an exponential
>>>>> allocation scheme: at every reallocation, multiply the storage space by
>>>>> some factor greater than 1 (I often use 3/2, but doubling is often used
>>>>> as well).  This results in O(log(N)) rather than O(N) allocations as in
>>>>> your code that added a constant to the size.  Of course, some storage is
>>>>> wasted (that /might/ be retrieved by a final realloc down to the final
>>>>> size), but that's rarely significant.
>>>>>
>>>> So can we work it out?
>>> What is "it"?
>>>
>>>> Let's assume for the moment that the allocations have a semi-normal
>>>> distribution,
>>> What allocations?  The allocations I talked about don't have that
>>> distribution.
>>>
>>>> with negative values disallowed. Now, ignoring the first few
>>>> values, if we have allocated, say, 1K, we ought to be able to predict the
>>>> value by integrating the distribution from 1K to infinity and taking the
>>>> mean.
>>> I have no idea what you are talking about.  What "value" are you looking
>>> to calculate?
>>>
>> We have a continuously growing buffer, and we want the best strategy for
>> reallocations as the stream of characters comes at us.  So, given we know
>> how many characters have arrived, can we predict how many will arrive, and
>> therefore ask for the best amount when we reallocate, so that we neither
>> make too many reallocations (reallocate on every byte received) nor ask
>> for too much (demand SIZE_MAX memory when the first byte is received)?
>
> Obviously not, or we'd use the prediction.  Your question was probably
> rhetorical, but it didn't read that way.
>
>> Your strategy for avoiding these extremes is exponential growth.
>
> It's odd to call it mine.  It's very widely known and used.  "The one I
> mentioned" might be a less confusing description.
>
>> You allocate a small amount for the first few bytes.  Then you use
>> exponential growth, with a factor of either 2 or 1.5.  My question is
>> whether or not we can be cuter.  And of course we need to know the
>> statistical distribution of the input files.  And I'm assuming a
>> semi-normal distribution, ignoring the files with small values, which we
>> will allocate enough for anyway.
>>
>> And so we integrate the distribution between the point we are at and
>> infinity.  Then we take the mean.  And that gives us a best estimate of
>> how many bytes are to come, and therefore how much to grow the buffer by.
>
> I would be surprised if that were worth the effort at run time.  A
> static analysis of "typical" input sizes might be interesting, as that
> could be used to get an estimate of good factors to use, but anything
> more complicated than maybe a few factors (e.g. doubling up to 1MB, then
> 3/2 thereafter) is likely to be too messy to be useful.
>
> Also, the cost of reallocations is not constant.  Larger ones are
> usually more costly than small ones, so if one were going to a lot of
> effort to make run-time guesses, that cost should be factored in as
> well.
>

I usually keep track:

struct {
        size_t used;
        size_t allocated;
        void *data;
};

Then, if used + new_size is more than what's already been allocated, a
realloc will be required.

Start with an initial allocated size that's 'reasonable' - the happy
path will never need any reallocs.  Otherwise, multiply by some factor.
Typically I just double it.