Path: ...!feeds.phibee-telecom.net!3.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Malcolm McLean <malcolm.arthur.mclean@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: realloc() - frequency, conditions, or experiences about
 relocation?
Date: Mon, 17 Jun 2024 13:45:19 +0100
Organization: A noiseless patient Spider
Lines: 65
Message-ID: <v4pb4v$lhgk$1@dont-email.me>
References: <v4ojs8$gvji$1@dont-email.me> <875xu8vsen.fsf@bsb.me.uk>
 <v4ovqf$j5hq$1@dont-email.me> <87zfrjvqp6.fsf@bsb.me.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 17 Jun 2024 14:45:19 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="f3df68eb903434ea9efb6aee921a4925";
	logging-data="706068"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19sTFwr4AmENuJZLmuA9zuFimiGNcxcCwY="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:SEhC4UDf/MewTmUE3uaxLRCL4nE=
Content-Language: en-GB
In-Reply-To: <87zfrjvqp6.fsf@bsb.me.uk>
Bytes: 4440

On 17/06/2024 10:55, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
> 
>> On 17/06/2024 10:18, Ben Bacarisse wrote:
>>> Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
>>>
>>>> In a recent thread realloc() was a substantial part of the discussion.
>>>> "Occasionally" the increased data storage will be relocated along
>>>> with the previously stored data. On huge data sets that might be a
>>>> performance factor. Is there any experience or are there any concrete
>>>> factors about the conditions when this relocation happens? - I could
>>>> imagine that it's no issue as long as you're in some kB buffer range,
>>>> but if, say, we're using realloc() to substantially increase buffers
>>>> often it might be an issue to consider. It would be good to get some
>>>> feeling about those internals.
>>> There is obviously a cost, but there is (usually) no alternative if
>>> contiguous storage is required.  In practice, the cost is usually
>>> moderate and can be very effectively managed by using an exponential
>>> allocation scheme: at every reallocation multiply the storage space by
>>> some factor greater than 1 (I often use 3/2, but doubling is often used
>>> as well).  This results in O(log(N)) rather than O(N) allocations as in
>>> your code that added a constant to the size.  Of course, some storage is
>>> wasted (that /might/ be retrieved by a final realloc down to the final
>>> size) but that's rarely significant.
>>>
>> So can we work it out?
> 
> What is "it"?
> 
>> Let's assume for the moment that the allocations have a semi-normal
>> distribution,
> 
> What allocations?  The allocations I talked about don't have that
> distribution.
> 
>> with negative values disallowed. Now ignoring the first few
>> values, if we have allocated, say, 1K, we ought to be able to predict the
>> value by integrating the distribution from 1k to infinity and taking the
>> mean.
> 
> I have no idea what you are talking about.  What "value" are you looking
> to calculate?
> 
We have a continuously growing buffer, and we want the best strategy for 
reallocations as the stream of characters comes at us. So, given we know 
how many characters have arrived, can we predict how many will arrive, 
and therefore ask for the best amount when we reallocate, so that we 
neither make too many reallocations (reallocate on every byte received) 
nor ask for too much (demand SIZE_MAX memory when the first byte is 
received)?
Your strategy for avoiding these extremes is exponential growth. You 
allocate a small amount for the first few bytes. Then you use 
exponential growth, with a factor of either 2 or 1.5. My question is 
whether or not we can be cuter. And of course we need to know the 
statistical distribution of the input files. I'm assuming a semi-normal 
(half-normal) distribution of sizes, ignoring the small files, for which 
we will allocate enough anyway.
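
To pin down what we are comparing against, here is a minimal sketch of 
that scheme (the 3/2 factor and the starting capacity of 16 are 
arbitrary choices on my part):

#include <stdlib.h>

/* Append one byte to a growable buffer, multiplying the capacity
   by 3/2 whenever it is full.  This gives O(log N) calls to
   realloc() for N bytes instead of O(N). */
static int append_byte(char **buf, size_t *len, size_t *cap, char ch)
{
    if (*len == *cap) {
        size_t newcap = *cap ? *cap + *cap / 2 : 16;
        char *tmp = realloc(*buf, newcap);
        if (tmp == NULL)
            return -1;   /* old buffer is still valid */
        *buf = tmp;
        *cap = newcap;
    }
    (*buf)[(*len)++] = ch;
    return 0;
}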

And so we integrate the distribution from the point we are at to 
infinity, and take the mean of that tail. That gives us a best estimate 
of how many bytes are to come, and therefore how much to grow the 
buffer by.
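
If the sizes really are half-normal with scale sigma, that tail mean 
has a closed form:

    E[X | X > a] = sigma * sqrt(2/pi) * exp(-a^2 / (2*sigma^2))
                   / erfc(a / (sigma * sqrt(2)))

A sketch, where sigma is something we would have to estimate from a 
corpus of real files:

#include <math.h>

/* Mean of the tail of a half-normal distribution with scale sigma,
   i.e. E[X | X > a].  At a = 0 this reduces to sigma*sqrt(2/pi),
   the mean of the whole distribution. */
static double tail_mean(double a, double sigma)
{
    const double pi = 3.14159265358979323846;
    double z = a / (sigma * sqrt(2.0));
    return sigma * sqrt(2.0 / pi) * exp(-z * z) / erfc(z);
}

So on each reallocation we would grow the buffer to 
tail_mean(bytes_so_far, sigma) rather than to a fixed multiple of the 
current size. Whether that actually beats plain 3/2 growth depends 
entirely on how well the half-normal fits real inputs.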
-- 
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc