From: Malcolm McLean <malcolm.arthur.mclean@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: realloc() - frequency, conditions, or experiences about relocation?
Date: Mon, 17 Jun 2024 16:36:57 +0100
Organization: A noiseless patient Spider
Message-ID: <v4pl6q$nn9o$1@dont-email.me>
References: <v4ojs8$gvji$1@dont-email.me> <875xu8vsen.fsf@bsb.me.uk>
 <v4ovqf$j5hq$1@dont-email.me> <87zfrjvqp6.fsf@bsb.me.uk>
 <v4pb4v$lhgk$1@dont-email.me> <87tthrvdtg.fsf@bsb.me.uk>
In-Reply-To: <87tthrvdtg.fsf@bsb.me.uk>

On 17/06/2024 15:33, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>
>> On 17/06/2024 10:55, Ben Bacarisse wrote:
>>> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>>>
>>>> On 17/06/2024 10:18, Ben Bacarisse wrote:
>>>>> Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
>>>>>
>>>>>> In a recent thread realloc() was a substantial part of the
>>>>>> discussion. "Occasionally" the increased data storage will be
>>>>>> relocated along with the previously stored data. On huge data sets
>>>>>> that might be a performance factor. Is there any experience or are
>>>>>> there any concrete factors about the conditions when this
>>>>>> relocation happens? - I could imagine that it's no issue as long
>>>>>> as you're in some kB buffer range, but if, say, we're using
>>>>>> realloc() to substantially increase buffers often it might be an
>>>>>> issue to consider. It would be good to get some feeling about that
>>>>>> internal.
>>>>>
>>>>> There is obviously a cost, but there is (usually) no alternative if
>>>>> contiguous storage is required. In practice, the cost is usually
>>>>> moderate and can be very effectively managed by using an exponential
>>>>> allocation scheme: at every reallocation multiply the storage space
>>>>> by some factor greater than 1 (I often use 3/2, but doubling is
>>>>> often used as well). This results in O(log(N)) rather than O(N)
>>>>> allocations as in your code that added a constant to the size. Of
>>>>> course, some storage is wasted (that /might/ be retrieved by a final
>>>>> realloc down to the final size) but that's rarely significant.
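
(Concretely, the kind of scheme being described is something like the
following - a minimal, untested sketch; the function name, the initial
size of 16, and the exact rounding are mine rather than anything from
the code being discussed:)

#include <stdlib.h>

/* Geometric buffer growth: whenever more room is needed, multiply the
   capacity by roughly 3/2. Returns the (possibly moved) buffer, or NULL
   on failure, in which case the old block and *capacity are untouched.
   No overflow checking, for brevity. */
static void *grow(void *buf, size_t *capacity, size_t needed)
{
    size_t newcap = *capacity ? *capacity : 16;   /* arbitrary initial size */

    while (newcap < needed)
        newcap += newcap / 2 + 1;                 /* factor of about 3/2 */
    if (newcap == *capacity)
        return buf;                               /* already big enough */
    void *p = realloc(buf, newcap);
    if (p != NULL)
        *capacity = newcap;
    return p;
}

Because the capacity grows geometrically, reading N bytes costs O(log N)
calls to realloc() rather than O(N).
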
>>>> So can we work it out?
>>>
>>> What is "it"?
>>>
>>>> Let's assume for the moment that the allocations have a semi-normal
>>>> distribution,
>>>
>>> What allocations? The allocations I talked about don't have that
>>> distribution.
>>>
>>>> with negative values disallowed. Now ignoring the first few values,
>>>> if we have allocated, say, 1K, we ought to be able to predict the
>>>> value by integrating the distribution from 1k to infinity and taking
>>>> the mean.
>>>
>>> I have no idea what you are talking about. What "value" are you
>>> looking to calculate?
>>>
>> We have a continuously growing buffer, and we want the best strategy
>> for reallocations as the stream of characters comes at us. So, given
>> we know how many characters have arrived, can we predict how many will
>> arrive, and therefore ask for the best amount when we reallocate, so
>> that we neither make too many reallocations (reallocate on every byte
>> received) nor ask for too much (demand SIZE_MAX memory when the first
>> byte is received)?
>
> Obviously not, or we'd use the prediction. Your question was probably
> rhetorical, but it didn't read that way.
>
>> Your strategy for avoiding these extremes is exponential growth.
>
> It's odd to call it mine. It's very widely known and used. "The one I
> mentioned" might be a less confusing description.
>
>> You allocate a small amount for the first few bytes. Then you use
>> exponential growth, with a factor of either 2 or 1.5. My question is
>> whether or not we can be cuter. And of course we need to know the
>> statistical distribution of the input files. And I'm assuming a
>> semi-normal distribution, ignoring the files with small values, which
>> we will allocate enough for anyway.
>>
>> And so we integrate the distribution between the point we are at and
>> infinity. Then we take the mean. And that gives us a best estimate of
>> how many bytes are to come, and therefore how much to grow the buffer
>> by.
>
> I would be surprised if that were worth the effort at run time. A
> static analysis of "typical" input sizes might be interesting as that
> could be used to get an estimate of good factors to use, but anything
> more complicated than maybe a few factors (e.g. doubling up to 1MB then
> 3/2 thereafter) is likely to be too messy to be useful.
>
There's virtually no run-time effort, unless you ask the caller to pass
in a customised distribution which you analyse on the fly, which would
be quite a bit of work. All the work is done beforehand.

We need a statistical distribution of the sizes of the files we are
interested in - so, probably, text files on personal computers. Then we
exclude the small files, say under 1k, which will have an odd
distribution for various reasons, and which we are not interested in
anyway because we can easily afford 1k as an initial buffer. What
remains is probably semi-normal, maybe log-normal; there's no reason to
suspect it would be anything odd.

With the normal distribution there is no closed-form integral, but
tables of integrals are published. So we convert 1k to a Z-score,
integrate from there to infinity, halve the result, and look up the
size whose upper tail has that area. Having established that the file
is over 1k, half of such files will be below that size and half above,
so that is the next amount to realloc to. Say, for the sake of
argument, 4k. Then we do the same thing starting from 4k, working out
the best guess for the file size given that the file is over 4k. Now
the distribution tends to flatten out towards the tail, so if the best
guess given at least 1k was 4k, the best guess given 4k won't be 8k; it
will be 10k, maybe 12k. Do the same again for 12k, and we get a series
of numbers like this: 1k, 4k, 10k, 20k, 50k, 120k, 500k, 2MB, 8MB ...
and so on, rapidly increasing to SIZE_MAX.

Then at run time we just hardcode those in - it's a lookup table with
not too many entries. Because each step is the point that half the
remaining files exceed, half the time you will have to reallocate
again. You can easily fine-tune the strategy by choosing a proportion
other than 0.5, depending on whether saving memory or reducing
allocations is more important to you. The hard part is getting some
real statistics to work on.
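
To make the run-time side concrete, it amounts to no more than
something like this - an untested sketch, where the table entries are
made up purely for illustration rather than derived from any real
distribution, and the function names are my own:

#include <stdlib.h>

/* Capacity schedule precomputed off-line from the (assumed) file-size
   distribution. The figures here are invented for illustration only. */
static const size_t schedule[] = {
    1024, 4096, 10240, 20480, 51200, 122880, 512000,
    2097152, 8388608
};
#define NSCHEDULE (sizeof schedule / sizeof schedule[0])

/* Pick the next capacity: the first table entry that covers 'needed',
   falling back to plain 3/2 growth once we are off the end of the
   table. No overflow checking, for brevity. */
static size_t next_capacity(size_t needed)
{
    for (size_t i = 0; i < NSCHEDULE; i++)
        if (schedule[i] >= needed)
            return schedule[i];
    size_t cap = schedule[NSCHEDULE - 1];
    while (cap < needed)
        cap += cap / 2;
    return cap;
}

/* Append one byte to a growing buffer, consulting the schedule. */
static int append_byte(char **buf, size_t *len, size_t *cap, char ch)
{
    if (*len + 1 > *cap) {
        size_t newcap = next_capacity(*len + 1);
        char *p = realloc(*buf, newcap);
        if (!p)
            return -1;        /* old buffer still valid */
        *buf = p;
        *cap = newcap;
    }
    (*buf)[(*len)++] = ch;
    return 0;
}

All the statistical work is baked into the table off-line; at run time
it is just a scan of a handful of entries.
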
> Also, the cost of reallocations is not constant. Larger ones are
> usually more costly than small ones, so if one were going to a lot of
> effort to make run-time guesses, that cost should be factored in as
> well.
>
Unfortunately yes. Real optimisation problems can be almost impossible
for reasons like this. If the cost of reallocations isn't constant,
you've got to put in correcting factors, and then what was a fairly
simple procedure becomes difficult.

-- 
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc