From: Richard Harnden
Newsgroups: comp.lang.c
Subject: Re: realloc() - frequency, conditions, or experiences about relocation?
Date: Mon, 17 Jun 2024 16:15:27 +0100
Organization: A noiseless patient Spider
References: <875xu8vsen.fsf@bsb.me.uk> <87zfrjvqp6.fsf@bsb.me.uk> <87tthrvdtg.fsf@bsb.me.uk>
Reply-To: richard.harnden@invalid.com
In-Reply-To: <87tthrvdtg.fsf@bsb.me.uk>

On 17/06/2024 15:33, Ben Bacarisse wrote:
> Malcolm McLean writes:
>
>> On 17/06/2024 10:55, Ben Bacarisse wrote:
>>> Malcolm McLean writes:
>>>
>>>> On 17/06/2024 10:18, Ben Bacarisse wrote:
>>>>> Janis Papanagnou writes:
>>>>>
>>>>>> In a recent thread realloc() was a substantial part of the discussion.
>>>>>> "Occasionally" the increased data storage will be relocated along
>>>>>> with the previously stored data. On huge data sets that might be a
>>>>>> performance factor. Is there any experience or are there any concrete
>>>>>> factors about the conditions when this relocation happens? - I could
>>>>>> imagine that it's no issue as long as you're in some kB buffer range,
>>>>>> but if, say, we're using realloc() to substantially increase buffers
>>>>>> often it might be an issue to consider. It would be good to get some
>>>>>> feeling about that internal.
>>>>> There is obviously a cost, but there is (usually) no alternative if
>>>>> contiguous storage is required.  In practice, the cost is usually
>>>>> moderate and can be very effectively managed by using an exponential
>>>>> allocation scheme: at every reallocation, multiply the storage space by
>>>>> some factor greater than 1 (I often use 3/2, but doubling is often used
>>>>> as well).  This results in O(log(N)) rather than O(N) allocations as in
>>>>> your code that added a constant to the size.  Of course, some storage is
>>>>> wasted (that /might/ be retrieved by a final realloc down to the final
>>>>> size), but that's rarely significant.
>>>>>
>>>> So can we work it out?
>>> What is "it"?
>>>
>>>> Let's assume for the moment that the allocations have a semi-normal
>>>> distribution,
>>> What allocations?  The allocations I talked about don't have that
>>> distribution.
>>>
>>>> with negative values disallowed. Now, ignoring the first few
>>>> values, if we have allocated, say, 1K, we ought to be able to predict the
>>>> value by integrating the distribution from 1K to infinity and taking the
>>>> mean.
>>> I have no idea what you are talking about.  What "value" are you looking
>>> to calculate?
>>>
>> We have a continuously growing buffer, and we want the best strategy for
>> reallocations as the stream of characters comes at us.  So, given we know
>> how many characters have arrived, can we predict how many will arrive, and
>> therefore ask for the best amount when we reallocate, so that we neither
>> make too many reallocations (reallocate on every byte received) nor ask
>> for too much (demand SIZE_MAX memory when the first byte is received)?
>
> Obviously not, or we'd use the prediction.  Your question was probably
> rhetorical, but it didn't read that way.
>
>> Your strategy for avoiding these extremes is exponential growth.
>
> It's odd to call it mine.  It's very widely known and used.  "The one I
> mentioned" might be a less confusing description.
>
>> You allocate a small amount for the first few bytes.  Then you use
>> exponential growth, with a factor of either 2 or 1.5.  My question is
>> whether or not we can be cuter.  And of course we need to know the
>> statistical distribution of the input files.  And I'm assuming a
>> semi-normal distribution, ignoring the files with small values, which we
>> will allocate enough for anyway.
>>
>> And so we integrate the distribution between the point we are at and
>> infinity.  Then we take the mean.  And that gives us a best estimate of
>> how many bytes are to come, and therefore how much to grow the buffer by.
>
> I would be surprised if that were worth the effort at run time.  A
> static analysis of "typical" input sizes might be interesting, as that
> could be used to get an estimate of good factors to use, but anything
> more complicated than maybe a few factors (e.g. doubling up to 1MB, then
> 3/2 thereafter) is likely to be too messy to be useful.
>
> Also, the cost of reallocations is not constant.  Larger ones are
> usually more costly than small ones, so if one were going to a lot of
> effort to make run-time guesses, that cost should be factored in as
> well.
>

I usually keep track:

struct {
        size_t used;
        size_t allocated;
        void *data;
};

Then, if used + new_size is more than what's already been allocated, a
realloc will be required.

Start with an initial allocated size that's 'reasonable' - the happy
path will never need any reallocs.  Otherwise, multiply by some factor.
Typically I just double it.