Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.lang.c
Subject: Re: C23 thoughts and opinions
Date: Sat, 15 Jun 2024 17:58:22 +0200
Organization: A noiseless patient Spider
Lines: 266
Message-ID: <v4kdmu$3hi90$1@dont-email.me>
References: <v2l828$18v7f$1@dont-email.me>
 <00297443-2fee-48d4-81a0-9ff6ae6481e4@gmail.com>
 <v2lji1$1bbcp$1@dont-email.me> <87msoh5uh6.fsf@nosuchdomain.example.com>
 <f08d2c9f-5c2e-495d-b0bd-3f71bd301432@gmail.com>
 <v2nbp4$1o9h6$1@dont-email.me> <v2ng4n$1p3o2$1@dont-email.me>
 <87y18047jk.fsf@nosuchdomain.example.com>
 <87msoe1xxo.fsf@nosuchdomain.example.com> <v2sh19$2rle2$2@dont-email.me>
 <87ikz11osy.fsf@nosuchdomain.example.com> <v2v59g$3cr0f$1@dont-email.me>
 <87plt8yxgn.fsf@nosuchdomain.example.com> <v31rj5$o20$1@dont-email.me>
 <87cyp6zsen.fsf@nosuchdomain.example.com> <v34gi3$j385$1@dont-email.me>
 <874jahznzt.fsf@nosuchdomain.example.com> <v36nf9$12bei$1@dont-email.me>
 <87v82b43h6.fsf@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 15 Jun 2024 17:58:23 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="f678a482ffafce70c2ceef8ecfac3e10";
	logging-data="3721504"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18vxD4k44lXyJzm2KwhDGinv1dkwtQB1Iw="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:6kMTKpL45xX88ypb0HN+ZpZI4NM=
In-Reply-To: <87v82b43h6.fsf@nosuchdomain.example.com>
Content-Language: en-GB
Bytes: 15561

On 14/06/2024 23:30, Keith Thompson wrote:
> David Brown <david.brown@hesbynett.no> writes:
>> On 28/05/2024 22:21, Keith Thompson wrote:
>>> David Brown <david.brown@hesbynett.no> writes:
>>>> On 28/05/2024 02:33, Keith Thompson wrote:
>>> [...]
>>>>> Without some kind of programmer control, I'm concerned that the rules
>>>>> for defining an array so #embed will be correctly optimized will be
>>>>> spread as lore rather than being specified anywhere.
>>>>
>>>> They might, but I really do not think that is so important, since they
>>>> will not affect the generated results.
>>> Right, it won't affect the generated results (assuming I use it
>>> correctly).  Unless I use `#embed optimize(true)` to initialize
>>> a struct with varying member sizes, but that's my fault because I
>>> asked for it.
>>
>> I am still not understanding your point.  (I am confident that you
>> have a point, even if I don't get it.)
>>
>> I cannot see why there would be any need or use of manually adding
>> optimisation hints or controls in the source code.  I cannot see why
>> there is any possibility of getting incorrect results in any way.
>>
>>> The point is compile-time performance, and perhaps even the ability
>>> to compile at all.
>>> I'm thinking about hypothetical cases where I want to embed a
>>> *very* large file and parsing the comma-delimited sequence could
>>> have unacceptable compile-time performance, perhaps even causing
>>> a compile-time stack overflow depending on how the parser works.
>>> Every time the compiler sees #embed, it has to decide whether to
>>> optimize it or not, and the decision criteria are not specified
>>> anywhere (not at all in the standard, perhaps not clearly in the
>>> compiler's documentation).
>>>
>>
>> Yes, I agree with that.  And this is how it should be - this is not
>> something that should be specified.  The C standards give minimum
>> requirements for things like the number of identifiers or the length
>> of lines.  But pretty much all compilers, for most of the "translation
>> limits", say they are "limited by the memory of the host computer".
>> The same will apply to #embed.  And some compilers will cope better
>> than others with huge #embed's, some will be faster, some more memory
>> efficient.  Some will change from version to version.  This is not
>> something that can sensibly be specified or formalized - like pretty
>> much everything in regard to compilation time, each compiler does the
>> best it can without any specifications.  I'd expect compiler reference
>> manuals might have hints, such as saying #embed is fastest with
>> unsigned char arrays (or whatever), but no more than that.
>>
>> But again - I see no reason for manual optimisation hints, and no
>> reason for any possible errors.
>>
>> Let me outline a possible strategy for a compiler like gcc.  (I have
>> not looked at the prototype implementations from thephd, nor any gcc
>> developer discussions.)
>>
>> gcc splits the C pre-processor and the compiler itself, and
>> (currently) communicates dataflow in only one direction, via a
>> temporary file or a pipe.  But the "gcc" (or "g++", according to
>> preference) driver program calls and coordinates the two programs.
>>
>> If the pre-processor is called stand-alone, then it will generate a
>> comma-separated list of integers, helpfully split over multiple lines
>> of reasonable size.  This will clearly always be correct, and always
>> work, within limits of a compiler's translation limits.
>>
>> But when the gcc driver calls it, it will have a flag indicating that
>> the target compiler is gcc and supports an extended pre-processed
>> syntax (and also that the source is C23 - after all, the C
>> pre-processor can be used as a macro processor for other files with no
>> relation to C).  Now the pre-processor has a lot more freedom.
>> Whenever it meets an #embed directive, it can generate a line :
>>
>> #embed_data 123456
>>
>> followed in the file by 123456 (or whatever) bytes of binary data.
>> The C compiler, when parsing this file, will pull that in as a single
>> blob. Then it is up to the C compiler - which knows how the #embed
>> data will be used - to tell if these bytes should be used as
>> parameters to a macro, initialisation for a char array, or whatever.
>> And it can use them as efficiently as practically possible.  (It is
>> probably only worth using this for #embed data over a certain size -
>> smaller #embed's could just generate the integer sequences.)
>>
>> Nowhere in this is there any call for manual optimisation hints, nor
>> any risk of incorrect results.
> 
> I've kept this on the back burner for a couple of weeks.  I'm finally
> getting around to posting a followup.
> 
> I'm not particularly concerned about compilers processing #embed
> incorrectly.  It's conceivable that a compiler could incorrectly decide
> that it can optimize a particular #embed directive, but I expect
> compilers to be conservative, falling back to the specified behavior if
> they can't *prove* that an optimization is safe.
> 

I'd expect that too.  (Of course there's always the risk of bugs with 
weird use-cases.)
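
As a sketch of the blob handoff outlined in the quoted strategy above, the compiler side might read a record like this.  Note that "#embed_data" is the made-up directive from that outline, not an existing gcc feature, and the struct and function names here are hypothetical too.  It only shows that picking up N raw bytes is a single bounds check rather than N tokens to parse.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* A view onto the raw bytes of one record; nothing is copied. */
struct embed_blob {
    const unsigned char *data;
    size_t size;
};

/* Parse "#embed_data <N>\n" followed by N raw bytes at the start of buf.
   Returns the number of bytes consumed, or 0 if buf does not start
   with a complete, well-formed record.  (Overflow of the decimal
   count is not checked in this sketch.) */
static size_t read_embed_blob(const char *buf, size_t len,
                              struct embed_blob *out)
{
    static const char tag[] = "#embed_data ";
    const size_t tag_len = sizeof tag - 1;
    size_t pos = tag_len;
    size_t count = 0;

    assert(buf != NULL && out != NULL);

    if (len < tag_len || memcmp(buf, tag, tag_len) != 0)
        return 0;

    /* The byte count: at least one decimal digit. */
    if (pos >= len || buf[pos] < '0' || buf[pos] > '9')
        return 0;
    while (pos < len && buf[pos] >= '0' && buf[pos] <= '9') {
        count = count * 10 + (size_t)(buf[pos] - '0');
        pos++;
    }

    /* The header line must end with a newline. */
    if (pos >= len || buf[pos] != '\n')
        return 0;
    pos++;

    /* The payload is taken as-is, whatever bytes it contains. */
    if (len - pos < count)
        return 0;

    out->data = (const unsigned char *)buf + pos;
    out->size = count;
    return pos + count;
}
```

The caller would resume normal tokenising at the returned offset; anything that is not such a record returns 0 and goes through the ordinary path.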

> I see two conceptual problems with #embed as it's currently defined in
> N3220.
> 
> First, there's a possible compile-time performance issue for very large
> embedded files.  The (draft) standard calls for #embed to expand to a
> comma-separated list of integer constant expressions.  (I'm not sure why
> it didn't specify integer constants.)
> 
> My objection is based on the possibility that #embed for a *very* large
> file might result in unacceptable time and memory usage during compile
> time.  I haven't looked into how existing compilers handle large
> initializers, but I can imagine that parsing such a list might consume
> more than O(N) time and/or memory, or at least O(N) with a large
> constant.  (If parsing long lists of integer constants is expensive for
> some compiler, this could be a motivation to optimize that particular
> case.)

The point of #embed is to get O(N) scaling - or at least, much closer to 
that than compilers do today with an #include of a list of numbers (or 
even a string literal).  There is little doubt that a big enough #embed 
file will consume time and memory that is unacceptable, at least for 
some people - all you need is to pick a file bigger than your computer's 
memory, and you can be reasonably confident that it will be problematic.  
But it also seems reasonable to expect that if a file is big enough to 
cause trouble for #embed, then any other method of including it in a C 
file will be at least as bad and probably /much/ worse.

At worst, #embed is going to be no less efficient than today's solution, 
and at best it will be significantly more efficient.  I don't think it 
is fair to object to it because a given implementation might not reach 
theoretical optimum efficiencies.
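
To give a rough feel for the constant factor in the textual form: every byte becomes one to three digits plus a comma, and each number and comma is a separate token for the parser.  A minimal sketch that counts the characters of that text form (the helper name is invented for illustration, and it assumes 8-bit bytes):

```c
#include <assert.h>
#include <stddef.h>

/* Number of characters in the decimal, comma-separated text form of
   n bytes, with no spaces or line breaks.
   For example {0, 10, 255} is written "0,10,255", which is 8 characters.
   Assumes byte values in the range 0..255. */
static size_t expanded_text_length(const unsigned char *data, size_t n)
{
    size_t total = 0;

    assert(data != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        /* One, two or three decimal digits for this byte. */
        total += data[i] >= 100 ? 3 : data[i] >= 10 ? 2 : 1;

        /* A comma after every element except the last. */
        if (i + 1 < n)
            total++;
    }
    return total;
}
```

So the text is somewhere between 2 and 4 characters per byte of input before the parser has built anything from it, which is still O(N) but with a noticeably larger constant than handling the bytes directly.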

> 
> The intent of #embed is to copy the contents of a file at compile time
> into an array of unsigned char -- but it's specified in a roundabout way
> that requires bizarre usages to work "correctly".  

That is one expected use, and will probably be the biggest use by a fair 
way, but it is not the only possible use.  The specification lets you 
have more flexibility.  For example, I have a project where I include a 
number of files in a structure with a number of unsigned char arrays, 
amongst other data - a simpler #embed solution that forced you to have 
an unsigned char array might not work with that.  (The project predates 
#embed and uses a Python script to generate the data.)
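
A minimal sketch of that kind of use, not the actual project: the struct layout and names are invented, and the source file embeds its own first bytes (via __FILE__) purely so that the example needs no separate data file.  Where the compiler has no #embed, or cannot find the file, it falls back to hand-written lists of the same shape.

```c
#include <assert.h>
#include <stddef.h>

/* Use #embed only if the compiler supports it and can find the file.
   The nested #if stops older preprocessors from ever having to parse
   the __has_embed(...) expression. */
#if defined(__has_embed)
#  if __has_embed(__FILE__) == __STDC_EMBED_FOUND__
#    define DEMO_CAN_EMBED 1
#  endif
#endif

/* Hypothetical layout: several byte arrays alongside other data. */
struct resource_pack {
    unsigned version;
    unsigned char magic[4];
    unsigned char preview[16];
};

static const struct resource_pack pack = {
    .version = 1,
#ifdef DEMO_CAN_EMBED
    /* Each #embed expands to a comma-separated list, so it can sit
       inside any brace-enclosed initialiser, not just a lone array. */
    .magic = {
#embed __FILE__ limit(4)
    },
    .preview = {
#embed __FILE__ limit(16)
    },
#else
    /* Stand-in lists with the same shape as an #embed expansion. */
    .magic = { 47, 42, 32, 85 },
    .preview = { 47, 42, 32, 85 },
#endif
};

static_assert(sizeof pack.magic == 4, "magic holds four bytes");
```

The limit parameter caps how many bytes are taken from the file, so the list cannot run past the end of a fixed-size member.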

> I expect at least
> some compilers to optimize #embed for better compile-time performance,
> but that requires them to determine when optimization is permitted with
> no advice from the standard about how to do that.  That's going to be
> moderately difficult for compiler implementers; I'm not too concerned
> about that.  But it also imposes a burden on programmers, who will have
> to use trial and error to determine how to ensure a #embed is optimized.
> 

I am entirely confident that major compiler vendors will optimise the 
case of initialising char arrays.  For anything else, who cares?  It is 
unlikely that you'd use #embed for other purposes with files that are 
big enough for unoptimised implementations to be unreasonably slow.  And 
if that does turn out to be a problem in practice, then you /know/ you 
have huge files and are doing something weird, and you can use something 
other than #embed for the purpose in the same way you do today.

Of prime importance is /correctness/ - #embed should give the results 
you expect, and I can't see that being a problem.  Outside that, #embed 
is always going to be at least as efficient as existing solutions, and 
usually much faster for cases that matter.

========== REMAINDER OF ARTICLE TRUNCATED ==========