Article <87plt8yxgn.fsf@nosuchdomain.example.com>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith Thompson <Keith.S.Thompson+u@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: C23 thoughts and opinions
Date: Sun, 26 May 2024 16:17:44 -0700
Organization: None to speak of
Lines: 349
Message-ID: <87plt8yxgn.fsf@nosuchdomain.example.com>
References: <v2l828$18v7f$1@dont-email.me>
	<00297443-2fee-48d4-81a0-9ff6ae6481e4@gmail.com>
	<v2lji1$1bbcp$1@dont-email.me>
	<87msoh5uh6.fsf@nosuchdomain.example.com>
	<f08d2c9f-5c2e-495d-b0bd-3f71bd301432@gmail.com>
	<v2nbp4$1o9h6$1@dont-email.me> <v2ng4n$1p3o2$1@dont-email.me>
	<87y18047jk.fsf@nosuchdomain.example.com>
	<87msoe1xxo.fsf@nosuchdomain.example.com>
	<v2sh19$2rle2$2@dont-email.me>
	<87ikz11osy.fsf@nosuchdomain.example.com>
	<v2v59g$3cr0f$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 27 May 2024 01:17:50 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="a446b217e538d5d14d4fc3ccaf9433bd";
	logging-data="3858400"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+wDJR0ykT7NQcAmEuKZ6F2"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:h4Hw4WM4hvjGAlC+dqlIUWbVz90=
	sha1:YrVLGpwB6kKAYwH1LW0SHLExabk=
Bytes: 17829

David Brown <david.brown@hesbynett.no> writes:
> On 26/05/2024 00:58, Keith Thompson wrote:
>> David Brown <david.brown@hesbynett.no> writes:
>>> On 25/05/2024 03:29, Keith Thompson wrote:
>>>> Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
>>>>> David Brown <david.brown@hesbynett.no> writes:
>>>>>> On 23/05/2024 14:11, bart wrote:
>>>>> [...]
>>>>>>> 'embed' was discussed a few months ago. I disagreed with the poor
>>>>>>> way it was to be implemented: 'embed' notionally generates a list of
>>>>>>> comma-separated numbers as tokens, where you have to take care of
>>>>>>> any trailing zero yourself if needed. It would also be hopelessly
>>>>>>> inefficient if actually implemented like that.
>>>>>>
>>>>>> Fortunately, it is /not/ actually implemented like that - it is only
>>>>>> implemented "as if" it were like that.  Real prototype implementations
>>>>>> (for gcc and clang - I don't know about other tools) are extremely
>>>>>> efficient at handling #embed.  And the comma-separated numbers can be
>>>>>> more flexible in less common use-cases.
>>>>> [...]
>>>>>
>>>>> I'm aware of a proposed implementation for clang:
>>>>>
>>>>> https://github.com/llvm/llvm-project/pull/68620
>>>>> https://github.com/ThePhD/llvm-project
>>>>>
>>>>> I'm currently cloning the git repo, with the aim of building it so I can
>>>>> try it out and test some corner cases.  It will take a while.
>>>>>
>>>>> I'm not aware of any prototype implementation for gcc.  If you are, I'd
>>>>> be very interested in trying it out.
>>>>>
>>>>> (And thanks for starting this thread!)
>>>> I've built this from source, and it mostly works.  I haven't seen it
>>>> do any optimization; the `#embed` directive expands to a sequence of
>>>> comma-separated integer constants.
>>>>
>>>> Which means that this:
>>>>
>>>> #include <stdio.h>
>>>> int main(void) {
>>>>       struct foo {
>>>>           unsigned char a;
>>>>           unsigned short b;
>>>>           unsigned int c;
>>>>           double d;
>>>>       };
>>>>       struct foo obj = {
>>>> #embed "foo.dat"
>>>>       };
>>>>       printf("a=%d b=%d c=%u d=%f\n", obj.a, obj.b, obj.c, obj.d);
>>>> }
>>>>
>>>> given "foo.dat" containing bytes with values 1, 2, 3, and 4, produces
>>>> this output:
>>>>
>>>> a=1 b=2 c=3 d=4.000000
>>>
>>> That is what you would expect by the way #embed is specified.  You
>>> would not expect to see any "optimisation", since optimisations should
>>> not change the results (apart from choosing between alternative
>>> valid results).
>>>
>>> Where you will see the optimisation difference is between :
>>>
>>> 	const int xs[] = {
>>> #embed "x.dat"
>>> 	};
>>>
>>> and
>>>
>>> 	const int xs[] = {
>>> #include "x.csv"
>>> 	};
>>>
>>>
>>> where "x.dat" is a large binary file, and "x.csv" is the same data as
>>> comma-separated values.  The #embed version will compile very much
>>> faster, using far less memory.  /That/ is the optimisation.
>> Why would it compile faster?  #embed expands to something similar to
>> CSV, which still has to be parsed.
>
> No, it does /not/.  That's the /whole/ point of #embed, and the main
> motivation for its existence.  People have always managed to embed 
> binary source files into their binary output files - using linker
> tricks, or using xxd or other tools (common or specialised) to turn 
> binary files into initialisers for constant arrays (or structs).  I've
> done so myself on many projects, all integrated together in makefiles.
>
> #embed has two purposes.  One is to save you from using external tools
> for that kind of thing.  The other is to do it more efficiently for
> big files.
>
> There are two ways this is done for examples like this.  One is that
> the compiler does /not/ turn each byte into a series of ASCII
> digits for the number, then parse that number to get back to a byte.
> It jumps straight from byte in to byte out, possibly after expanding
> to a bigger type size if necessary.  Secondly, compilers typically
> track lots more information about each initialiser - such as the file,
> line and column number so that it can give you helpful messages if
> there is a value out of range, or too many or too few initialisers.
> With #embed, the compiler doesn't have to do any of that.
>
> The compiler will generate results /as if/ it had expanded the file to
> a list of numbers and parsed them.  But it will not do that in
> practice. (At least, not for more serious implementations - simple
> solutions might do so to get support implemented quickly.)

I'll start by acknowledging that the prototype implementation apparently
*does* optimize #embed when it can.  I was mistaken on that point.

#embed *must* expand to the standard-defined comma-delimited sequence in
*some* cases.

Which means that the piece of the compiler that implements #embed has to
recognize when it must generate that sequence, and when it can do
something more efficient.
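
To make the distinction concrete, here is a minimal sketch of three
contexts.  It does not use #embed itself, so it builds with compilers
that lack the directive: the macro FOO_DAT_BYTES is a hypothetical
stand-in for the sequence that `#embed "foo.dat"` behaves as if it
produced for a file holding the bytes 1, 2, 3, 4.  The names embed_raw,
embed_obj, sum4 and sum_embedded are likewise invented for illustration.
Only the first context looks like an obvious candidate for a bulk copy;
in the other two each value is a separate initializer or argument.

```c
/* Hypothetical stand-in for `#embed "foo.dat"`, assuming foo.dat holds
   the four bytes 1, 2, 3, 4.  Under the as-if rule the directive
   behaves like this comma-delimited token sequence. */
#define FOO_DAT_BYTES 1, 2, 3, 4

struct foo {
    unsigned char a;
    unsigned short b;
    unsigned int c;
    double d;
};

/* Context 1: array of unsigned char.  Every value lands in one byte,
   so an implementation could plausibly copy the file contents directly. */
const unsigned char embed_raw[] = { FOO_DAT_BYTES };

/* Context 2: struct initializer.  Each value is converted to a
   different member type, so the values are needed one at a time. */
const struct foo embed_obj = { FOO_DAT_BYTES };

/* Context 3: function arguments.  Each value is a separate argument. */
static int sum4(int w, int x, int y, int z)
{
    return w + x + y + z;
}

int sum_embedded(void)
{
    return sum4(FOO_DAT_BYTES);
}
```

With these four bytes, sum_embedded() returns 10 and embed_obj.d is 4.0,
the same result as the struct example quoted earlier.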

>> Reference:
>> <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3220.pdf>
>> 6.10.4.
>>
>> The first one will probably initialize each int element of xs to a
>> single byte value extracted from x.dat.  Is that what you intended?
>
> Yes, if that's what the programmer wrote - though I agree that
> character types will be more common and will be the prime target for
> optimisation.
>
>> #embed works best with arrays of unsigned char.
>
> Sure, that will be a very common use.
>
>> If you mean that the #embed will expand to something other than the
>> sequence of integer constants, how does it know to do that in this
>> context?
>
> It knows because the compiler writers are actually quite smart.  The C
> standards may describe the translation process in a series of distinct 
> and independent phases, but that's not how it is done in practice.
> The key point is that the compiler knows how the sequence of integers
> is going to be used before it gets that far in the preprocessing.
>
> I'd expect implementations to have extremely fast implementations for
> initialising arrays of character types, and probably also for other 
> arrays of scalar types.  More complicated examples - such as
> parameters in a macro or function call - would probably use a
> fall-back of generating naïve lists of integer constants.

My problem is not just with how the compiler can figure out when it can
optimize, but how programmers are supposed to understand whatever rules
it uses.  Can I rely on the optimization being performed if I use a
typedef for unsigned char, or if I use an enumeration type whose
underlying type is unsigned char, or if I have initialization elements
before and after the #embed directive?

Effective use of #embed requires too much "magic" for my taste --
particularly having the preprocessor rely on information from later
phases.  The semantics of #embed don't rely on that information, but
efficient use for large files does.
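
The cases I have in mind can be sketched as follows.  As a hedge: this
uses a hypothetical macro FOO_DAT_BYTES in place of `#embed "foo.dat"`
(assumed to hold the bytes 1, 2, 3, 4) so that it compiles without
#embed support, and the names byte, plain, aliased, framed, octet and
payload_matches are made up for the example.  The enumeration variant
needs C23 syntax and is guarded accordingly.  The point is that all the
variants have the same well-defined contents; nothing in the source
tells you which of them, if any, gets the fast path.

```c
#include <string.h>

/* Hypothetical stand-in for `#embed "foo.dat"` (bytes 1, 2, 3, 4). */
#define FOO_DAT_BYTES 1, 2, 3, 4

typedef unsigned char byte;

/* Plain unsigned char: the case most likely to be optimized. */
const unsigned char plain[] = { FOO_DAT_BYTES };

/* Same element type, reached through a typedef. */
const byte aliased[] = { FOO_DAT_BYTES };

/* Extra initializers before and after the embedded data. */
const unsigned char framed[] = { 0xAA, FOO_DAT_BYTES, 0x55 };

#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 202311L
/* C23 only: enumeration whose underlying type is unsigned char. */
enum octet : unsigned char { OCTET_MIN = 0 };
const enum octet as_enum[] = { FOO_DAT_BYTES };
#endif

/* Returns 1 if every variant carries the same payload bytes. */
int payload_matches(void)
{
    return sizeof plain == sizeof aliased
        && memcmp(plain, aliased, sizeof plain) == 0
        && sizeof framed == sizeof plain + 2
        && memcmp(framed + 1, plain, sizeof plain) == 0;
}
```

The semantics are identical in every case, which is exactly why the
compile-time cost is invisible from the source.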

>> If you have a binary file containing a sequence of int values, you
>> can use #embed to initialize an unsigned char array that's aliased
>> with or copied to the int array.
>>
>> The *embed element width* is typically going to be CHAR_BIT bits by
>> default.  It can only be changed by an *implementation-defined* embed
>> parameter.  It seems odd that there's no standard way to specify the
>> element width.
>>
>> It seems even more odd that the embed element width is
>> implementation defined and not set to CHAR_BIT by default.
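
A minimal sketch of the "copied to" approach, under stated assumptions:
INTS_DAT_BYTES is a hypothetical stand-in for `#embed "ints.dat"`, the
file is assumed to hold two 32-bit values, and CHAR_BIT is assumed to
be 8.  The byte values are chosen so that the result is the same on
big-endian and little-endian hosts; real data would be interpreted in
host byte order.  The names ints_raw, INTS_COUNT and load_ints are
invented for the example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for `#embed "ints.dat"`: eight bytes that
   encode two 32-bit values.  Each value repeats one byte four times,
   so the decoded result does not depend on byte order. */
#define INTS_DAT_BYTES 1, 1, 1, 1, 2, 2, 2, 2

static const unsigned char ints_raw[] = { INTS_DAT_BYTES };

/* Number of whole 32-bit values in the embedded data. */
enum { INTS_COUNT = sizeof ints_raw / sizeof(uint32_t) };

/* Copy the embedded bytes into properly typed and aligned storage.
   memcpy avoids the alignment and aliasing problems of casting the
   unsigned char array to uint32_t *. */
void load_ints(uint32_t out[INTS_COUNT])
{
    memcpy(out, ints_raw, INTS_COUNT * sizeof(uint32_t));
}
```

After load_ints(v), v[0] is 0x01010101 and v[1] is 0x02020202.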
>
> I agree.  But it may be left flexible for situations where the host
========== REMAINDER OF ARTICLE TRUNCATED ==========