Path: ...!3.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith Thompson <Keith.S.Thompson+u@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: C23 thoughts and opinions
Date: Fri, 14 Jun 2024 14:30:13 -0700
Organization: None to speak of
Lines: 181
Message-ID: <87v82b43h6.fsf@nosuchdomain.example.com>
References: <v2l828$18v7f$1@dont-email.me> <00297443-2fee-48d4-81a0-9ff6ae6481e4@gmail.com> <v2lji1$1bbcp$1@dont-email.me> <87msoh5uh6.fsf@nosuchdomain.example.com> <f08d2c9f-5c2e-495d-b0bd-3f71bd301432@gmail.com> <v2nbp4$1o9h6$1@dont-email.me> <v2ng4n$1p3o2$1@dont-email.me> <87y18047jk.fsf@nosuchdomain.example.com> <87msoe1xxo.fsf@nosuchdomain.example.com> <v2sh19$2rle2$2@dont-email.me> <87ikz11osy.fsf@nosuchdomain.example.com> <v2v59g$3cr0f$1@dont-email.me> <87plt8yxgn.fsf@nosuchdomain.example.com> <v31rj5$o20$1@dont-email.me> <87cyp6zsen.fsf@nosuchdomain.example.com> <v34gi3$j385$1@dont-email.me> <874jahznzt.fsf@nosuchdomain.example.com> <v36nf9$12bei$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Date: Fri, 14 Jun 2024 23:30:19 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="10f324c947246626491173dedfdc5917"; logging-data="3225647"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+UQ6pJZ1ml4dJxt1vW9G5U"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:hZA5kHb0Jc9X1RXqbCQYuluopTA= sha1:Uvv6NGQXNfO2a1Irqe2gcPQv2Go=
Bytes: 11331

David Brown <david.brown@hesbynett.no> writes:
> On 28/05/2024 22:21, Keith Thompson wrote:
>> David Brown <david.brown@hesbynett.no> writes:
>>> On 28/05/2024 02:33, Keith Thompson wrote:
>> [...]
>>>> Without some kind of programmer control, I'm concerned that the rules
>>>> for defining an array so #embed will be correctly optimized will be
>>>> spread as lore rather than being specified anywhere.
>>>
>>> They might, but I really do not think that is so important, since they
>>> will not affect the generated results.
>> Right, it won't affect the generated results (assuming I use it
>> correctly). Unless I use `#embed optimize(true)` to initialize
>> a struct with varying member sizes, but that's my fault because I
>> asked for it.
>
> I am still not understanding your point. (I am confident that you
> have a point, even if I don't get it.)
>
> I cannot see why there would be any need or use of manually adding
> optimisation hints or controls in the source code. I cannot see why
> there is any possibility of getting incorrect results in any way.
>
>> The point is compile-time performance, and perhaps even the ability
>> to compile at all.
>> I'm thinking about hypothetical cases where I want to embed a
>> *very* large file and parsing the comma-delimited sequence could
>> have unacceptable compile-time performance, perhaps even causing
>> a compile-time stack overflow depending on how the parser works.
>> Every time the compiler sees #embed, it has to decide whether to
>> optimize it or not, and the decision criteria are not specified
>> anywhere (not at all in the standard, perhaps not clearly in the
>> compiler's documentation).
>>
>
> Yes, I agree with that. And this is how it should be - this is not
> something that should be specified. The C standards give minimum
> requirements for things like the number of identifiers or the length
> of lines. But pretty much all compilers, for most of the "translation
> limits", say they are "limited by the memory of the host computer".
> The same will apply to #embed. And some compilers will cope better
> than others with huge #embed's, some will be faster, some more memory
> efficient. Some will change from version to version.
> This is not something that can sensibly be specified or formalized -
> like pretty much everything in regard to compilation time, each
> compiler does the best it can without any specifications. I'd expect
> compiler reference manuals might have hints, such as saying #embed is
> fastest with unsigned char arrays (or whatever), but no more than that.
>
> But again - I see no reason for manual optimisation hints, and no
> reason for any possible errors.
>
> Let me outline a possible strategy for a compiler like gcc. (I have
> not looked at the prototype implementations from thephd, nor any gcc
> developer discussions.)
>
> gcc splits the C pre-processor and the compiler itself, and
> (currently) communicates dataflow in only one direction, via a
> temporary file or a pipe. But the "gcc" (or "g++", according to
> preference) driver program calls and coordinates the two programs.
>
> If the pre-processor is called stand-alone, then it will generate a
> comma-separated list of integers, helpfully split over multiple lines
> of reasonable size. This will clearly always be correct, and always
> work, within limits of a compiler's translation limits.
>
> But when the gcc driver calls it, it will have a flag indicating that
> the target compiler is gcc and supports an extended pre-processed
> syntax (and also that the source is C23 - after all, the C
> pre-processor can be used as a macro processor for other files with no
> relation to C). Now the pre-processor has a lot more freedom.
> Whenever it meets an #embed directive, it can generate a line :
>
>     #embed_data 123456
>
> followed in the file by 123456 (or whatever) bytes of binary data.
> The C compiler, when parsing this file, will pull that in as a single
> blob. Then it is up to the C compiler - which knows how the #embed
> data will be used - to tell if these bytes should be used as
> parameters to a macro, initialisation for a char array, or whatever.
> And it can use them as efficiently as practically possible. (It is
> probably only worth using this for #embed data over a certain size -
> smaller #embed's could just generate the integer sequences.)
>
> Nowhere in this is there any call for manual optimisation hints, nor
> any risk of incorrect results.

I've kept this on the back burner for a couple of weeks. I'm finally
getting around to posting a followup.

I'm not particularly concerned about compilers processing #embed
incorrectly. It's conceivable that a compiler could incorrectly decide
that it can optimize a particular #embed directive, but I expect
compilers to be conservative, falling back to the specified behavior if
they can't *prove* that an optimization is safe.

I see two conceptual problems with #embed as it's currently defined in
N3220.

First, there's a possible compile-time performance issue for very large
embedded files. The (draft) standard calls for #embed to expand to a
comma-separated list of integer constant expressions. (I'm not sure why
it didn't specify integer constants.)

My objection is based on the possibility that #embed for a *very* large
file might result in unacceptable time and memory usage during compile
time. I haven't looked into how existing compilers handle large
initializers, but I can imagine that parsing such a list might consume
more than O(N) time and/or memory, or at least O(N) with a large
constant. (If parsing long lists of integer constants is expensive for
some compiler, this could be a motivation to optimize that particular
case.)

The intent of #embed is to copy the contents of a file at compile time
into an array of unsigned char -- but it's specified in a roundabout way
that requires bizarre usages to work "correctly".

I expect at least some compilers to optimize #embed for better
compile-time performance, but that requires them to determine when
optimization is permitted with no advice from the standard about how to
do that.
That's going to be moderately difficult for compiler implementers; I'm
not too concerned about that. But it also imposes a burden on
programmers, who will have to use trial and error to determine how to
ensure a #embed is optimized.

This all assumes that a naive #embed implementation is going to cause
real problems for very large embedded files (compile-time stack
overflows, unreasonably long compile times, or just using so much memory
that system performance is affected). If it turns out that this isn't
the case, then that objection is mostly addressed.

My other objection is that it's conceptually messy. The expected use
case is in an initializer for an array of unsigned char, but there are
no restrictions on where it can be used. As a programmer, I want to
copy a file verbatim into an unsigned char array, but at least
conceptually #embed translates the file contents into a long sequence of
expressions which are then processed as C code to recreate the raw data.
There are bizarre cases (like my previous example initializing a struct
with members of various types) that are required to work. #embed is a
preprocessor directive, but determining whether it can be optimized
requires feedback from later compiler phases. It's doable, but it's
*ugly*.

Now that it's too late to change the definition, I've thought of
something that I think would have been a better way to specify #embed.

Define a new kind of string literal, with a "uc" prefix. `uc"foo"` is
of type `unsigned char[3]`. (Or `const unsigned char[3]`, if that's not
too radical.) Unlike other string literals, there is no implicit
terminating '\0'. Arbitrary byte values can of course be specified in
hexadecimal: uc"\x01\x02\x03\x04". Since there's no terminating null
character and C doesn't support zero-sized objects, uc"" is a syntax
error.

uc"..." string literals might be made even simpler, for example allowing
only hex digits and not requiring \x (uc"01020304" rather than
uc"\x01\x02\x03\x04").
That's probably overkill. uc"..." literals could be useful in other
contexts, and programmers will want flexibility. Maybe something like
hex"01020304" (embedded spaces could be ignored) could be defined in
addition to uc"\x01\x02\x03\x04".

Specify that #embed expands to a sequence of one or more uc string
literals (or hex string literals if that's added), separated by
whitespace.

If the embedded file might be empty, use the existing if_empty() embed
parameter. Without if_empty, #embed of an empty file will expand to
uc"", a syntax error.

Since a string literal is a single token, parsing it is likely to be

========== REMAINDER OF ARTICLE TRUNCATED ==========