Path: ...!3.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.lang.c
Subject: Re: Hex string literals (was Re: C23 thoughts and opinions)
Date: Tue, 18 Jun 2024 15:54:15 +0200
Organization: A noiseless patient Spider
Lines: 173
Message-ID: <v4s3i8$1cjdr$1@dont-email.me>
References: <v2l828$18v7f$1@dont-email.me>
 <00297443-2fee-48d4-81a0-9ff6ae6481e4@gmail.com>
 <v2lji1$1bbcp$1@dont-email.me> <87msoh5uh6.fsf@nosuchdomain.example.com>
 <f08d2c9f-5c2e-495d-b0bd-3f71bd301432@gmail.com>
 <v2nbp4$1o9h6$1@dont-email.me> <v2ng4n$1p3o2$1@dont-email.me>
 <87y18047jk.fsf@nosuchdomain.example.com>
 <87msoe1xxo.fsf@nosuchdomain.example.com> <v2sh19$2rle2$2@dont-email.me>
 <87ikz11osy.fsf@nosuchdomain.example.com> <v2v59g$3cr0f$1@dont-email.me>
 <87plt8yxgn.fsf@nosuchdomain.example.com> <v31rj5$o20$1@dont-email.me>
 <87cyp6zsen.fsf@nosuchdomain.example.com> <v34gi3$j385$1@dont-email.me>
 <874jahznzt.fsf@nosuchdomain.example.com> <v36nf9$12bei$1@dont-email.me>
 <87v82b43h6.fsf@nosuchdomain.example.com>
 <87iky830v7.fsf_-_@nosuchdomain.example.com> <v4p0dv$jeb2$1@dont-email.me>
 <87cyof14rd.fsf@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 18 Jun 2024 15:54:16 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="000ac22a82b477e7b73d30c4bbbc814d";
	logging-data="1461691"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18LuLvF6f8d8XEVGEfN4dzXL9u2iS7BylU="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.11.0
Cancel-Lock: sha1:/QDyZBvOWtwLiIDlKzv0xDCSLDk=
In-Reply-To: <87cyof14rd.fsf@nosuchdomain.example.com>
Content-Language: en-GB
Bytes: 9554

On 18/06/2024 02:19, Keith Thompson wrote:
> David Brown <david.brown@hesbynett.no> writes:
>> On 17/06/2024 01:48, Keith Thompson wrote:
> [...]
>>                                                             For binary,
>> the compaction is irrelevant and indeed counter-productive - binary
>> literals became a lot more practical with the introduction of digit
>> separators. (For standard C, these are from C23, but for C++ they came
>> in C++14, and compilers have supported them as extensions in C.)
> 
> I forgot about digit separators.
> 
> C23 adds the option to use apostrophes as separators in numeric
> constants: 123'456'789 or 0xdead'beef, for example.  (This is
> borrowed from C++.  Commas are more commonly used in real life,
> at least in my experience, but that wouldn't work given the other
> meanings of commas.)

Commas would be entirely unsuitable here, since half the world uses 
decimal commas rather than decimal points.  I think underscores are a 
nicer choice, used by many languages, but C++ could not use underscores 
due to their use in user-defined literals, and C followed C++.

> 
> I briefly considered that, for consistency, we might want to
> use apostrophes rather than spaces in hex string constants:
> 0x"de'ad'be'ef".  But since digit separators are purely decorative,
> and spaces in my proposed hex string literals are semantically
> significant (they terminate a byte), I'll stick with spaces.

I think you were using spaces as byte separators, whereas apostrophes 
should be completely ignored when parsing.

> 
> You could even write 0x"0 0 0 0" to denote 4 zero bytes (where
> 0x"0000" is 2 bytes) but 0x"00 00 00 00" or 0x"00000000" is probably
> clearer.
> 
> I think allowing both spaces and apostrophes would be too confusing.
> 

Fair enough.

>>> Octal
>>> string literals 0"012 345 670" *might* be worth considering.
>>
>> Most situations where octal could be useful died out many decades ago
>> - it is vastly more likely that "012" is intended to mean 12 than 10.
>> No serious programming language supports a leading 0 as an indication
>> of octal unless they are forced to do so by backwards compatibility,
>> and many that used to support them have dropped them.
>>
>> Having /some/ way to write octal can be helpful to old *nix
>> programmers who prefer 0640 to "S_IRUSR | S_IWUSR | S_IRGRP" in their
>> chmod calls. (And to be fair, the constant names made in ancient
>> history with short identifier length limits are pretty ugly.)  But it
>> is not something to be encouraged, and I think there is no simple
>> syntax that is obviously octal, and not easily mistaken for something
>> else.
> 
> There is, the proposed "0o" prefix.  It's already supported in both Perl
> and Python, and likely other languages.

Some languages apparently use 0q, because 0o might be confusing in some 
fonts.  I'm not sure I agree, and 0q is not very intuitive.  I'd rate 0o 
as vastly better than 0, but I would not bother with supporting it in a 
new feature like this.

> 
>>> <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3193.htm>
>>> proposes a new "0o123" syntax for octal constants; if that's adopted,
>>> I propose allowing 0o"..." and *not* 0"...".  I'm not sure whether
>>> to suggest hex only, or doing hex, octal, and binary for the sake
>>> of completeness.
>>
>> Binary support is useless, and octal support would be worse than
>> useless - even using an 0o rather than 0 prefix.  Completeness is not
>> a justification for repeating old mistakes or complicating a good idea
>> with features that will never be used.
> 
> I like binary integer constants (0b11001001), but I suppose I
> agree that they're not useful for larger chunks of data.  

Perhaps I am so used to binary and hex that I convert without thinking, 
and thus rarely need binary.

The one place I find binary useful is for bitmap fonts.  I use these a 
lot less than I used to, but sometimes you need to make new characters 
for an old-style low resolution LCD screen, and then binary constants 
can be useful.  Often, however, I prefer characters like . and @ rather 
than 0 and 1 as it makes the contrast much higher.

> I have no
> problem supporting only hex string literals, not binary or octal --
> but I'd have no problem with having all three if anyone thinks that
> would be sufficiently useful.
> 

Fair enough.

>>> What I'm trying to design here is a more straightforward way to
>>> represent raw (unsigned char[]) data in C code, largely but not
>>> exclusively for use by #embed.
>>
>> Personally, I'd see it as useful when /not/ using #embed.  I really do
>> not think programmers will care what format #embed uses.  I don't
>> share your concerns about efficiency of implementation, or that
>> programmers need to know when it is efficient or not.  In almost all
>> circumstances, C programmers never see or need to think about a
>> separation between a C preprocessor and a post-processed C compiler -
>> they are seen as a single entity, and can use whatever format is
>> convenient between them.  And once you ignore the implementation
>> details, which are an SEP, the way #embed is defined is better than a
>> definition using these new hex blob strings.
> 
> I think my main problem with the current #embed is that it's
> conceptually messy.  I'm probably an outlier in how much I care about
> that.
> 
> It's not clear whether the problems with the current definition of
> #embed are as serious as I suggest; you clearly think they aren't.  

I am still not convinced that there /are/ problems, never mind serious 
problems, nor that it is "conceptually messy".  (I'd care about that 
too, at least to some extent.)  I don't think the feature will lead to 
any dramatic changes in the way I work, but it could sometimes be 
convenient and avoid the need for external scripts or programs in a build 
file.

> But
> even if the current #embed is ok, I think adding hex string literals and
> adding a language defined embed parameter that specifies using hex
> string literals rather than a list of integer constant expressions would
> be useful.

Agreed.

>  Among other things, it lets the programmer specify that a
> given #embed is only to be used to initialize an array of unsigned char.
> 
> For example, given a 4-byte foo.dat containing bytes 1, 2, 3, and 4:
>      const unsigned char buf[] = {
>          #embed "foo.dat"
>      };
> would expand to something like:
>      const unsigned char buf[] = {
>          1, 2, 3, 4
>      };
> (and the same if buf is of type int[] or double[]), while this:
>      const unsigned char buf[] =
>          #embed "foo.dat" hex(true) // proposed new parameter
>      ;
> would expand to something like:
>      const unsigned char buf[] =
>          0x"01020304"
>      ;
> (and would result in an error if buf is of type int[] or double[]).
> 
> [...]
> 

I don't see the benefit here.  This is C - the programmer is expected to 
get the type right, and I think it would be rare to get it wrong (or at 
least wrong in a worse way than forgetting "unsigned") in a case like this.  So the 
extra type checking here has little or no benefit.  (In general, I am a 
========== REMAINDER OF ARTICLE TRUNCATED ==========