Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.lang.c
Subject: Re: Hex string literals (was Re: C23 thoughts and opinions)
Date: Mon, 17 Jun 2024 11:42:22 +0200
Organization: A noiseless patient Spider
Lines: 133
Message-ID: <v4p0dv$jeb2$1@dont-email.me>
References: <v2l828$18v7f$1@dont-email.me>
 <00297443-2fee-48d4-81a0-9ff6ae6481e4@gmail.com>
 <v2lji1$1bbcp$1@dont-email.me> <87msoh5uh6.fsf@nosuchdomain.example.com>
 <f08d2c9f-5c2e-495d-b0bd-3f71bd301432@gmail.com>
 <v2nbp4$1o9h6$1@dont-email.me> <v2ng4n$1p3o2$1@dont-email.me>
 <87y18047jk.fsf@nosuchdomain.example.com>
 <87msoe1xxo.fsf@nosuchdomain.example.com> <v2sh19$2rle2$2@dont-email.me>
 <87ikz11osy.fsf@nosuchdomain.example.com> <v2v59g$3cr0f$1@dont-email.me>
 <87plt8yxgn.fsf@nosuchdomain.example.com> <v31rj5$o20$1@dont-email.me>
 <87cyp6zsen.fsf@nosuchdomain.example.com> <v34gi3$j385$1@dont-email.me>
 <874jahznzt.fsf@nosuchdomain.example.com> <v36nf9$12bei$1@dont-email.me>
 <87v82b43h6.fsf@nosuchdomain.example.com>
 <87iky830v7.fsf_-_@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 17 Jun 2024 11:42:23 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="6eee6dfb82180fb756db1a7758fc4b5a";
	logging-data="637282"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/qWYMitG7weMSW7O61nhfGfEww+mr/6Vo="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.11.0
Cancel-Lock: sha1:1ro7DLAtvhBJQM7eVyiaeizhsoc=
In-Reply-To: <87iky830v7.fsf_-_@nosuchdomain.example.com>
Content-Language: en-GB
Bytes: 8433

On 17/06/2024 01:48, Keith Thompson wrote:
> Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
> [...]
>> uc"..." string literals might be made even simpler, for example allowing
>> only hex digits and not requiring \x (uc"01020304" rather than
>> uc"\x01\x02\x03\x04").  That's probably overkill.  uc"..."  literals
>> could be useful in other contexts, and programmers will want
>> flexibility.  Maybe something like hex"01020304" (embedded spaces could
>> be ignored) could be defined in addition to uc"\x01\x02\x03\x04".
> [...]
> 
> *If* hexadecimal string literals were to be added to a future version
> of the language, I think I have a syntax that I like better than
> what I suggested.
> 

I like your suggestion here.  It's very similar to mine, though with a 
prefix 0x"..." rather than b"...".  I'd be fine with either.

> Inspired by the existing syntax for integer and floating-point
> hex constants, I propose using a "0x" prefix.  0x"deadbeef" is an
> expression of type `const unsigned char[4]` (assuming CHAR_BIT==8),
> with values 0xde, 0xad, 0xbe, 0xef in that order.  Byte order is
> irrelevant; we're specifying byte values in order, not bytes of
> the representation of some larger type.  memcpy()ing 0x"deadbeef"
> to a uint32 might yield either 0xdeadbeef or 0xefbeadde (or other
> more exotic possibilities).
> 
> Again, unlike other string literals, there is no implicit terminating
> null byte.  And I suggest making them const, since there's no
> existing code to break.
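
To make the byte-order point concrete, here is a minimal sketch in today's syntax.  The array `blob` and both function names are made up for illustration, CHAR_BIT == 8 is assumed, and the initializer list only stands in for what 0x"deadbeef" would denote under the proposal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the proposed 0x"deadbeef" (assuming CHAR_BIT == 8):
   four bytes, in this order, with no terminating null. */
static const unsigned char blob[4] = { 0xde, 0xad, 0xbe, 0xef };

/* Copy the four bytes into a 32-bit object.  The resulting value depends
   on the byte order of the implementation. */
uint32_t blob_as_u32(void)
{
    uint32_t value;
    assert(sizeof value == sizeof blob);
    memcpy(&value, blob, sizeof value);
    return value;
}

/* Byte-order-independent reading: treat blob[0] as the most significant
   byte and assemble the value arithmetically. */
uint32_t blob_as_u32_msb_first(void)
{
    return ((uint32_t)blob[0] << 24) | ((uint32_t)blob[1] << 16)
         | ((uint32_t)blob[2] << 8)  |  (uint32_t)blob[3];
}
```

On a big-endian implementation blob_as_u32() gives 0xdeadbeef and on a little-endian one 0xefbeadde, while blob_as_u32_msb_first() gives 0xdeadbeef on both.
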
> 
> If CHAR_BIT==8, each byte is represented by two hex digits.  More
> generally, each byte is represented by (CHAR_BIT+3)/4 hex digits in
> the absence of whitespace.  Added whitespace marks the end of a byte.
> 0x"deadbeef" is 1, 2, 3, or 4 bytes if CHAR_BIT is 32, 16, 12, or 8
> respectively, but 0x"de ad be ef" is 4 bytes regardless of CHAR_BIT.
> 0x"" is a syntax error, since C doesn't support zero-length arrays.
> Anything between the quotes other than hex digits and spaces is a
> syntax error.

Fair enough.
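
The rules quoted above are simple enough to model at run time.  The following is only a sketch: hex_blob_decode() and HEX_DIGITS_PER_BYTE are hypothetical names, not part of any proposal, and a real implementation would do this at translation time.  One assumption goes beyond the quoted text: a group of digits cut short by a space or by the end of the text counts as a complete byte.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stddef.h>

/* Hex digits per byte under the quoted rule: 2 when CHAR_BIT == 8. */
#define HEX_DIGITS_PER_BYTE ((CHAR_BIT + 3) / 4)

/* Hypothetical helper: decode `text` as the quoted rules describe and
   store the bytes in out[0..cap-1].  Returns the number of bytes, or 0 on
   error (no bytes at all, a character that is neither a hex digit nor a
   space, or more than `cap` bytes).
   Assumption: a short group ended by a space or by the end of the text
   is a complete byte. */
size_t hex_blob_decode(const char *text, unsigned char *out, size_t cap)
{
    size_t count = 0;       /* bytes completed so far */
    unsigned value = 0;     /* value of the byte being built */
    int digits = 0;         /* hex digits consumed for that byte */

    assert(text != NULL && out != NULL);

    for (const char *p = text; ; p++) {
        unsigned char c = (unsigned char)*p;
        int at_end = (c == '\0');

        if (!at_end && c != ' ' && !isxdigit(c))
            return 0;       /* neither a hex digit nor a space */

        if (!at_end && c != ' ') {
            int d = isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
            value = value * 16u + (unsigned)d;
            digits++;
        }

        /* A byte ends at a space, at the end of the text, or when full. */
        if (digits > 0
            && (at_end || c == ' ' || digits == HEX_DIGITS_PER_BYTE)) {
            if (count == cap)
                return 0;   /* output buffer too small */
            out[count++] = (unsigned char)value;
            value = 0;
            digits = 0;
        }
        if (at_end)
            break;
    }
    return count;           /* 0 here means the text held no bytes */
}
```

With CHAR_BIT == 8, "deadbeef", "de ad be ef" and "dead beef" all decode to the same four bytes, and the empty string is rejected, matching the description.
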

> 
> 0x"dead beef" is still 4 bytes if CHAR_BIT==8; the space forces the
> end of a byte, but the usage of spaces doesn't have to be consistent.
> 
> This could be made more flexible by allowing various backslash
> escapes, but I'm not inclined to complicate it too much.

I would /definitely/ vote against any kind of backslash escapes here. 
That would mess up the simplicity of the syntax.

There might be benefits in having standardised macros that generate 
multiple copies of a given hex string and that sort of thing.

> 
> Note that the value of a (proposed) hex string literal is not a
> string unless it happens to end in zero.  I still use the term
> "string literal" because it's closely tied to existing string
> literal syntax, and existing string literals don't necessarily
> represent strings anyway ("embedded\0null\0characters").
> 
> Binary string literals 0b"11001001" might also be worth
> considering (that's of type `const unsigned char[1]`).  

That is /highly/ unlikely to be useful.  I work in the field that uses 
binary more than anywhere else, and where compilers have supported 
0b11001001 format for binary literals from /long/ before they reached 
the C standards - and I have very rarely seen them in practice.  When 
you do see them, they are in isolation - no one will write enough binary 
values in a row for such a format to be useful.  Hex strings are 
potentially useful because you are cutting { 0x12, 0x34, 0x45, 0x67 } to 
0x"12344567", which is a fair bit more compact.  For binary, the 
compaction is irrelevant and indeed counter-productive - binary literals 
became a lot more practical with the introduction of digit separators. 
(For standard C, these are from C23, but for C++ they came in C++14, and 
compilers have supported them as extensions in C.)


> Octal
> string literals 0"012 345 670" *might* be worth considering.

Most situations where octal could be useful died out many decades ago - 
it is vastly more likely that "012" is intended to mean 12 than 10.  No 
serious programming language supports a leading 0 as an indication of 
octal unless it is forced to do so by backwards compatibility, and 
many that used to support it have dropped it.

Having /some/ way to write octal can be helpful to old *nix programmers 
who prefer 0640 to "S_IRUSR | S_IWUSR | S_IRGRP" in their chmod calls. 
(And to be fair, the constant names made in ancient history with short 
identifier length limits are pretty ugly.)  But it is not something to 
be encouraged, and I think there is no simple syntax that is obviously 
octal, and not easily mistaken for something else.

> <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3193.htm>
> proposes a new "0o123" syntax for octal constants; if that's adopted,
> I propose allowing 0o"..." and *not* 0"...".  I'm not sure whether
> to suggest hex only, or doing hex, octal, and binary for the sake
> of completeness.

Binary support is useless, and octal support would be worse than useless 
- even using an 0o rather than 0 prefix.  Completeness is not a 
justification for repeating old mistakes or complicating a good idea 
with features that will never be used.

> 
> What I'm trying to design here is a more straightforward way to
> represent raw (unsigned char[]) data in C code, largely but not
> exclusively for use by #embed.
> 

Personally, I'd see it as useful when /not/ using #embed.  I really do 
not think programmers will care what format #embed uses.  I don't share 
your concerns about efficiency of implementation, or that programmers 
need to know when it is efficient or not.  In almost all circumstances, 
C programmers never see or need to think about a separation between a C 
preprocessor and the compiler proper that follows it - they are seen as 
a single entity, and can use whatever format is convenient between 
them.  And once you ignore the implementation details, which are 
somebody else's problem, the way 
#embed is defined is better than a definition using these new hex blob 
strings.

But I have seen situations where it is useful to have embedded blobs 
directly in the source file, and then a compact solution would be 
convenient.  Currently most people use a list of hex constants, either 
byte for byte or sometimes in larger units, and hex strings like this 
would make it neater and more convenient.  (Attempts to use current 
string literals for the purpose look more like corruption in the file 
than source code.)
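
As a rough illustration of the two current options (the array and function names are mine, chosen only for the example):

```c
#include <assert.h>
#include <string.h>

/* The usual form today: a list of hex constants, one per byte. */
static const unsigned char blob_list[] = { 0x12, 0x34, 0x45, 0x67 };

/* The same bytes as an ordinary string literal.  With an implicit array
   size the terminating null becomes a fifth element. */
static const unsigned char blob_str[] = "\x12\x34\x45\x67";

static_assert(sizeof blob_list == 4, "one element per listed constant");
static_assert(sizeof blob_str == 5, "four bytes plus the terminating null");

/* Returns 1 if the first four bytes of both arrays are identical. */
int blobs_match(void)
{
    return memcmp(blob_list, blob_str, sizeof blob_list) == 0;
}
```

The string form is shorter per byte than the list but carries the extra null and the \x noise; a hex string such as 0x"12344567" would be shorter than either.
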