Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.lang.c
Subject: Re: Hex string literals (was Re: C23 thoughts and opinions)
Date: Mon, 17 Jun 2024 11:42:22 +0200
Organization: A noiseless patient Spider
Lines: 133
Message-ID: <v4p0dv$jeb2$1@dont-email.me>
References: <v2l828$18v7f$1@dont-email.me>
 <00297443-2fee-48d4-81a0-9ff6ae6481e4@gmail.com>
 <v2lji1$1bbcp$1@dont-email.me>
 <87msoh5uh6.fsf@nosuchdomain.example.com>
 <f08d2c9f-5c2e-495d-b0bd-3f71bd301432@gmail.com>
 <v2nbp4$1o9h6$1@dont-email.me>
 <v2ng4n$1p3o2$1@dont-email.me>
 <87y18047jk.fsf@nosuchdomain.example.com>
 <87msoe1xxo.fsf@nosuchdomain.example.com>
 <v2sh19$2rle2$2@dont-email.me>
 <87ikz11osy.fsf@nosuchdomain.example.com>
 <v2v59g$3cr0f$1@dont-email.me>
 <87plt8yxgn.fsf@nosuchdomain.example.com>
 <v31rj5$o20$1@dont-email.me>
 <87cyp6zsen.fsf@nosuchdomain.example.com>
 <v34gi3$j385$1@dont-email.me>
 <874jahznzt.fsf@nosuchdomain.example.com>
 <v36nf9$12bei$1@dont-email.me>
 <87v82b43h6.fsf@nosuchdomain.example.com>
 <87iky830v7.fsf_-_@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 17 Jun 2024 11:42:23 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="6eee6dfb82180fb756db1a7758fc4b5a"; logging-data="637282"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/qWYMitG7weMSW7O61nhfGfEww+mr/6Vo="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0
Cancel-Lock: sha1:1ro7DLAtvhBJQM7eVyiaeizhsoc=
In-Reply-To: <87iky830v7.fsf_-_@nosuchdomain.example.com>
Content-Language: en-GB
Bytes: 8433

On 17/06/2024 01:48, Keith Thompson wrote:
> Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
> [...]
>> uc"..."
>> string literals might be made even simpler, for example allowing
>> only hex digits and not requiring \x (uc"01020304" rather than
>> uc"\x01\x02\x03\x04"). That's probably overkill. uc"..." literals
>> could be useful in other contexts, and programmers will want
>> flexibility. Maybe something like hex"01020304" (embedded spaces could
>> be ignored) could be defined in addition to uc"\x01\x02\x03\x04".
> [...]
>
> *If* hexadecimal string literals were to be added to a future version
> of the language, I think I have a syntax that I like better than
> what I suggested.
>

I like your suggestion here. It's very similar to mine, though with a prefix 0x"..." rather than b"...". I'd be fine with either.

> Inspired by the existing syntax for integer and floating-point
> hex constants, I propose using a "0x" prefix. 0x"deadbeef" is an
> expression of type `const unsigned char[4]` (assuming CHAR_BIT==8),
> with values 0xde, 0xad, 0xbe, 0xef in that order. Byte order is
> irrelevant; we're specifying byte values in order, not bytes of
> the representation of some larger type. memcpy()ing 0x"deadbeef"
> to a uint32 might yield either 0xdeadbeef or 0xefbeadde (or other
> more exotic possibilities).
>
> Again, unlike other string literals, there is no implicit terminating
> null byte. And I suggest making them const, since there's no
> existing code to break.
>
> If CHAR_BIT==8, each byte is represented by two hex digits. More
> generally, each byte is represented by (CHAR_BIT+3)/4 hex digits in
> the absence of whitespace. Added whitespace marks the end of a byte.
> 0x"deadbeef" is 1, 2, 3, or 4 bytes if CHAR_BIT is 32, 16, 12, or 8
> respectively, but 0x"de ad be ef" is 4 bytes regardless of CHAR_BIT.
> 0x"" is a syntax error, since C doesn't support zero-length arrays.
> Anything between the quotes other than hex digits and spaces is a
> syntax error.

Fair enough.
>
> 0x"dead beef" is still 4 bytes if CHAR_BIT==8; the space forces the
> end of a byte, but the usage of spaces doesn't have to be consistent.
>
> This could be made more flexible by allowing various backslash
> escapes, but I'm not inclined to complicate it too much.

I would /definitely/ vote against any kind of backslash escapes here. That would mess up the simplicity of the syntax. There might be benefits in having standardised macros that generate multiple copies of a given hex string and that sort of thing.

>
> Note that the value of a (proposed) hex string literal is not a
> string unless it happens to end in zero. I still use the term
> "string literal" because it's closely tied to existing string
> literal syntax, and existing string literals don't necessarily
> represent strings anyway ("embedded\0null\0characters").
>
> Binary string literals 0b"11001001" might also be worth
> considering (that's of type `const unsigned char[1]`).

That is /highly/ unlikely to be useful. I work in the field that uses binary more than anywhere else, and where compilers have supported 0b11001001 format for binary literals from /long/ before they reached the C standards - and I have very rarely seen them in practice. When you do see them, they are in isolation - no one will write enough binary values in a row for such a format to be useful.

Hex strings are potentially useful because you are cutting { 0x12, 0x34, 0x45, 0x67 } to 0x"12344567", which is a fair bit more compact. For binary, the compaction is irrelevant and indeed counter-productive - binary literals became a lot more practical with the introduction of digit separators. (For standard C, these are from C23, but for C++ they came in C++14, and compilers have supported them as extensions in C.)

> Octal
> string literals 0"012 345 670" *might* be worth considering.

Most situations where octal could be useful died out many decades ago - it is vastly more likely that "012" is intended to mean 12 than 10.
No serious programming language supports a leading 0 as an indication of octal unless it is forced to do so by backwards compatibility, and many that used to support it have dropped it.

Having /some/ way to write octal can be helpful to old *nix programmers who prefer 0640 to "S_IRUSR | S_IWUSR | S_IRGRP" in their chmod calls. (And to be fair, the constant names made in ancient history with short identifier length limits are pretty ugly.) But it is not something to be encouraged, and I think there is no simple syntax that is obviously octal, and not easily mistaken for something else.

> <https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3193.htm>
> proposes a new "0o123" syntax for octal constants; if that's adopted,
> I propose allowing 0o"..." and *not* 0"...". I'm not sure whether
> to suggest hex only, or doing hex, octal, and binary for the sake
> of completeness.

Binary support is useless, and octal support would be worse than useless - even using an 0o rather than 0 prefix. Completeness is not a justification for repeating old mistakes or complicating a good idea with features that will never be used.

>
> What I'm trying to design here is a more straightforward way to
> represent raw (unsigned char[]) data in C code, largely but not
> exclusively for use by #embed.
>

Personally, I'd see it as useful when /not/ using #embed. I really do not think programmers will care what format #embed uses. I don't share your concerns about efficiency of implementation, or that programmers need to know when it is efficient or not. In almost all circumstances, C programmers never see or need to think about a separation between a C preprocessor and a post-processed C compiler - they are seen as a single entity, and can use whatever format is convenient between them. And once you ignore the implementation details, which are an SEP (somebody else's problem), the way #embed is defined is better than a definition using these new hex blob strings.
But I have seen situations where it is useful to have embedded blobs directly in the source file, and then a compact solution would be convenient. Currently most people use a list of hex constants, either byte for byte or sometimes in larger units, and hex strings like this would make it neater and more convenient. (Attempts to use current string literals for the purpose look more like corruption in the file than source code.)