From: Keith Thompson
Newsgroups: comp.lang.c
Subject: Re: Hex string literals (was Re: C23 thoughts and opinions)
Date: Mon, 17 Jun 2024 18:57:09 -0700
Organization: None to speak of
Message-ID: <878qz31096.fsf@nosuchdomain.example.com>

Richard Kettlewell writes:
> Keith Thompson writes:
>> Inspired by the existing syntax for integer and floating-point
>> hex constants, I propose using a "0x" prefix.  0x"deadbeef" is an
>> expression of type `const unsigned char[4]` (assuming CHAR_BIT==8),
>> with values 0xde, 0xad, 0xbe, 0xef in that order.  Byte order is
>> irrelevant; we're specifying byte values in order, not bytes of
>> the representation of some larger type.  memcpy()ing 0x"deadbeef"
>> to a uint32 might yield either 0xdeadbeef or 0xefbeadde (or other
>> more exotic possibilities).
>
> I like the syntax and I’d find it useful.
>
> There’s more to life than byte arrays, though, so I wonder if there’s
> more to be said here. I find myself dealing a lot with large integers
> generally represented as arrays of some unsigned type (commonly uint32_t
> but other possibilities arise too).
>
> In C as it stands today this requires a translation step before
> constants can be embedded in source code (which is error-prone if
> someone attempts to do it manually).
>
> So being able to say ‘0x8732456872648956348596893765836543 as array of
> uint64_t, LSW first’ (in some suitably C-like syntax) would be a big
> improvement from my perspective, primarily as an accelerator to
> development but also as a small improvement in robustness.

You could use some kind of type punning.  For example, this is
currently legal:

    union {
        unsigned char buf[4];
        uint32_t n;
    } obj = { .buf = { 0x01, 0x02, 0x03, 0x04 } };

The { 0x01, 0x02, 0x03, 0x04 } could be replaced with 0x"01020304".
Of course you have to deal with endianness.

Since C defines representation in terms of arrays of unsigned char,
I'm inclined to stick to just that.  If there's a *clean* way to
extend it to wider types, I'm ok with that (and I'm not the one who
needs to be convinced).

>> Again, unlike other string literals, there is no implicit terminating
>> null byte.  And I suggest making them const, since there's no
>> existing code to break.
>>
>> If CHAR_BIT==8, each byte is represented by two hex digits.  More
>> generally, each byte is represented by (CHAR_BIT+3)/4 hex digits in
>> the absence of whitespace.  Added whitespace marks the end of a byte.
>> 0x"deadbeef" is 1, 2, 3, or 4 bytes if CHAR_BIT is 32, 16, 12, or 8
>> respectively, but 0x"de ad be ef" is 4 bytes regardless of CHAR_BIT.
>> 0x"" is a syntax error, since C doesn't support zero-length arrays.
>> Anything between the quotes other than hex digits and spaces is a
>> syntax error.
>
> Would "0x1 23 45 67" be a syntax error or { 0x1, 0x23, 0x45, 0x67 }?

The latter.
As you acknowledge in a followup, the 0x goes outside the
quotation marks.

The end of a byte is indicated either by having the right number of
hex digits (2 if CHAR_BIT==8, more otherwise) or by a space character.
0x"1 23 45 67" would be equivalent to 0x"01 23 45 67", or to
0x"01234567" if CHAR_BIT==8.

>> What I'm trying to design here is a more straightforward way to
>> represent raw (unsigned char[]) data in C code, largely but not
>> exclusively for use by #embed.
>
> Compilers can already implement #embed however they like, there’s no
> need for a standardized way to represent the ‘inside’ of a #embed.

The (draft) standard already specifies what #embed expands to: a
comma-delimited sequence of integer constant expressions.  Compilers
must implement it in a way that yields the same behavior, including in
contrived cases like the struct example I posted before.  A compiler
might implement it more efficiently if it knows that the target object
is an array of unsigned char.

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */
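
The byte-splitting rule described above (a byte ends after
(CHAR_BIT+3)/4 hex digits or at a space, whichever comes first) can be
sketched as an ordinary C function.  This is only an illustration, not
proposal text: the name parse_hex_literal and its parameters are
hypothetical, CHAR_BIT is passed in as char_bit so other byte widths
can be tried, and two points the post leaves open are decided
arbitrarily here (a run of spaces counts as one separator, and values
are not range-checked against char_bit).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: splits the text between the quotes of a proposed
 * 0x"..." literal into byte values.  Supports char_bit up to 32.
 * Returns the number of bytes stored in out, or -1 on a syntax error
 * (empty literal, a character that is neither a hex digit nor a space)
 * or if out_cap is too small. */
static long parse_hex_literal(const char *s, unsigned char_bit,
                              unsigned long *out, size_t out_cap)
{
    const unsigned digits_per_byte = (char_bit + 3) / 4;
    size_t count = 0;        /* bytes emitted so far */
    unsigned ndigits = 0;    /* hex digits seen in the current byte */
    unsigned long value = 0; /* value of the current byte */

    for (;; s++) {
        const unsigned char c = (unsigned char)*s;
        const int separator = (c == ' ' || c == '\0');

        if (isxdigit(c)) {
            const unsigned d = isdigit(c) ? (unsigned)(c - '0')
                                          : (unsigned)(tolower(c) - 'a' + 10);
            value = value * 16 + d;
            ndigits++;
        } else if (!separator) {
            return -1;       /* only hex digits and spaces are allowed */
        }

        /* A byte ends when it is full, or at a space or the end of the
         * literal if at least one digit has been read. */
        if (ndigits == digits_per_byte || (separator && ndigits > 0)) {
            if (count == out_cap)
                return -1;
            out[count++] = value;
            value = 0;
            ndigits = 0;
        }

        if (c == '\0')
            break;
    }
    return count == 0 ? -1 : (long)count; /* 0x"" is a syntax error */
}
```

Under these assumptions, "deadbeef" splits into 4, 3, 2, or 1 bytes for
char_bit 8, 12, 16, or 32, "de ad be ef" is always 4 bytes, and
"1 23 45 67" gives 0x01, 0x23, 0x45, 0x67.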