From: bart
Newsgroups: comp.lang.c
Subject: Re: C23 thoughts and opinions
Date: Sat, 15 Jun 2024 20:27:41 +0100

On 15/06/2024 18:17, David Brown wrote:
> On 15/06/2024 00:39, bart wrote:
>> On 14/06/2024 22:30, Keith Thompson wrote:
>>
>>> Now that it's too late to change the definition, I've thought of
>>> something that I think would have been a better way to specify #embed.
>>>
>>> Define a new kind of string literal, with a "uc" prefix.  `uc"foo"` is
>>> of type `unsigned char[3]`.  (Or `const unsigned char[3]`, if that's not
>>> too radical.)  Unlike other string literals, there is no implicit
>>> terminating '\0'.  Arbitrary byte values can of course be specified in
>>> hexadecimal: uc"\x01\x02\x03\x04".  Since there's no terminating null
>>> character and C doesn't support zero-sized objects, uc"" is a syntax
>>> error.
>>>
>>> uc"..."
>>> string literals might be made even simpler, for example allowing
>>> only hex digits and not requiring \x (uc"01020304" rather than
>>> uc"\x01\x02\x03\x04").  That's probably overkill.  uc"..."  literals
>>> could be useful in other contexts, and programmers will want
>>> flexibility.  Maybe something like hex"01020304" (embedded spaces could
>>> be ignored) could be defined in addition to uc"\x01\x02\x03\x04".
>>
>> That's something I added to string literals in my language within the
>> last few months. Nothing to do with embedding (but it can make hex
>> sequences within strings more efficient, if that approach was used).
>>
>> Writing byte-at-a-time hex data was always a bit fiddly:
>>
>>      0x12, 0x34, 0xAB, ...
>>      "\x12\x34\xAB...
>>
>> It was made worse by my preference for `x` being in lower case, and
>> the hex digits in upper case, otherwise 0XABC or 0Xab or 0xab look wrong.
>>
>> What I did was create a new, variable-length string escape sequence
>> that looks like this:
>>
>>    "ABC\h1234AB...\nopq"     // hex sequence between ABC & nopq
>>
>> Hex digits after \h or \H are read in pairs. White space is allowed
>> between pairs:
>>
>>    "ABC\H 12 34 AB ...\nopq"
>>
>> The only thing I wasn't sure about was the closing backslash, which
>> looks at first like another escape code. But I think it is sound,
>> although it can still be tweaked.
>>
>>
>
> How often would something like that be useful?  I would have thought
> that it is rare to see something that is basically text but has enough
> odd non-printing characters (other than the common \n, \t, \e) to make
> it worth the fuss.  If you want to have binary data in something that
> looks like a string literal, then just use straight-up two hex digits
> per character - "4142431234ab".  It's simpler to generate and parse.  I
> don't see the benefit of something that mixes binary and text data.

That's not the same thing. That sequence "...1234..."
occupies 4 bytes (with values 49 50 51 52), not two bytes (with values
0x12 and 0x34, or 18 and 52).

Here's an example of wanting to print '€4.99', first in C (note that my
editor doesn't support Unicode so this stuff is needed):

     puts("\xE2\x82\xAC" "4.99");

The euro symbol occupies three bytes in UTF-8. It's awkward to type: it
has loads of backslashes, it keeps switching case and it needs more
concentration.

Plus I had to split the string since apparently \x doesn't stop at two
hex digits, it keeps going: it would have read \xAC4, which overflows
the 8-bit width of a character anyway, so I don't know what the point is
of reading more than 2 hex characters.

Using my feature, it looks like this:

     println "\H E2 82 AC\4.99"

There must be loads of examples of wanting to write many byte values
within strings, which in C can also be used to initialise byte arrays (a
useful feature I've now adopted; see below).

Here's another example, in my language: the first 128 bytes of an EXE
file, which are constant. It is currently defined like this, probably
created with a script:

     []byte stubdata = (
         0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00,
         0x04, 0x00, 0x00, 0x00, 0xFF, 0xFF, 0x00, 0x00,
         ...

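For comparison, here is roughly how the first 16 of those bytes, and the euro example, come out in plain C today. This is only a sketch: the names stub_list, stub_text, price, stub_spellings_match and price_length are made up for illustration and don't come from any real code.

```c
#include <stddef.h>
#include <string.h>

/* First 16 bytes of the stub, written as a list of byte constants. */
static const unsigned char stub_list[] = {
    0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00,
    0x04, 0x00, 0x00, 0x00, 0xFF, 0xFF, 0x00, 0x00
};

/* The same 16 bytes as a string literal. Every escape here is followed
   by a backslash or the closing quote, so no \x can absorb extra digits.
   The array gets one extra byte: the implicit terminating zero. */
static const unsigned char stub_text[] =
    "\x4D\x5A\x90\x00\x03\x00\x00\x00"
    "\x04\x00\x00\x00\xFF\xFF\x00\x00";

/* Euro sign in UTF-8 followed by "4.99". The literal is split because
   \x keeps reading hex digits, so "\xAC4.99" would be read as \xAC4. */
static const char price[] = "\xE2\x82\xAC" "4.99";

/* Returns 1 if both spellings hold the same 16 data bytes and the
   string form is exactly one byte (the terminator) longer. */
int stub_spellings_match(void)
{
    return sizeof stub_text == sizeof stub_list + 1
        && memcmp(stub_list, stub_text, sizeof stub_list) == 0;
}

/* Length of the price string in bytes: 3 for the euro sign plus 4. */
size_t price_length(void)
{
    return strlen(price);
}
```

The string form is shorter to type than the list, but it still needs a backslash and an x per byte, and it carries a zero terminator that the list form doesn't.
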
Using the new escape, I can just copy&paste a dump, and use a text
editor to put in the string context needed, which took under a minute:

     []byte stubdata=
         b"\H 4D 5A 90 00 03 00 00 00 04 00 00 00 FF FF 00 00\"+
         b"\H B8 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00\"+
         b"\H 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00\"+
         b"\H 00 00 00 00 00 00 00 00 00 00 00 00 80 00 00 00\"+
         b"\H 0E 1F BA 0E 00 B4 09 CD 21 B8 01 4C CD 21 54 68\"+
         b"\H 69 73 20 70 72 6F 67 72 61 6D 20 63 61 6E 6E 6F\"+
         b"\H 74 20 62 65 20 72 75 6E 20 69 6E 20 44 4F 53 20\"+
         b"\H 6D 6F 64 65 2E 0D 0D 0A 24 00 00 00 00 00 00 00\"+
         b"\H 50 45 00 00 64 86 04 00 00 00 00 00 00 00 00 00\"

(The 's'/'b' prefixes are needed for strings to have a type of (in C
terms) char[] rather than char*, a detail that C glosses over via some
magic. 's' gives you a zero terminator, 'b' as used here doesn't. The
"+" is used for compile-time string/data-string concatenation.)

In short, more is possible without needing to resort to tools. You can
directly work from a hex dump.
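
To make the pair-reading rule concrete (hex digits taken two at a time, white space allowed between pairs), here is a minimal C sketch that does the same job at run time on pasted dump text. It is not how the escape is implemented in my compiler; hex_value and hex_pairs_to_bytes are hypothetical names, and the error handling (return -1) is just one possible choice.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Value of one hex digit (either case), or -1 if c is not a hex digit. */
static int hex_value(int c)
{
    static const char digits[] = "0123456789ABCDEF";
    const char *p = c ? strchr(digits, toupper(c)) : NULL;
    return p ? (int)(p - digits) : -1;
}

/* Hypothetical helper: decode text such as "4D 5A 90 00" into out[].
   Digits are read in pairs; white space is allowed between pairs but
   not inside one. Returns the number of bytes written, or -1 on a
   malformed pair or if out[] (capacity cap) is too small. */
long hex_pairs_to_bytes(const char *text, unsigned char *out, size_t cap)
{
    size_t n = 0;
    for (;;) {
        while (isspace((unsigned char)*text))
            text++;                      /* skip gaps between pairs */
        if (*text == '\0')
            break;                       /* end of dump text */
        int hi = hex_value((unsigned char)text[0]);
        /* text[1] is only examined when text[0] was a valid digit */
        int lo = hi < 0 ? -1 : hex_value((unsigned char)text[1]);
        if (hi < 0 || lo < 0 || n >= cap)
            return -1;
        out[n++] = (unsigned char)(hi * 16 + lo);
        text += 2;
    }
    return (long)n;
}
```

Called on one row of the dump above, e.g. "4D 5A 90 00 03 00 00 00 04 00 00 00 FF FF 00 00", it fills a 16-byte buffer. The difference with the \H escape is that the conversion there happens at compile time, so no helper or tool is needed.
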