Path: ...!3.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.lang.c
Subject: Re: C23 thoughts and opinions
Date: Sun, 16 Jun 2024 16:54:53 +0200
Organization: A noiseless patient Spider
Lines: 148
Message-ID: <v4mubu$3jg8$1@dont-email.me>
References: <v2l828$18v7f$1@dont-email.me>
 <00297443-2fee-48d4-81a0-9ff6ae6481e4@gmail.com>
 <v2lji1$1bbcp$1@dont-email.me> <87msoh5uh6.fsf@nosuchdomain.example.com>
 <f08d2c9f-5c2e-495d-b0bd-3f71bd301432@gmail.com>
 <v2nbp4$1o9h6$1@dont-email.me> <v2ng4n$1p3o2$1@dont-email.me>
 <87y18047jk.fsf@nosuchdomain.example.com>
 <87msoe1xxo.fsf@nosuchdomain.example.com> <v2sh19$2rle2$2@dont-email.me>
 <87ikz11osy.fsf@nosuchdomain.example.com> <v2v59g$3cr0f$1@dont-email.me>
 <87plt8yxgn.fsf@nosuchdomain.example.com> <v31rj5$o20$1@dont-email.me>
 <87cyp6zsen.fsf@nosuchdomain.example.com> <v34gi3$j385$1@dont-email.me>
 <874jahznzt.fsf@nosuchdomain.example.com> <v36nf9$12bei$1@dont-email.me>
 <87v82b43h6.fsf@nosuchdomain.example.com> <v4igql$32qts$1@dont-email.me>
 <v4kib3$3icus$1@dont-email.me> <v4kpvc$3jrmr$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 16 Jun 2024 16:54:54 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="efa9642fc1579e8fdfcdda80d78f3954";
	logging-data="118280"; mail-complaints-to="abuse@eternal-september.org";
	posting-account="U2FsdGVkX19G64dEOhs5POI487GPW4iLCM1y5K301PE="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:UugKSYUowlRoywc72DUr17sSPUg=
In-Reply-To: <v4kpvc$3jrmr$1@dont-email.me>
Content-Language: en-GB
Bytes: 8260

On 15/06/2024 21:27, bart wrote:
> On 15/06/2024 18:17, David Brown wrote:
>> On 15/06/2024 00:39, bart wrote:
>>> On 14/06/2024 22:30, Keith Thompson wrote:
>>>
>>>> Now that it's too late to change the definition, I've thought of
>>>> something that I think would have been a better way to specify #embed.
>>>>
>>>> Define a new kind of string literal, with a "uc" prefix.  `uc"foo"` is
>>>> of type `unsigned char[3]`.  (Or `const unsigned char[3]`, if that's
>>>> not too radical.)  Unlike other string literals, there is no implicit
>>>> terminating '\0'.  Arbitrary byte values can of course be specified in
>>>> hexadecimal: uc"\x01\x02\x03\x04".  Since there's no terminating null
>>>> character and C doesn't support zero-sized objects, uc"" is a syntax
>>>> error.
>>>>
>>>> uc"..." string literals might be made even simpler, for example
>>>> allowing only hex digits and not requiring \x (uc"01020304" rather than
>>>> uc"\x01\x02\x03\x04").  That's probably overkill.  uc"..."  literals
>>>> could be useful in other contexts, and programmers will want
>>>> flexibility.  Maybe something like hex"01020304" (embedded spaces could
>>>> be ignored) could be defined in addition to uc"\x01\x02\x03\x04".
>>>
>>> That's something I added to string literals in my language within the 
>>> last few months. Nothing to do with embedding (but it can make hex 
>>> sequences within strings more efficient, if that approach was used).
>>>
>>> Writing byte-at-a-time hex data was always a bit fiddly:
>>>
>>>      0x12, 0x34, 0xAB, ...
>>>      "\x12\x34\xAB...
>>>
>>> It was made worse by my preference for `x` being in lower case, and 
>>> the hex digits in upper case, otherwise 0XABC or 0Xab or 0xab look 
>>> wrong.
>>>
>>> What I did was create a new, variable-length string escape sequence 
>>> that looks like this:
>>>
>>>    "ABC\h1234AB...\nopq"     // hex sequence between ABC & nopq
>>>
>>> Hex digits after \h or \H are read in pairs. White space is allowed 
>>> between pairs:
>>>
>>>    "ABC\H 12 34 AB ...\nopq"
>>>
>>> The only thing I wasn't sure about was the closing backslash, which 
>>> looks at first like another escape code. But I think it is sound, 
>>> although it can still be tweaked.
>>>
>>>
>>
>> How often would something like that be useful?  I would have thought 
>> that it is rare to see something that is basically text but has enough 
>> odd non-printing characters (other than the common \n, \t, \e) to make 
>> it worth the fuss.  If you want to have binary data in something that 
>> looks like a string literal, then just use straight-up two hex digits 
>> per character - "4142431234ab".  It's simpler to generate and parse.  
>> I don't see the benefit of something that mixes binary and text data.
> 
> That's not the same thing. That sequence "...1234..." occupies 4 bytes 
> (with values 49 50 51 52), not two bytes (with values 0x12 and 0x34, or 
> 18 and 52).
> 
> Here's an example of wanting to print '€4.99', first in C (note that my 
> editor doesn't support Unicode so this stuff is needed):
> 
>     puts("\xE2\x82\xAC" "4.99");
> 
> The euro symbol occupies three bytes in UTF-8. It's awkward to type: it 
> has loads of backslashes, it keeps switching case and it needs more 
> concentration.
> 
> Plus I had to split the string since apparently \x doesn't stop at two 
> hex digits, it keeps going: it would have read \xAC4, which overflows 
> the 8-bit width of a character anyway, so I don't know what the point is 
> of reading more than 2 hex characters.
> 
> Using my feature, it looks like this:
> 
>      println "\H E2 82 AC\4.99"
> 

I don't see any significant improvement here.  If there is one, it is 
very minor.

(I gather you have other conveniences for your language's printing 
features when converting various types, but that's a different matter.)

The obvious answer to writing this kind of thing is simply to switch to 
an editor that supports UTF-8.  That has been the obvious answer for a 
couple of decades.
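
On the \x point quoted above: as far as I know, a hexadecimal escape in 
C has no length limit, while an octal escape stops after at most three 
digits, so the octal spelling does not need the string to be split.  A 
minimal sketch, assuming nothing beyond standard C (the names euro_hex, 
euro_oct and euro_spellings_match are mine, purely for illustration):

```c
#include <string.h>

/* "\xE2\x82\xAC" is the UTF-8 encoding of U+20AC (the euro sign).
   A \x escape consumes every following hex digit, so the literal is
   split before "4.99" to stop the '4' being absorbed into \xAC. */
static const char euro_hex[] = "\xE2\x82\xAC" "4.99";

/* An octal escape takes at most three digits, so no split is needed:
   \342 \202 \254 are the same three bytes as E2 82 AC. */
static const char euro_oct[] = "\342\202\2544.99";

/* Returns 1 if both spellings yield identical bytes, 0 otherwise. */
int euro_spellings_match(void)
{
    return sizeof euro_hex == sizeof euro_oct
        && memcmp(euro_hex, euro_oct, sizeof euro_hex) == 0;
}
```

Neither spelling is pleasant to type, which is rather the point: with a 
UTF-8 editor the literal is simply written as text.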

> There must be loads of examples of wanting to write many byte values 
> within strings, which in C can also be used to initialise byte arrays (a 
> useful feature I've now adopted; see below).
> 
> Here's another example, in my language, which is the first 128 bytes of 
> an EXE file which is constant. It is currently defined like this, 
> probably created with a script:
> 
>    []byte stubdata = (
>      0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00,
>      0x04, 0x00, 0x00, 0x00, 0xFF, 0xFF, 0x00, 0x00,
>      ...
> 
> Using the new escape, I can just copy&paste a dump, and use a text 
> editor to put in the string context needed, which took under a minute:
> 
> []byte stubdata=
>    b"\H 4D 5A 90 00 03 00 00 00 04 00 00 00 FF FF 00 00\"+
>    b"\H B8 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00\"+
>    b"\H 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00\"+
>    b"\H 00 00 00 00 00 00 00 00 00 00 00 00 80 00 00 00\"+
>    b"\H 0E 1F BA 0E 00 B4 09 CD 21 B8 01 4C CD 21 54 68\"+
>    b"\H 69 73 20 70 72 6F 67 72 61 6D 20 63 61 6E 6E 6F\"+
>    b"\H 74 20 62 65 20 72 75 6E 20 69 6E 20 44 4F 53 20\"+
>    b"\H 6D 6F 64 65 2E 0D 0D 0A 24 00 00 00 00 00 00 00\"+
>    b"\H 50 45 00 00 64 86 04 00 00 00 00 00 00 00 00 00\"

Why bother with the \H stuff?  That's my point - use hex data for data, 
and text for text.  Mixing the two is not common enough to justify the 
extra fuss for such a negligible gain in convenience.

My suggestion is that it could be helpful to have binary blobs written 
as hex digits without escapes anywhere, because it is /just/ binary 
data.  I don't object to having optional spaces - that's a fine idea. 
But just write:

     b"4D 5A 90 00 03 00 00 00 04 00 00 00 FF FF 00 00"
     b"B8 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00"

The extra "\H" adds nothing useful.
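
To show how little parsing such a format needs, here is a rough sketch 
of a decoder in C.  It works at run time on an ordinary string, so it is 
only an illustration of the format and not the compile-time literal 
being discussed; hex_blob_decode and hex_digit_value are hypothetical 
names, and the rule that white space may appear only between pairs is my 
assumption.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_digit_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode text such as "4D 5A 90 00" into out[0..cap-1].
   Digits are read in pairs; white space is allowed between pairs only.
   Returns the number of bytes written, or -1 on a stray character,
   an unpaired digit, or insufficient room in out. */
long hex_blob_decode(const char *text, unsigned char *out, size_t cap)
{
    size_t n = 0;

    for (;;) {
        while (isspace((unsigned char)*text))
            text++;
        if (*text == '\0')
            return (long)n;

        int hi = hex_digit_value((unsigned char)text[0]);
        if (hi < 0)
            return -1;
        /* If text[1] is the terminator it is rejected here as well. */
        int lo = hex_digit_value((unsigned char)text[1]);
        if (lo < 0)
            return -1;
        if (n == cap)
            return -1;

        out[n++] = (unsigned char)(hi * 16 + lo);
        text += 2;
    }
}
```

Adjacent string literals are concatenated by the C compiler, so a dump 
pasted as one literal per line can be passed as a single argument 
without any explicit joining operator.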




> 
> (The 's'/'b' prefixes are needed for strings to have a type of (in C 
> terms) char[] rather than char*, a detail that C glosses over via some 
> magic. 's' gives you a zero terminator, 'b' as used here doesn't. The 
> "+" is used for compile-time string/data-string concatenation.)
> 
> In short, more is possible without needing to resort to tools. You can 
> directly work from a hex dump.
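
For comparison only, and not as a claim about how the 's'/'b' prefixes 
are implemented: C already distinguishes the two cases by how the array 
is declared.  A small sketch, assuming an ASCII execution character set 
(the array and function names are made up for the example):

```c
#include <assert.h>   /* static_assert (C11) */

/* Size taken from the literal: two characters plus the implicit '\0'. */
const char with_nul[] = "MZ";

/* Size given explicitly and exactly filled: C drops the terminator.
   (C++ rejects this, and some C compilers can warn about it.) */
const unsigned char without_nul[2] = "MZ";

static_assert(sizeof with_nul == 3, "terminator included");
static_assert(sizeof without_nul == 2, "terminator dropped");

/* First four bytes of the stub quoted above, written with hex escapes.
   Every escape is followed by a backslash or the closing quote, so the
   greedy \x rule causes no trouble here. */
const unsigned char stub_head[4] = "\x4D\x5A\x90\x00";

/* Returns 1 if stub_head starts with the same two bytes as "MZ". */
int stub_has_mz_signature(void)
{
    return stub_head[0] == without_nul[0]
        && stub_head[1] == without_nul[1];
}
```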