Article <v4ota0$ii30$1@dont-email.me>

Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.lang.c
Subject: Re: C23 thoughts and opinions
Date: Mon, 17 Jun 2024 10:49:04 +0200
Organization: A noiseless patient Spider
Lines: 220
Message-ID: <v4ota0$ii30$1@dont-email.me>
References: <v2l828$18v7f$1@dont-email.me>
 <00297443-2fee-48d4-81a0-9ff6ae6481e4@gmail.com>
 <v2lji1$1bbcp$1@dont-email.me> <87msoh5uh6.fsf@nosuchdomain.example.com>
 <f08d2c9f-5c2e-495d-b0bd-3f71bd301432@gmail.com>
 <v2nbp4$1o9h6$1@dont-email.me> <v2ng4n$1p3o2$1@dont-email.me>
 <87y18047jk.fsf@nosuchdomain.example.com>
 <87msoe1xxo.fsf@nosuchdomain.example.com> <v2sh19$2rle2$2@dont-email.me>
 <87ikz11osy.fsf@nosuchdomain.example.com> <v2v59g$3cr0f$1@dont-email.me>
 <87plt8yxgn.fsf@nosuchdomain.example.com> <v31rj5$o20$1@dont-email.me>
 <87cyp6zsen.fsf@nosuchdomain.example.com> <v34gi3$j385$1@dont-email.me>
 <874jahznzt.fsf@nosuchdomain.example.com> <v36nf9$12bei$1@dont-email.me>
 <87v82b43h6.fsf@nosuchdomain.example.com> <v4igql$32qts$1@dont-email.me>
 <v4kib3$3icus$1@dont-email.me> <v4kpvc$3jrmr$1@dont-email.me>
 <v4mubu$3jg8$1@dont-email.me> <v4ncor$66d3$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 17 Jun 2024 10:49:05 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="6eee6dfb82180fb756db1a7758fc4b5a";
	logging-data="608352"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18f9p5u9dVnV+pJvrWBVS5iwkDG2wX/djI="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.11.0
Cancel-Lock: sha1:Xt2Q0iHzlg0et/Gst5OlIjEApsg=
Content-Language: en-GB
In-Reply-To: <v4ncor$66d3$1@dont-email.me>
Bytes: 10948

On 16/06/2024 21:00, bart wrote:
> On 16/06/2024 15:54, David Brown wrote:
>> On 15/06/2024 21:27, bart wrote:
>>> On 15/06/2024 18:17, David Brown wrote:
>>>> On 15/06/2024 00:39, bart wrote:
>>>>> On 14/06/2024 22:30, Keith Thompson wrote:
>>>>>
>>>>>> Now that it's too late to change the definition, I've thought of
>>>>>> something that I think would have been a better way to specify
>>>>>> #embed.
>>>>>>
>>>>>> Define a new kind of string literal, with a "uc" prefix.
>>>>>> `uc"foo"` is of type `unsigned char[3]`.  (Or
>>>>>> `const unsigned char[3]`, if that's not too radical.)  Unlike other
>>>>>> string literals, there is no implicit terminating '\0'.  Arbitrary
>>>>>> byte values can of course be specified in hexadecimal:
>>>>>> uc"\x01\x02\x03\x04".  Since there's no terminating null character
>>>>>> and C doesn't support zero-sized objects, uc"" is a syntax error.
>>>>>>
>>>>>> uc"..." string literals might be made even simpler, for example
>>>>>> allowing only hex digits and not requiring \x (uc"01020304" rather
>>>>>> than uc"\x01\x02\x03\x04").  That's probably overkill.  uc"..."
>>>>>> literals could be useful in other contexts, and programmers will
>>>>>> want flexibility.  Maybe something like hex"01020304" (embedded
>>>>>> spaces could be ignored) could be defined in addition to
>>>>>> uc"\x01\x02\x03\x04".
>>>>>
>>>>> That's something I added to string literals in my language within
>>>>> the last few months. Nothing to do with embedding (but it can make
>>>>> hex sequences within strings more efficient, if that approach was
>>>>> used).
>>>>>
>>>>> Writing byte-at-a-time hex data was always a bit fiddly:
>>>>>
>>>>>      0x12, 0x34, 0xAB, ...
>>>>>      "\x12\x34\xAB...
>>>>>
>>>>> It was made worse by my preference for `x` being in lower case, and 
>>>>> the hex digits in upper case, otherwise 0XABC or 0Xab or 0xab look 
>>>>> wrong.
>>>>>
>>>>> What I did was create a new, variable-length string escape
>>>>> sequence that looks like this:
>>>>>
>>>>>    "ABC\h1234AB...\nopq"     // hex sequence between ABC & nopq
>>>>>
>>>>> Hex digits after \h or \H are read in pairs. White space is allowed 
>>>>> between pairs:
>>>>>
>>>>>    "ABC\H 12 34 AB ...\nopq"
>>>>>
>>>>> The only thing I wasn't sure about was the closing backslash, which 
>>>>> looks at first like another escape code. But I think it is sound, 
>>>>> although it can still be tweaked.
>>>>>
>>>>>
>>>>
>>>> How often would something like that be useful?  I would have thought 
>>>> that it is rare to see something that is basically text but has 
>>>> enough odd non-printing characters (other than the common \n, \t, 
>>>> \e) to make it worth the fuss.  If you want to have binary data in 
>>>> something that looks like a string literal, then just use 
>>>> straight-up two hex digits per character - "4142431234ab".  It's 
>>>> simpler to generate and parse. I don't see the benefit of something 
>>>> that mixes binary and text data.
>>>
>>> That's not the same thing. That sequence "...1234..." occupies 4 
>>> bytes (with values 49 50 51 52), not two bytes (with values 0x12 and 
>>> 0x34, or 18 and 52).
>>>
>>> Here's an example of wanting to print '€4.99', first in C (note that 
>>> my editor doesn't support Unicode so this stuff is needed):
>>>
>>>     puts("\xE2\x82\xAC" "4.99");
>>>
>>> The euro symbol occupies three bytes in UTF8. It's awkward to type: 
>>> it has loads of backslashes, it keeps switching case and it needs 
>>> more concentration.
>>>
>>> Plus I had to split the string since apparently \x doesn't stop at 
>>> two hex digits, it keeps going: it would have read \xAC4, which 
>>> overflows the 8-bit width of a character anyway, so I don't know what 
>>> the point is of reading more than 2 hex characters.
>>>
>>> Using my feature, it looks like this:
>>>
>>>      println "\H E2 82 AC\4.99"
>>>
>>
>> I don't see any improvement of significance.  The improvement, if any, 
>> is very minor.
> 
> The difference is that it can be typed fluently without that annoying \x 
> between every number. Plus I can add white space for grouping without it 
> affecting the data.
> 

I realise you think your system is much nicer - otherwise you would not 
have implemented it!  /I/ don't think it is a big improvement.  It is 
certainly not big enough to be worth the effort of changing real 
languages or tools used by lots of people rather than just a single 
person.  And I think the termination using "\" is a step backwards - now 
"\" is no longer an escape character, but has different purposes in 
different places.  One and a half steps forward, one step back, is not 
worth the effort - especially when you can so easily go several steps 
forward with the format I suggested.

> 
>> (I gather you have other conveniences for your language's printing 
>> features when converting various types, but that's a different matter.)
>>
>> The obvious answer to writing this kind of thing is simply to switch 
>> to an editor that supports UTF-8.
> 
> It never happens that you want to type a bunch of hex byte values to 
> initialise a byte array? OK.

It /does/ happen.  In such cases, I type a bunch of hex values.

What doesn't happen is that I have a UTF-8 text and I choose to write 
that using hex values.  I much prefer to write the UTF-8 text using an 
editor that supports UTF-8 and tools that work with UTF-8.
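
As a rough sketch of "hex data for data, text for text" in today's C: the block below keeps a binary blob as a list of hex byte values, and shows the euro example both with \x escapes and with a universal character name.  All the names are invented for illustration, and the blob is just the first eight bytes of the b"..." example quoted further down.  It assumes a C11 or later compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Binary data: a plain list of hex byte values. */
static const unsigned char blob[] = {
    0x4D, 0x5A, 0x90, 0x00, 0x03, 0x00, 0x00, 0x00
};

/* Text spelled with hex escapes.  The literal is split in two because \x
   consumes every hex digit that follows it, so "\xAC4.99" would be read
   as the single escape \xAC4 followed by ".99". */
static const char euro_hex[] = "\xE2\x82\xAC" "4.99";

/* The same text with a universal character name.  \u takes exactly four
   hex digits, so no split is needed; the u8 prefix asks for UTF-8
   regardless of the execution character set. */
static const unsigned char euro_ucn[] = u8"\u20AC4.99";

/* Three bytes for the euro sign, four for "4.99", one terminating null. */
static_assert(sizeof euro_hex == 8, "unexpected size for the hex spelling");

/* Number of bytes in the binary blob. */
size_t blob_size(void)
{
    return sizeof blob;
}

/* Length in bytes of the hex-escaped text, without the terminator. */
size_t euro_length(void)
{
    return strlen(euro_hex);
}

/* Nonzero if both spellings produce identical bytes. */
int euro_spellings_match(void)
{
    return sizeof euro_hex == sizeof euro_ucn
        && memcmp(euro_hex, euro_ucn, sizeof euro_hex) == 0;
}
```

With a UTF-8 aware editor and toolchain the text can of course simply be typed as "€4.99"; the \u form is only there so the example does not depend on how this post or the source file is encoded.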

> 
>> Why bother with the \H stuff?  That's my point - use hex data for 
>> data, and text for text.  Mixing these is not common enough to make it 
>> worth the extra fuss you have to give such negligible extra convenience.
>>
>> My suggestion is that it could be helpful to have binary blobs written 
>> as hex digits without escapes anywhere, because it is /just/ binary 
>> data.  I don't object to having optional spaces - that's a fine idea. 
>> But just write :
>>
>>      b"4D 5A 90 00 03 00 00 00 04 00 00 00 FF FF 00 00"
>>      b"B8 00 00 00 00 00 00 00 40 00 00 00 00 00 00 00"
>>
>> The extra "\H" adds nothing useful.
> 
> Is this a separate feature using 'b'? 

Yes - that's the point.  It would be for expressing binary blob data in 
a compact form as a string of hex digits, with or without spaces, and 
convenient for copy-and-paste from hex editors and other such sources. 
You could happily use h"..." rather than b"..." if you prefer.  And I 
suppose it could be extended to support lumps bigger than 8 bits, but 
then endian issues complicate matters and I suspect it is not worth the 
effort.
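
To make the intended meaning concrete, here is a minimal run-time sketch of what such a literal would denote.  No b"..." or h"..." literal exists in C; parse_hex_blob() is a hypothetical helper, and the details (spaces allowed only between digit pairs, upper or lower case digits, -1 on any error) are assumptions made for the example rather than part of any proposal.

```c
#include <assert.h>
#include <stddef.h>

/* Value of one hex digit, or -1 if c is not a hex digit. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode text such as "4D 5A 90 00" into out[0..cap-1].
   Each byte is exactly two hex digits; spaces may appear between pairs
   but not inside one.  Returns the number of bytes written, or -1 on a
   stray character, an unpaired digit, or when out is too small. */
long parse_hex_blob(const char *text, unsigned char *out, size_t cap)
{
    size_t n = 0;

    assert(text != NULL && out != NULL);

    while (*text != '\0') {
        if (*text == ' ') {          /* optional grouping space */
            text++;
            continue;
        }
        int hi = hex_value(text[0]);
        if (hi < 0) return -1;
        /* text[1] is at worst the terminating null, which is rejected. */
        int lo = hex_value(text[1]);
        if (lo < 0) return -1;
        if (n == cap) return -1;     /* output buffer is full */
        out[n++] = (unsigned char)(hi * 16 + lo);
        text += 2;
    }
    return (long)n;
}
```

Decoding "4D 5A 90 00" gives the four bytes 0x4D 0x5A 0x90 0x00, and "4142431234ab" gives six bytes, the first three being the codes for "ABC" in ASCII.  Because the unit is always one byte, no byte-order question arises; it would arise as soon as wider lumps were allowed.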

> Because in my scheme, \H is just 
> another string escape code, which can be used in ordinary strings, 

That is what I would want to avoid.  Being able to mix such data is a 
disadvantage, not an advantage.  (IMHO, of course.)

> and 
> b"" strings define char[] data which can include normal text data too.
> 
> So my example could have been written as b"MZ\h 90 00 03 ..."

And that kind of monstrosity is what I was trying to get away from.

> 
========== REMAINDER OF ARTICLE TRUNCATED ==========