Article <v9ia2i$f3p2$1@dont-email.me>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Bart <bc@freeuk.com>
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 14:05:22 +0100
Organization: A noiseless patient Spider
Lines: 79
Message-ID: <v9ia2i$f3p2$1@dont-email.me>
References: <v9frim$3u7qi$1@dont-email.me> <v9grjd$4cjd$1@dont-email.me>
 <87sev8eydx.fsf@nosuchdomain.example.com> <v9i54d$e3c6$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Aug 2024 15:05:22 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="876cd8a29adc81823b5bb07945ef8107";
	logging-data="495394"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/mynwvivECta4DCA1TzNxQ"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:BqkwnsBGLPp0HeAdCm6lxIcnodU=
Content-Language: en-GB
In-Reply-To: <v9i54d$e3c6$1@dont-email.me>
Bytes: 3986

On 14/08/2024 12:41, Thiago Adams wrote:
> On 13/08/2024 21:33, Keith Thompson wrote:
>> Bart<bc@freeuk.com>  writes:
>> [...]
>>> What exactly do you mean by multi-byte characters? Is it a literal
>>> such as 'ABCD'?
>>>
>>> I've no idea what C makes of that,
>> It's a character constant of type int with an implementation-defined
>> value.  Read the section on "Character constants" in the C standard
>> (6.4.4.4 in C17).
>>
>> (With gcc, its value is 0x41424344, but other compilers can and do
>> behave differently.)
>>
>> We discussed this at some length several years ago.
>>
>> [...]
> 
> 
> "An integer character constant has type int. The value of an integer 
> character constant containing
> a single character that maps to a single value in the literal encoding 
> (6.2.9) is the numerical value
> of the representation of the mapped character in the literal encoding 
> interpreted as an integer.
> The value of an integer character constant containing more than one 
> character (e.g. 'ab'), or
> containing a character or escape sequence that does not map to a single 
> value in the literal encoding,
> is implementation-defined. If an integer character constant contains a 
> single character or escape
> sequence, its value is the one that results when an object with type 
> char whose value is that of the
> single character or escape sequence is converted to type int."
> 
> 
> I am suggesting we define this:
> 
> "The value of an integer character constant containing more than one 
> character (e.g. 'ab'), or containing a character or escape sequence that 
> does not map to a single value in the literal encoding, is 
> implementation-defined."
> 
> How?
> 
> First, all source code should be utf8.
> 
> Then I am suggesting we first decode the bytes.
> 
> For instance, '×' is encoded with 195 and 151. We consume these 2 bytes 
> and the utf8 decoded value is 215.

By that you mean the Unicode index, i.e. the code point U+00D7. But you 
say elsewhere that everything in your source code is UTF8.

Where then does the 215 appear? Do your char* strings use 215 for ×, or 
do they use 195 and 151?
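
To make the question concrete, here is a minimal C sketch of what an ordinary char* string holds if it stores × as UTF8. The names times_utf8 and utf8_count_code_points are made up for illustration, and the bytes are written as hex escapes so that nothing depends on how a particular compiler reads the source file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "×" (U+00D7) spelled with explicit byte escapes, so the stored bytes do
   not depend on the compiler's source or execution character set. */
static const char times_utf8[] = "\xC3\x97";

/* Hypothetical helper: count code points in a UTF-8 string by skipping
   continuation bytes (those of the form 10xxxxxx).
   Assumes the input is valid UTF-8. */
static size_t utf8_count_code_points(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++n;
    }
    return n;
}
```

Under these assumptions the string is two bytes long (195 and 151) but holds one code point. The value 215 is not stored anywhere in it; it only appears once the two bytes are decoded.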

I think this is why C requires those prefixes like u8'...'.

> 
> Then this is the defined behavior
> 
> static_assert('×' == 215);

This is where you need to decide whether the integer value within '...', 
AT RUNTIME, represents the Unicode index or the UTF8 sequence.
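
As a rough sketch of how the two choices already look in C11 and later: the U prefix gives the code-point view and a u8 string gives the encoded view. The function and array names below are invented for the example, and the code-point value relies on the implementation encoding char32_t as UTF-32 (signalled by __STDC_UTF_32__, and as far as I know mandatory from C23).

```c
#include <assert.h>
#include <stdint.h>

/* Code-point view. U'...' has type char32_t, which <uchar.h> defines as
   the same type as uint_least32_t. Its value is the UTF-32 code point on
   implementations that define __STDC_UTF_32__. */
static uint_least32_t times_as_code_point(void)
{
    return U'\u00D7';
}

/* Encoded view. A u8 string literal is always UTF-8, so this array holds
   the two code units for U+00D7 followed by the terminating zero. */
static const unsigned char times_as_utf8[] = u8"\u00D7";
```

The unprefixed '×' stays implementation-defined either way; the sketch only shows that both interpretations can be spelled explicitly today.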

(In my language, though I do very little with Unicode ATM, I decided 
that everything is UTF8 at both compile time and runtime, unless I 
explicitly expand a UTF8 u8[] string to a u32[] or i32[] array (either 
will work) holding 21-bit Unicode index values.)
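
For comparison, that kind of expansion could look like this in C. It is only a sketch under stated assumptions: utf8_to_utf32 is a hypothetical helper, not a standard function, and it does not reject overlong encodings or surrogate values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: expand a NUL-terminated UTF-8 string into 32-bit
   code points. Stores at most cap values in out and returns how many were
   decoded, or (size_t)-1 on a malformed sequence or if out is too small.
   Sketch only: overlong forms and surrogates are not rejected. */
static size_t utf8_to_utf32(const char *s, uint32_t *out, size_t cap)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t n = 0;

    assert(s != NULL && out != NULL);
    while (*p != 0) {
        uint32_t cp;
        int extra;                       /* continuation bytes expected */

        if (*p < 0x80)                { cp = *p;         extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1Fu; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0Fu; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07u; extra = 3; }
        else return (size_t)-1;          /* stray continuation or bad lead */
        ++p;

        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)     /* also catches a truncated tail */
                return (size_t)-1;
            cp = (cp << 6) | (*p++ & 0x3Fu);
        }
        if (n == cap)
            return (size_t)-1;
        out[n++] = cp;
    }
    return n;
}
```

Calling it on the bytes 'A', 195, 151 should yield the two values 65 and 215, which is the 215 from the static_assert above.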

I get the impression that C's wide characters are intended for those 
Unicode indices, but that's not going to work well on Windows with its 
16-bit wide character type, where anything above 0xFFFF needs two 
elements (a surrogate pair).
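
The problem with a 16-bit type can be sketched as follows. The helper name utf16_encode is made up, and uint16_t stands in for a 16-bit wchar_t so that the example behaves the same on any platform.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-16. Returns the number
   of 16-bit units written to out (1 or 2), or 0 if cp cannot be encoded. */
static int utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                        /* reserved for surrogates */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;           /* fits in a single unit */
        return 1;
    }
    if (cp > 0x10FFFF)
        return 0;                        /* beyond the Unicode range */

    cp -= 0x10000;                       /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

So × (215) fits in one element, but a code point such as U+1F600 takes two, which means one 16-bit wide character can no longer be assumed to be one Unicode index.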