Article <v9ilte$gvc8$2@dont-email.me>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Thiago Adams <thiago.adams@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 13:27:26 -0300
Organization: A noiseless patient Spider
Lines: 143
Message-ID: <v9ilte$gvc8$2@dont-email.me>
References: <v9frim$3u7qi$1@dont-email.me> <v9grjd$4cjd$1@dont-email.me>
 <87sev8eydx.fsf@nosuchdomain.example.com> <v9i54d$e3c6$1@dont-email.me>
 <v9ia2i$f3p2$1@dont-email.me> <v9ibkf$e3c6$2@dont-email.me>
 <v9iipe$gl5i$1@dont-email.me> <v9iksr$gvc8$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Aug 2024 18:27:26 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="f8b9718c855d7d90766802f0edc42262";
	logging-data="556424"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19KFyZ+GGikcmEjd2AEkHcy5XUZzDVoavM="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:S96yiDNfP6mvE6WC/oHpSUxJwNI=
In-Reply-To: <v9iksr$gvc8$1@dont-email.me>
Content-Language: en-US
Bytes: 5838

On 14/08/2024 13:10, Thiago Adams wrote:
> On 14/08/2024 12:34, Bart wrote:
>> On 14/08/2024 14:31, Thiago Adams wrote:
>>> On 14/08/2024 10:05, Bart wrote:
>>>> On 14/08/2024 12:41, Thiago Adams wrote:
>>>>> On 13/08/2024 21:33, Keith Thompson wrote:
>>>>>> Bart<bc@freeuk.com>  writes:
>>>>>> [...]
>>>>>>> What exactly do you mean by multi-byte characters? Is it a literal
>>>>>>> such as 'ABCD'?
>>>>>>>
>>>>>>> I've no idea what C makes of that,
>>>>>> It's a character constant of type int with an implementation-defined
>>>>>> value.  Read the section on "Character constants" in the C standard
>>>>>> (6.4.4.4 in C17).
>>>>>>
>>>>>> (With gcc, its value is 0x41424344, but other compilers can and do
>>>>>> behave differently.)
>>>>>>
>>>>>> We discussed this at some length several years ago.
>>>>>>
>>>>>> [...]
>>>>>
>>>>>
>>>>> "An integer character constant has type int. The value of an integer
>>>>> character constant containing a single character that maps to a
>>>>> single value in the literal encoding (6.2.9) is the numerical value
>>>>> of the representation of the mapped character in the literal encoding
>>>>> interpreted as an integer. The value of an integer character constant
>>>>> containing more than one character (e.g. 'ab'), or containing a
>>>>> character or escape sequence that does not map to a single value in
>>>>> the literal encoding, is implementation-defined. If an integer
>>>>> character constant contains a single character or escape sequence,
>>>>> its value is the one that results when an object with type char whose
>>>>> value is that of the single character or escape sequence is converted
>>>>> to type int."
>>>>>
>>>>>
>>>>> I am suggesting we define this:
>>>>>
>>>>> "The value of an integer character constant containing more than
>>>>> one character (e.g. 'ab'), or containing a character or escape
>>>>> sequence that does not map to a single value in the literal
>>>>> encoding, is implementation-defined."
>>>>>
>>>>> How?
>>>>>
>>>>> First, all source code should be utf8.
>>>>>
>>>>> Then I am suggesting we first decode the bytes.
>>>>>
>>>>> For instance, '×' is encoded with 195 and 151. We consume these 2 
>>>>> bytes and the utf8 decoded value is 215.
>>>>
>>>> By that you mean the Unicode index. But you say elsewhere that 
>>>> everything in your source code is UTF8.
>>>
>>>
>>> 215 is the unicode number of the character '×'.
>>>
>>>> Where then does the 215 appear? Do your char* strings use 215 for ×, 
>>>> or do they use 195 and 151?
>>>
>>> 215 is the result of decoding two utf8 encoded bytes. (195 and 151)
>>>
>>>> I think this is why C requires those prefixes like u8'...'.
>>>
>>>>>
>>>>> Then this is the defined behavior
>>>>>
>>>>> static_assert('×' == 215);
>>>>
>>>> This is where you need to decide whether the integer value within 
>>>> '...', AT RUNTIME, represents the Unicode index or the UTF8 sequence.
>>>
>>> why runtime? It is compile time. This is why source code must be 
>>> universally encoded (utf8)
>>
>>
>> In that case I don't understand what you are testing for here. Is it 
>> an error for '×' to be 215, or an error for it not to be?
> 
> 
> GCC handles this as a multi-character constant, without decoding.
> 
> The result with GCC is 50071:
> static_assert('×' == 50071);
> 
> The explanation is that GCC is doing:
> 
> 256*195 + 151 = 50071
> 
> (Remember the utf8 bytes were 195 151)
> 
> GCC handles 'ab' the same way it handles '×'. Clang reports an error 
> for that. The standard just says the value is implementation-defined.
> 
>> And what is the test for, to ensure encoding is UTF8 in this ... 
>> source file? ... compiler?
> 
> MSVC has some checks; I don't know what the logic is.
> 
> 
>> Where would the 'decoded 215' come into it?
> 
> 215 is the value after decoding utf8 and producing the unicode value.
> 
> So my suggestion is to decode first.
> 
> The bad part of my suggestion is that we may have two different ways of 
> producing the same value.
> 
> For instance, 'ab' yields 24930 (0x6162), which is also the code point 
> of '慢' (U+6162), so:
> 
> 'ab' == '慢'
> 
> The advantage is to converge to utf8 unicode and make it specified.
> 
> 
> 
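
To make the two numbers in the quoted discussion concrete, here is a minimal sketch that works on explicit byte arrays, so it does not depend on the encoding of the source file. The function names utf8_decode and pack_bytes are mine, not from any library, and the decoder is deliberately simplified (it does not reject overlong forms or surrogates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode exactly one UTF-8 sequence of n bytes (1 to 4).
   Returns the code point, or -1 if the bytes do not form a
   well-shaped sequence of that length. Simplified sketch:
   overlong encodings and surrogates are not rejected. */
static int32_t utf8_decode(const unsigned char *s, size_t n)
{
    if (n == 1 && s[0] < 0x80)
        return s[0];
    if (n == 2 && (s[0] & 0xE0) == 0xC0 && (s[1] & 0xC0) == 0x80)
        return ((int32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    if (n == 3 && (s[0] & 0xF0) == 0xE0 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80)
        return ((int32_t)(s[0] & 0x0F) << 12) |
               ((int32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    if (n == 4 && (s[0] & 0xF8) == 0xF0 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80 &&
        (s[3] & 0xC0) == 0x80)
        return ((int32_t)(s[0] & 0x07) << 18) |
               ((int32_t)(s[1] & 0x3F) << 12) |
               ((int32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return -1;
}

/* Pack up to 4 bytes base-256, first byte most significant.
   This is the arithmetic described above for GCC
   (256*195 + 151), applied to the raw bytes without decoding. */
static uint32_t pack_bytes(const unsigned char *s, size_t n)
{
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = (v << 8) | s[i];
    return v;
}
```

For the bytes {195, 151}, utf8_decode gives 215 and pack_bytes gives 50071. For the bytes of "ab", pack_bytes gives 24930, which is the same number as the code point U+6162, so under a decode-first rule the two spellings would collide.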

I use multibyte chars in my code.

For instance:
enum token { TK_EQUAL = '==' };

I prefer to write and read token.type == '==' rather than
token.type == TK_EQUAL.

An alternative for me could also be a macro:

if (token.type == MC('=', '=')) {...}

but then it's worse than token.type == TK_EQUAL.
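
For reference, a sketch of what such a macro could look like. The definition of MC below is an assumption of mine (the post only names it), and so are token_spelling and the extra enumerators. It packs the two characters high byte first, which gives the same value the post reports for GCC, but without relying on implementation-defined multi-character constants.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical definition: pack two characters into one int,
   first character in the high byte. The result is an integer
   constant expression, so it works in enums and case labels. */
#define MC(a, b) ((((unsigned char)(a)) << 8) | ((unsigned char)(b)))

enum token_type {
    TK_ASSIGN    = '=',
    TK_EQUAL     = MC('=', '='),
    TK_NOT_EQUAL = MC('!', '=')
};

/* Map a token type back to its spelling, using the packed
   values directly as case labels. */
static const char *token_spelling(int type)
{
    switch (type) {
    case '=':          return "=";
    case MC('=', '='): return "==";
    case MC('!', '='): return "!=";
    default:           return "?";
    }
}
```

With this definition TK_EQUAL is 61*256 + 61 = 15677 on any conforming compiler where '=' is 61, whereas the value of '==' itself remains implementation-defined.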