Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Thiago Adams <thiago.adams@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 10:31:59 -0300
Organization: A noiseless patient Spider
Lines: 93
Message-ID: <v9ibkf$e3c6$2@dont-email.me>
References: <v9frim$3u7qi$1@dont-email.me> <v9grjd$4cjd$1@dont-email.me>
 <87sev8eydx.fsf@nosuchdomain.example.com> <v9i54d$e3c6$1@dont-email.me>
 <v9ia2i$f3p2$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Aug 2024 15:32:00 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="f8b9718c855d7d90766802f0edc42262";
	logging-data="462214"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19nlG8fOeh86azFllu2mOdse9NLVn6OnmY="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:beJ2Eyqv4AUy9L1E9E+cDRYBtsg=
In-Reply-To: <v9ia2i$f3p2$1@dont-email.me>
Content-Language: en-US
Bytes: 4484

On 14/08/2024 10:05, Bart wrote:
> On 14/08/2024 12:41, Thiago Adams wrote:
>> On 13/08/2024 21:33, Keith Thompson wrote:
>>> Bart<bc@freeuk.com>  writes:
>>> [...]
>>>> What exactly do you mean by multi-byte characters? Is it a literal
>>>> such as 'ABCD'?
>>>>
>>>> I've no idea what C makes of that,
>>> It's a character constant of type int with an implementation-defined
>>> value.  Read the section on "Character constants" in the C standard
>>> (6.4.4.4 in C17).
>>>
>>> (With gcc, its value is 0x41424344, but other compilers can and do
>>> behave differently.)
>>>
>>> We discussed this at some length several years ago.
>>>
>>> [...]
>>
>>
>> "An integer character constant has type int. The value of an integer 
>> character constant containing
>> a single character that maps to a single value in the literal encoding 
>> (6.2.9) is the numerical value
>> of the representation of the mapped character in the literal encoding 
>> interpreted as an integer.
>> The value of an integer character constant containing more than one 
>> character (e.g. 'ab'), or
>> containing a character or escape sequence that does not map to a 
>> single value in the literal encoding,
>> is implementation-defined. If an integer character constant contains a 
>> single character or escape
>> sequence, its value is the one that results when an object with type 
>> char whose value is that of the
>> single character or escape sequence is converted to type int."
>>
>>
>> I am suggesting we define this:
>>
>> "The value of an integer character constant containing more than one 
>> character (e.g. 'ab'), or containing a character or escape sequence 
>> that does not map to a single value in the literal encoding, is 
>> implementation-defined."
>>
>> How?
>>
>> First, all source code should be utf8.
>>
>> Then I am suggesting we first decode the bytes.
>>
>> For instance, '×' is encoded with 195 and 151. We consume these 2 
>> bytes and the utf8 decoded value is 215.
> 
> By that you mean the Unicode index. But you say elsewhere that 
> everything in your source code is UTF8.


215 is the Unicode code point of the character '×' (U+00D7).

> Where then does the 215 appear? Do your char* strings use 215 for ×, or 
> do they use 195 and 215?

215 is the result of decoding the two UTF-8 encoded bytes (195 and 151).
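
To make the arithmetic concrete, here is a minimal sketch of that decoding step. It assumes a well-formed two-byte sequence only; the name decode_utf8_2 is hypothetical and used just for illustration, not something from the standard or from any compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: decode a well-formed two-byte UTF-8 sequence
   (110xxxxx 10yyyyyy) into its Unicode code point.
   Longer sequences and error handling are deliberately left out. */
uint32_t decode_utf8_2(unsigned char b0, unsigned char b1)
{
    assert((b0 & 0xE0) == 0xC0);  /* lead byte of a 2-byte sequence */
    assert((b1 & 0xC0) == 0x80);  /* continuation byte */

    /* Keep the 5 payload bits of the lead byte and the 6 payload
       bits of the continuation byte, then join them. */
    return ((uint32_t)(b0 & 0x1F) << 6) | (uint32_t)(b1 & 0x3F);
}
```

With the bytes from the example, 195 contributes 3 << 6 = 192 and 151 contributes 23, so decode_utf8_2(195, 151) gives 192 + 23 = 215.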

> I think this is why C requires those prefixes like u8'...'.

>>
>> Then this is the defined behavior
>>
>> static_assert('×' == 215);
> 
> This is where you need to decide whether the integer value within '...', 
> AT RUNTIME, represents the Unicode index or the UTF8 sequence.

Why runtime? It is compile time. This is why source code must be 
universally encoded as UTF-8.
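
As a rough illustration of what can already be checked at compile time, the sketch below uses the prefixed forms that C11 defines today. It assumes a C11 compiler and writes × as the universal character name \u00D7 so the result does not depend on the encoding of the source file. The unprefixed '×' is left out on purpose, since its value is currently implementation-defined, which is the thing I am proposing to change.

```c
#include <assert.h>  /* provides the static_assert macro in C11 */

/* U'...' is a char32_t constant whose value is the code point itself. */
static_assert(U'\u00D7' == 215, "U+00D7 has code point 215");

/* u8"..." holds the UTF-8 encoding: two bytes for × plus the
   terminating null character. */
static_assert(sizeof(u8"\u00D7") == 3, "two UTF-8 bytes plus NUL");
```

Both checks are evaluated by the compiler; nothing here runs at runtime.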


> (In my language, though I do very little with Unicode ATM, I decided 
> that everything is UTF8 both at compile time and runtime. Unless I 
> explicitly expand a UTF8 u8[] string to a u32[] or i32[] array (either 
> will work), which contains 21-bit Unicode index values.)
> 
> I get the impression that C's wide characters are intended for those 
> Unicode indices, but that's not going to work well on Windows with its 
> 16-bit wide character type.
> 

Nowadays wide characters are just for Windows API compatibility.