Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Thiago Adams <thiago.adams@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 08:41:01 -0300
Organization: A noiseless patient Spider
Lines: 83
Message-ID: <v9i54d$e3c6$1@dont-email.me>
References: <v9frim$3u7qi$1@dont-email.me> <v9grjd$4cjd$1@dont-email.me>
 <87sev8eydx.fsf@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Aug 2024 13:41:02 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="f8b9718c855d7d90766802f0edc42262";
	logging-data="462214"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/ViQ6P4j3fP3OhfC26EPROFXasPZsvj0s="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:NI/x+SdLhvSo/w8188pKNhTknmY=
In-Reply-To: <87sev8eydx.fsf@nosuchdomain.example.com>
Content-Language: en-US
Bytes: 3581

On 13/08/2024 21:33, Keith Thompson wrote:
> Bart<bc@freeuk.com>  writes:
> [...]
>> What exactly do you mean by multi-byte characters? Is it a literal
>> such as 'ABCD'?
>>
>> I've no idea what C makes of that,
> It's a character constant of type int with an implementation-defined
> value.  Read the section on "Character constants" in the C standard
> (6.4.4.4 in C17).
> 
> (With gcc, its value is 0x41424344, but other compilers can and do
> behave differently.)
> 
> We discussed this at some length several years ago.
> 
> [...]


"An integer character constant has type int. The value of an integer
character constant containing a single character that maps to a single
value in the literal encoding (6.2.9) is the numerical value of the
representation of the mapped character in the literal encoding
interpreted as an integer. The value of an integer character constant
containing more than one character (e.g. 'ab'), or containing a
character or escape sequence that does not map to a single value in the
literal encoding, is implementation-defined. If an integer character
constant contains a single character or escape sequence, its value is
the one that results when an object with type char whose value is that
of the single character or escape sequence is converted to type int."


I am suggesting that we define this:

"The value of an integer character constant containing more than one 
character (e.g. 'ab'), or containing a character or escape sequence that 
does not map to a single value in the literal encoding, is 
implementation-defined."

How?

First, all source code should be utf8.

Then I am suggesting we first decode the bytes.

For instance, '×' is encoded with 195 and 151. We consume these 2 bytes 
and the utf8 decoded value is 215.
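
A minimal sketch of that decoding step, to make the arithmetic concrete. The name utf8_decode is hypothetical, not something from the standard or from any compiler, and the sketch assumes well-formed input: it does not reject overlong forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with at
   most n bytes available. Stores the code point in *out and returns the
   number of bytes consumed, or 0 on malformed or truncated input.
   Overlong forms and surrogates are NOT rejected, to keep this short. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *out)
{
    assert(s != NULL && out != NULL);
    if (n == 0) return 0;

    uint32_t cp;
    size_t len;
    if (s[0] < 0x80)                { cp = s[0];        len = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; len = 4; }
    else return 0;                  /* stray continuation or invalid lead byte */

    if (len > n) return 0;          /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);        /* append 6 payload bits */
    }
    *out = cp;
    return len;
}
```

Called on the two bytes 195 and 151, it consumes 2 bytes and stores 215: the lead byte contributes 3, the continuation byte contributes 23, and 3*64 + 23 = 215.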

Then this is the defined behavior:

static_assert('×' == 215);

In case we have 'ab', for instance:
First we decode 'a' (97), then 'b' (98). Each one consumes one byte.
Then we have two characters. In this case we do

256 * 'a' + 'b' = 256*97 + 98 = 24930

static_assert('ab' == 24930);

I believe this static_assert('ab' == 24930) matches the way it is used 
today.
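
A small sketch of the combining rule, generalized to "value = value * 256 + next character". The function name multichar_value and the limit of four characters are assumptions made for illustration. The input is the already-decoded characters, and the accumulator is 64 bits wide so the sketch itself cannot overflow.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: combine already-decoded characters, left to
   right, as value = value * 256 + next. The limit of 4 characters is an
   assumption of this sketch, not part of the proposal. */
static uint64_t multichar_value(const uint32_t *chars, size_t count)
{
    assert(chars != NULL && count <= 4);
    uint64_t value = 0;
    for (size_t i = 0; i < count; i++)
        value = value * 256 + chars[i];
    return value;
}
```

For { 97, 98 } this gives 24930, and for { 65, 66, 67, 68 } it gives 0x41424344, which is the value quoted above for 'ABCD' with gcc.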

In case the value is bigger than INT_MAX I think it should be unsigned int.
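
As a rough sketch of that rule, the type could be chosen from the computed value like this. The enum and the function name are hypothetical, and what to do with a value that does not fit unsigned int either is left open here. For example, with 32-bit int, a constant such as '×abc' would come out as 215*2^24 + 97*2^16 + 98*2^8 + 99 = 3613483619, which is above INT_MAX (2147483647).

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumed to hold on the target, as it does on common platforms. */
static_assert(UINT_MAX > INT_MAX, "unsigned int must be wider in range than int");

/* Hypothetical classification of a character constant's type. */
enum const_type { CONST_INT, CONST_UINT, CONST_TOO_BIG };

/* Pick the type from the computed value: int when it fits,
   otherwise unsigned int, otherwise report that it fits neither. */
static enum const_type classify_constant(uint64_t value)
{
    if (value <= (uint64_t)INT_MAX)  return CONST_INT;
    if (value <= (uint64_t)UINT_MAX) return CONST_UINT;
    return CONST_TOO_BIG;
}
```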

Why?

Adding fixes on top of fixes makes the language bigger and more complex,
like adding U'', L'', u8'', etc.

In my source code I use only utf8, everything just works without any 
u8"" etc.