Path: news.eternal-september.org!eternal-september.org!news.eternal-september.org!feeder3.eternal-september.org!news.quux.org!news.nk.ca!rocksolid2!i2pn2.org!.POSTED!not-for-mail
From: Richard Damon <richard@damon-family.org>
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Tue, 13 Aug 2024 23:44:24 -0400
Organization: i2pn2 (i2pn.org)
Message-ID: <1ffb2244967a28423c968f4b4a9fec5a2553f356@i2pn2.org>
References: <v9frim$3u7qi$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Aug 2024 03:44:24 -0000 (UTC)
Injection-Info: i2pn2.org;
logging-data="2503679"; mail-complaints-to="usenet@i2pn2.org";
posting-account="diqKR1lalukngNWEqoq9/uFtbkm5U+w3w6FQ0yesrXg";
User-Agent: Mozilla Thunderbird
Content-Language: en-US
In-Reply-To: <v9frim$3u7qi$1@dont-email.me>
X-Spam-Checker-Version: SpamAssassin 4.0.0
On 8/13/24 10:45 AM, Thiago Adams wrote:
> static_assert('×' == 50071);
>
> GCC - warning multi byte
> CLANG - error character too large
>
> I think instead of "multi bytes" we need "multi characters" - not bytes.
>
> We decode utf8 then we have the character to decide if it is multi char
> or not.
>
> decoding '×' would consume bytes 195 and 151 the result is the decoded
> Unicode value of 215.
>
> It is not multi byte : 256*195 + 151 = 50071
>
> On the other hand 'ab' is "multi character" resulting
>
> 256 * 'a' + 'b' = 256*97+98= 24930
>
> One consequence is that
>
> 'ab' == '𤤰'
>
> But I don't think this is a problem. At least everything is defined.
When you use single quotes by themselves ('), you are specifying a
character in the narrow character set, typically ASCII, though it might
be some other 8-bit character encoding. That form cannot specify
extended characters beyond those.
You can (if the implementation allows it) place multiple characters in
the constant to get an integer value with those characters packed.
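A minimal sketch of that packing, using the numbers from the quoted post. The helper name pack2 is made up for illustration, and the base-256, high-byte-first order is only what GCC is commonly seen to do; the standard leaves the value of a multi-character constant implementation-defined.

```c
#include <assert.h>

/* Hypothetical helper: packs two byte values base-256, high byte first.
   This mirrors the arithmetic in the quoted post; it is an assumption
   about one implementation's choice, not something the standard fixes. */
static int pack2(unsigned char hi, unsigned char lo)
{
    return 256 * hi + lo;
}

/* Self-check with the values from the quoted post (ASCII assumed). */
static void check_pack2(void)
{
    assert(pack2(97, 98) == 24930);    /* 'a' = 97, 'b' = 98 */
    assert(pack2(195, 151) == 50071);  /* the two UTF-8 bytes of U+00D7 */
}
```

Calling check_pack2() from main should pass silently. Whether 'ab' itself compares equal to pack2('a', 'b') depends on the compiler.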
When you use double quotes by themselves ("), you are specifying a
string of these narrow characters, although this form might allow for
multi-byte encodings of some characters, as is done with UTF-8.
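To make the multi-byte case concrete, here is a rough sketch that stores U+00D7 as explicit UTF-8 bytes in a narrow string and decodes it back to 215. The names times_utf8 and decode_utf8_2 are invented for this example, and the decoder only handles two-byte sequences and does not reject overlong forms.

```c
#include <assert.h>
#include <string.h>

/* U+00D7 written as explicit UTF-8 bytes, so the example does not depend
   on the source or execution character set. */
static const char times_utf8[] = "\xC3\x97";

/* Hypothetical helper: decodes one two-byte UTF-8 sequence.
   Returns the code point, or -1 if s does not start with such a sequence. */
static long decode_utf8_2(const char *s)
{
    unsigned char b0 = (unsigned char)s[0];
    if ((b0 & 0xE0) != 0xC0)
        return -1;                      /* not a two-byte lead byte */
    unsigned char b1 = (unsigned char)s[1];
    if ((b1 & 0xC0) != 0x80)
        return -1;                      /* not a continuation byte */
    return ((long)(b0 & 0x1F) << 6) | (b1 & 0x3F);
}

static void check_utf8(void)
{
    assert(strlen(times_utf8) == 2);            /* bytes 195 and 151 */
    assert(decode_utf8_2(times_utf8) == 215);   /* U+00D7 */
    assert(decode_utf8_2("a") == -1);           /* plain ASCII is rejected */
}
```

So the string holds two char elements for one character, which is exactly what a plain character constant has no portable way to express.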
You can specify wide character constants with the syntax L'x', u'x',
or U'x'.
L'x' will give you whatever the implementation calls its "wide
character set". This MIGHT be UCS-2/UTF-16 or UCS-4/UTF-32 encoded, but
doesn't need to be.
The u'x' form will always be UCS-2/UTF-16, and U'x' will always be
UCS-4/UTF-32.
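A small sketch of the prefixed forms, assuming a C11 compiler that provides <uchar.h> and uses UTF-16 and UTF-32 for char16_t and char32_t (GCC and Clang on common platforms do). The character is spelled \u00D7 so nothing depends on how the source file is encoded, and the helper wide_is_unicode is a made-up name.

```c
#include <assert.h>  /* static_assert */
#include <uchar.h>   /* char16_t, char32_t (C11) */

/* Both prefixed forms yield the Unicode value 215 for U+00D7
   (assumption: the implementation uses UTF-16 / UTF-32 here). */
static_assert(u'\u00D7' == 0xD7, "u'' is a UTF-16 code unit");
static_assert(U'\u00D7' == 0xD7, "U'' is a UTF-32 code unit");

/* Hypothetical helper: reports whether the implementation promises that
   wchar_t values are ISO 10646 (Unicode) code points. For L'x' this is
   the only portable hint about the encoding. */
static int wide_is_unicode(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

There is deliberately no assertion on L'\u00D7', since its value is up to the implementation.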
Like the plain 'x' form, the result for a single character cannot be
a multi-unit value, so u'x' can't generate a surrogate pair for a
single source character.
Change the ' to a " and you get wide strings that follow the same rules
as the character forms, except that u"xx" and L"xx" can now generate
characters that use surrogate pairs (or other multi-part encodings for
L"xx").
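A short sketch of the surrogate-pair case, again assuming UTF-16 and UTF-32 for char16_t and char32_t. U+1F600 is just an arbitrary code point outside the Basic Multilingual Plane, and the array and function names are invented for the example.

```c
#include <assert.h>
#include <uchar.h>

/* One source character beyond U+FFFF, as a UTF-16 and a UTF-32 string. */
static const char16_t s16[] = u"\U0001F600";
static const char32_t s32[] = U"\U0001F600";

static void check_surrogates(void)
{
    /* Element counts include the terminating zero. */
    assert(sizeof s16 / sizeof s16[0] == 3);        /* two units + 0 */
    assert(s16[0] == 0xD83D && s16[1] == 0xDE00);   /* surrogate pair */
    assert(sizeof s32 / sizeof s32[0] == 2);        /* one unit + 0 */
    assert(s32[0] == 0x1F600);
}
```

The same character written as u'\U0001F600' has no single-unit representation, which is why compilers tend to reject it or warn about it.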