From: Bart <bc@freeuk.com>
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 00:52:13 +0100
Organization: A noiseless patient Spider
Lines: 77
Message-ID: <v9grjd$4cjd$1@dont-email.me>
References: <v9frim$3u7qi$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Aug 2024 01:52:13 +0200 (CEST)
User-Agent: Mozilla Thunderbird
Content-Language: en-GB
In-Reply-To: <v9frim$3u7qi$1@dont-email.me>
Bytes: 3711

On 13/08/2024 15:45, Thiago Adams wrote:
> static_assert('×' == 50071);
> 
> GCC -  warning multi byte
> CLANG - error character too large
> 
> I think instead of "multi bytes" we need "multi characters" - not bytes.
> 
> We decode utf8 then we have the character to decide if it is multi char 
> or not.
> 
> decoding '×' would consume bytes 195 and 151 the result is the decoded 
> Unicode value of 215.
> 
> It is not multi byte : 256*195 + 151 = 50071
> 
> On the other hand 'ab' is "multi character" resulting
> 
> 256 * 'a' + 'b' = 256*97+98= 24930
> 
> One consequence is that
> 
> 'ab' == '𤤰'
> 
> But I don't think this is a problem. At least everything is defined.

What exactly do you mean by multi-byte characters? Is it a literal such 
as 'ABCD'?

I've no idea what C makes of that, so you will first have to specify 
what it might represent:

* Is it a single character represented by multiple bytes?

* If so, do those multiple bytes specify a Unicode code point (up to 3 
bytes, since the maximum is 0x10FFFF), or a UTF8 sequence (up to 4 bytes 
per character)?

* If such multi-byte sequences are allowed, could one literal hold more 
than one character, mixing ASCII, Unicode and UTF8 forms?
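
To make the first two questions concrete, here is a minimal sketch of the 
two readings applied to the bytes 195 and 151 from the quoted example. The 
names pack_bytes and decode_utf8 are made up for illustration and are not 
from any compiler; the decoder is simplified and does not reject overlong 
forms or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reading 1: treat the literal as raw bytes, value = 256*b0 + b1 + ... */
static uint32_t pack_bytes(const unsigned char *s, size_t n)
{
    assert(n <= 4);                 /* only 32 bits to work with */
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = v * 256 + s[i];
    return v;
}

/* Reading 2: decode one UTF-8 sequence into a code point.
   Returns the number of bytes consumed, or 0 if the input is malformed.
   Simplified: overlong forms and surrogates are not rejected. */
static size_t decode_utf8(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    uint32_t v;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { len = 1; v = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else return 0;                  /* stray continuation or invalid lead */

    if (len > n) return 0;          /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        v = (v << 6) | (s[i] & 0x3F);
    }
    *cp = v;
    return len;
}
```

With the bytes {195, 151}, pack_bytes gives 50071 and decode_utf8 gives 
215, which are the two numbers in the quoted post.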

One problem with UTF8 in C character literals is that I believe those 
are limited to an 'int' type, so typically 32 bits. That is room for one 
UTF8-encoded character at most, or four ASCII ones. And once you have 
such a value, how do you print it?
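
On the printing question, one possible approach (a sketch, assuming the 
value held is a Unicode code point and that stdout goes to a UTF8 
terminal) is to encode it back to UTF8 bytes and write those. The names 
encode_utf8 and print_codepoint are hypothetical helpers, not standard 
library functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Encode one code point as UTF-8 into out[], NUL-terminated.
   Returns the number of bytes written, not counting the NUL. */
static int encode_utf8(uint32_t cp, unsigned char out[5])
{
    int n;

    assert(cp <= 0x10FFFF);         /* outside the Unicode range otherwise */
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        n = 1;
    } else if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 2;
    } else if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 3;
    } else {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        n = 4;
    }
    out[n] = '\0';
    return n;
}

/* Write one code point to stdout; assumes the terminal expects UTF-8. */
static void print_codepoint(uint32_t cp)
{
    unsigned char buf[5];
    encode_utf8(cp, buf);
    fputs((const char *)buf, stdout);
}
```

Calling print_codepoint(0x20AC) writes the three bytes E2 82 AC; whether a 
euro sign then appears depends on the terminal, which is the part plain 
'printf' cannot help with.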

Some of this you can take care of in your 'cake' product, and 
superimpose a particular spec on top of C (maybe they can be extended to 
64 bits) but you probably can't do much about 'printf'.

(In my language, I overhauled this part of it earlier this year. There 
it works like this:

* Character literals can be 64 bits

* They can represent up to 8 ASCII characters: 'ABCDEFGH'

* They can include escape codes for both Unicode and UTF8, and multiple
   such characters can be specified:

    'A\u20ACB'            # All represent A€B; this is Unicode
    'A\h E2 82 AC\B'      # This is UTF8
    'A\xE2\x82\xACB'      # C-style escape

   Internally they are stored as UTF8, so the 20AC is converted to UTF8

* The ordering of the characters matches that of the equivalent
   "A\u20ACB" string when stored in memory; but this holds only on
   little-endian machines

* Print routines have options to print the first character (which can be
   a Unicode one), or the whole sequence)
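
The storage-order point can be sketched in C. This is an illustration 
only, not my actual implementation; pack_le and matches_memory are 
hypothetical names. It puts the first byte of the UTF8 text in the 
low-order byte of a 64-bit value, so that on a little-endian machine the 
value's bytes in memory follow the string's order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack up to 8 bytes of text into a 64-bit value, first byte lowest. */
static uint64_t pack_le(const char *s)
{
    size_t n = strlen(s);
    uint64_t v = 0;

    assert(n <= 8);                 /* a 64-bit literal holds 8 bytes at most */
    for (size_t i = 0; i < n; i++)
        v |= (uint64_t)(unsigned char)s[i] << (8 * i);
    return v;
}

/* Compare the in-memory bytes of v with the string's bytes.
   Expected to succeed for pack_le results only on little-endian hosts. */
static int matches_memory(uint64_t v, const char *s)
{
    return memcmp(&v, s, strlen(s)) == 0;
}
```

For A€B, written in C as "A\xE2\x82\xAC" "B" (split so that the hex escape 
does not swallow the B), pack_le gives 0x42AC82E241.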

Another aspect is when typing Unicode text directly via your text editor 
instead of using escape codes; will the C source be UTF8, or some other 
encoding? This will affect how the text is represented, and how much you 
can fit into one 32/64-bit literal.
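
As a rough illustration of the capacity question, assuming the source text 
is valid UTF8, the following sketch counts how many whole characters fit 
in a literal of a given byte capacity. utf8_chars_that_fit is a made-up 
name, and the validity check relies on assert, so it disappears under 
NDEBUG.

```c
#include <assert.h>
#include <stddef.h>

/* Count the leading whole UTF-8 characters of s that fit in `capacity`
   bytes. Assumes s is valid, NUL-terminated UTF-8. */
static size_t utf8_chars_that_fit(const char *s, size_t capacity)
{
    size_t used = 0, count = 0;

    while (s[used] != '\0') {
        unsigned char lead = (unsigned char)s[used];
        size_t len = lead < 0x80           ? 1
                   : (lead & 0xE0) == 0xC0 ? 2
                   : (lead & 0xF0) == 0xE0 ? 3
                   :                         4;

        /* Each continuation byte must look like 10xxxxxx; a premature
           NUL fails this check before anything past it is read. */
        for (size_t i = 1; i < len; i++)
            assert(((unsigned char)s[used + i] & 0xC0) == 0x80);

        if (used + len > capacity)
            break;                  /* next character would not fit */
        used += len;
        count++;
    }
    return count;
}
```

So a 32-bit literal holds four ASCII characters but only one euro sign 
(3 bytes in UTF8), and a 64-bit one holds eight and two respectively. In a 
single-byte encoding such as Latin-1, '×' would take one byte instead of 
two.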