From: Bart
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 18:07:26 +0100

On 14/08/2024 17:10, Thiago Adams wrote:
> On 14/08/2024 12:34, Bart wrote:
>> In that case I don't understand what you are testing for here. Is it
>> an error for '×' to be 215, or an error for it not to be?
>
> GCC handles this as multibyte. Without decoding.
>
> The result of GCC is 50071
> static_assert('×' == 50071);
>
> The explanation is that GCC is doing:
>
> 256*195 + 151 = 50071

So 50071 is the value of the 2-byte UTF8 sequence.

> (Remember the utf8 bytes were 195 151)
>
> The way 'ab' is handled is the same as '×' on GCC.

I don't understand. 'a' and 'b' each occupy one byte. Together they
need two bytes. Where's the problem?

Are you perhaps confused as to what UTF8 is?

The 50071 above is much better expressed in hex: C397, which is two
bytes. Since both bytes are in 128..255, they are UTF8 bytes, here
encoding a single Unicode character.

Given any two bytes in UTF8, it is easy to see whether they are two
ASCII characters, or one Unicode character (or part of one), or one
ASCII character followed by the first byte of a UTF8 sequence, or
whether they are malformed (eg. the middle of a UTF8 sequence). There
is no confusion.

>> And what is the test for, to ensure encoding is UTF8 in this ...
>> source file? ... compiler?
>
> MSVC has some checks, I don't know what the logic is.
>
>> Where would the 'decoded 215' come into it?
>
> 215 is the value after decoding utf8 and producing the unicode value.

Who or what does that, and for what purpose? From what I've seen, only
you have introduced it.

> So my suggestion is decode first.

Why? What are you comparing? Both sides of == must use UTF8 or
Unicode, but why introduce Unicode at all if, as you yourself have
stated, apparently everything in the source code and at compile time
is UTF8?

> The bad part of my suggestion is we may have two different ways of
> producing the same value.
>
> For instance the number generated by 'ab' is the same as
>
> 'ab' == '𤤰'

I don't think so. If I run this program:

#include <stdio.h>

int main() {
    printf("%u\n", '×');
    printf("%04X\n", '×');
    printf("%u\n", 'ab');
    printf("%04X\n", 'ab');
    printf("%u\n", '𤤰');
    printf("%04X\n", '𤤰');
}

I get this output (I've left out the decimal versions for clarity):

C397      ×
6162      ab
F0A4A4B0  𤤰

That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
clash with some other Unicode character: every byte of a multi-byte
UTF8 sequence has its top bit set, while 'a' (61) and 'b' (62) are
plain ASCII.
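
To spell out what I mean by "easy to see", here is a rough sketch of
the byte classification. This is just my own illustration, not
anything GCC or MSVC actually does; the enum and function names are
mine, and it only looks at the lead/continuation bit patterns, it
doesn't validate whole sequences:

#include <stdio.h>

/* Classify a single byte as it would appear in a UTF8 stream. */
enum b_kind { B_ASCII, B_LEAD2, B_LEAD3, B_LEAD4, B_CONT, B_BAD };

static enum b_kind classify(unsigned char b) {
    if (b < 0x80)           return B_ASCII;  /* 0xxxxxxx: plain ASCII        */
    if ((b & 0xC0) == 0x80) return B_CONT;   /* 10xxxxxx: continuation byte  */
    if ((b & 0xE0) == 0xC0) return B_LEAD2;  /* 110xxxxx: starts 2-byte seq  */
    if ((b & 0xF0) == 0xE0) return B_LEAD3;  /* 1110xxxx: starts 3-byte seq  */
    if ((b & 0xF8) == 0xF0) return B_LEAD4;  /* 11110xxx: starts 4-byte seq  */
    return B_BAD;                            /* F8..FF never occur in UTF8   */
}

static const char *kindname(enum b_kind k) {
    static const char *names[] =
        {"ASCII", "lead of 2", "lead of 3", "lead of 4",
         "continuation", "invalid"};
    return names[k];
}

int main(void) {
    unsigned char times[] = {0xC3, 0x97};  /* the UTF8 bytes of '×' */
    unsigned char ab[]    = {0x61, 0x62};  /* 'a', 'b'              */

    printf("C3 is %s, 97 is %s\n",
           kindname(classify(times[0])), kindname(classify(times[1])));
    printf("61 is %s, 62 is %s\n",
           kindname(classify(ab[0])), kindname(classify(ab[1])));
}

That prints "lead of 2, continuation" for the '×' bytes and "ASCII,
ASCII" for 'ab', which is the whole point: the two cases can't be
mistaken for one another.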
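
And since the 215 keeps coming up, here are the two numbers side by
side. Again only a sketch: it handles just the 2-byte case and does no
validation, and the variable names are mine. The "packed" line is the
256*195 + 151 arithmetic from above; the "decoded" line is ordinary
UTF8 decoding (5 payload bits from the lead byte, 6 from the
continuation):

#include <stdio.h>

int main(void) {
    unsigned char hi = 0xC3, lo = 0x97;     /* the UTF8 bytes of '×' */

    /* Pack the raw bytes, as in 256*195 + 151 = 50071. */
    unsigned packed  = 256u * hi + lo;

    /* Decode the 2-byte sequence to a Unicode code point. */
    unsigned decoded = ((hi & 0x1Fu) << 6) | (lo & 0x3Fu);

    printf("packed  = %u (0x%X)\n", packed, packed);     /* 50071, 0xC397 */
    printf("decoded = %u (U+%04X)\n", decoded, decoded); /* 215, U+00D7   */
}

Both are well-defined computations; the question remains why the
compiler should do the second one.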