Path: ...!3.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Thiago Adams <thiago.adams@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 14:40:04 -0300
Organization: A noiseless patient Spider
Lines: 217
Message-ID: <v9iq5k$hhhs$1@dont-email.me>
References: <v9frim$3u7qi$1@dont-email.me> <v9grjd$4cjd$1@dont-email.me> <87sev8eydx.fsf@nosuchdomain.example.com> <v9i54d$e3c6$1@dont-email.me> <v9ia2i$f3p2$1@dont-email.me> <v9ibkf$e3c6$2@dont-email.me> <v9iipe$gl5i$1@dont-email.me> <v9iksr$gvc8$1@dont-email.me> <v9io8c$h8v8$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Aug 2024 19:40:05 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="f8b9718c855d7d90766802f0edc42262"; logging-data="575036"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+eQjkSNshF1oxkA//pN1DYDM4WZEsHztY="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:lH/hFcJi6yVfCl5qZx2UjSq35+8=
Content-Language: en-US
In-Reply-To: <v9io8c$h8v8$1@dont-email.me>
Bytes: 7015

On 14/08/2024 14:07, Bart wrote:
> On 14/08/2024 17:10, Thiago Adams wrote:
>> On 14/08/2024 12:34, Bart wrote:
>
>>> In that case I don't understand what you are testing for here. Is it
>>> an error for '×' to be 215, or an error for it not to be?
>>
>> GCC handles this as multibyte, without decoding.
>>
>> The result of GCC is 50071:
>> static_assert('×' == 50071);
>>
>> The explanation is that GCC is doing:
>>
>> 256*195 + 151 = 50071
>
> So the 50071 is the 2-byte UTF8 sequence.

50071 is the result of multiplying the first byte, 195, by 256 and adding the second byte, 151.
(This is NOT UTF8-related; this is the way C compilers generate the value.)

On the other hand, DECODING the bytes 195 and 151 as UTF8 gives the result 215, which is the Unicode value.

>
>> (Remember the utf8 bytes were 195 151)
>>
>> The way 'ab' is handled is the same as '×' on GCC.
>
> I don't understand. 'a' and 'b' each occupy one byte. Together they need
> two bytes.
> Where's the problem? Are you perhaps confused as to what UTF8 is?

I am not confused. The problem is that the value of 'ab' is not defined in C, so I want to use this, but it produces a warning.

>
> The 50071 above is much better expressed as hex: C397, which is two
> bytes. Since both values are in 128..255, they are UTF8 codes, here
> expressing a single Unicode character.

I am using '==' etc. to represent token numbers.

> Given any two bytes in UTF8, it is easy to see whether they are two
> ASCII characters, or one (or part of) a Unicode character, or one ASCII
> character followed by the first byte of a UTF8 sequence, or if they are
> malformed (eg. the middle of a UTF8 sequence).
>
> There is no confusion.
>
>>> And what is the test for, to ensure encoding is UTF8 in this ...
>>> source file? ... compiler?
>>
>> MSVC has some checks; I don't know what the logic is.
>>
>>> Where would the 'decoded 215' come into it?
>>
>> 215 is the value after decoding utf8 and producing the unicode value.
>
> Who or what does that, and for what purpose? From what I've seen, only
> you have introduced it.

Any modern language will make '×' 215 (the Unicode value), but these languages don't allow multi-character constants like 'ab'. New languages behave like U'×' in C.

>> So my suggestion is decode first.
>
> Why? What are you comparing? Both sides of == must use UTF8 or Unicode,
> but why introduce Unicode at all if apparently everything in source code
> and at compile time, as you yourself have stated, is UTF8?
>
>> The bad part of my suggestion is that we may have two different ways of
>> producing the same value.
>>
>> For instance the number generated by 'ab' is the same as that of '𤤰':
>>
>> 'ab' == '𤤰'
>
> I don't think so. If I run this program:
>
> #include <stdio.h>
> #include <string.h>
>
> int main() {
>     printf("%u\n", '×');
>     printf("%04X\n", '×');
>     printf("%u\n", 'ab');
>     printf("%04X\n", 'ab');
>     printf("%u\n", '𤤰');
>     printf("%04X\n", '𤤰');
> }

This is not running the algorithm I am suggesting! This 'ab' == '𤤰' happens only in the way I am suggesting; no compiler is doing that today. (I never imagined this would cause such confusion.)

>
> I get this output (I've left out the decimal versions for clarity):
>
> C397     ×
> 6162     ab
> F0A4A4B0 𤤰
>
> That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
> clash with some other Unicode character.

My suggestion again. I am using a string here, but imagine this working with bytes from a file.

#include <stdio.h>
#include <assert.h>

const unsigned char* utf8_decode(const unsigned char* s, int* c)
{
    if (s[0] == '\0')
    {
        *c = 0;
        return NULL; /* end */
    }
    const unsigned char* next = NULL;
    if (s[0] < 0x80)
    {
        *c = s[0];
        assert(*c >= 0x0000 && *c <= 0x007F);
        next = s + 1;
    }
    else if ((s[0] & 0xe0) == 0xc0)
    {
        *c = ((int)(s[0] & 0x1f) << 6) |
             ((int)(s[1] & 0x3f) << 0);
        assert(*c >= 0x0080 && *c <= 0x07FF);
        next = s + 2;
    }
    else if ((s[0] & 0xf0) == 0xe0)
    {
        *c = ((int)(s[0] & 0x0f) << 12) |
             ((int)(s[1] & 0x3f) << 6) |
             ((int)(s[2] & 0x3f) << 0);
        assert(*c >= 0x0800 && *c <= 0xFFFF);
        next = s + 3;
    }
    else if ((s[0] & 0xf8) == 0xf0 && (s[0] <= 0xf4))
    {
        *c = ((int)(s[0] & 0x07) << 18) |
             ((int)(s[1] & 0x3f) << 12) |
             ((int)(s[2] & 0x3f) << 6) |
             ((int)(s[3] & 0x3f) << 0);
        assert(*c >= 0x10000 && *c <= 0x10FFFF);
        next = s + 4;
    }
    else
    {
        *c = -1; // invalid

========== REMAINDER OF ARTICLE TRUNCATED ==========