Article <v9isvq$i0fs$1@dont-email.me>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Thiago Adams <thiago.adams@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 15:28:10 -0300
Organization: A noiseless patient Spider
Lines: 70
Message-ID: <v9isvq$i0fs$1@dont-email.me>
References: <v9frim$3u7qi$1@dont-email.me> <v9grjd$4cjd$1@dont-email.me>
 <87sev8eydx.fsf@nosuchdomain.example.com> <v9i54d$e3c6$1@dont-email.me>
 <v9ia2i$f3p2$1@dont-email.me> <v9ibkf$e3c6$2@dont-email.me>
 <v9iipe$gl5i$1@dont-email.me> <v9iksr$gvc8$1@dont-email.me>
 <v9io8c$h8v8$1@dont-email.me> <v9iq5k$hhhs$1@dont-email.me>
 <v9is2h$i0sd$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Aug 2024 20:28:11 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="f8b9718c855d7d90766802f0edc42262";
	logging-data="590332"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/fFkbiKD7t4Yp7k4Bkk3aQeU55DB3rXls="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:8GR52a0Mp9Nfa2F27eXRQXrnsi4=
Content-Language: en-US
In-Reply-To: <v9is2h$i0sd$1@dont-email.me>
Bytes: 3309

On 14/08/2024 15:12, Bart wrote:
> On 14/08/2024 18:40, Thiago Adams wrote:
>> On 14/08/2024 14:07, Bart wrote:
> 
>>> That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to 
>>> clash with some other Unicode character.
>>>
>>>
>>
>> My suggestion again. I am using string but imagine this working with 
>> bytes from file.
>>
>>
>> #include <stdio.h>
>> #include <assert.h>
> 
> ...
>> int get_value(const char* s0)
>> {
>>     const char * s = s0;
>>     int value = 0;
>>     int  uc;
>>     s = utf8_decode(s, &uc);
>>     while (s)
>>     {
>>       if (uc < 0x007F)
>>       {
>>          //multichar formula
>>          value = value*256+uc;
>>       }
>>       else
>>       {
>>          //single char
>>          value = uc;
>>          break; //if more characters follow, this should be an error
>>       }
>>       s = utf8_decode(s, &uc);
>>     }
>>     return value;
>> }
>>
>> int main(){
>>    printf("%d\n", get_value(u8"×"));
>>    printf("%d\n", get_value(u8"ab"));
>> }
> 
> I see your problem. You're mixing things up.


The objective is:
  - make single characters have their Unicode value without having to
    use U''
  - allow more than one character, as in 'ab', in the cases where each
    character is less than 0x007F. This can break existing code, for
    instance '¼¼', but I suspect (and hope) people are not using it
    this way.
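
A minimal sketch of that rule with the error case made explicit. The
names utf8_decode_sketch and char_constant_value are hypothetical. The
decoder is only a stand-in for the utf8_decode helper elided in the
quote above, and it does not reject overlong encodings or surrogates.
Two further assumptions of mine: all of ASCII up to and including 0x7F
counts as multichar-eligible (the quoted code tests uc < 0x007F), and at
most four characters are combined so the result fits a 32-bit int.

```c
#include <stddef.h>

/* Hypothetical stand-in for the elided utf8_decode helper.
   Decodes one code point from s into *out and returns a pointer just
   past it, or NULL at the end of the string or on a malformed sequence. */
const char *utf8_decode_sketch(const char *s, int *out)
{
    const unsigned char *p = (const unsigned char *)s;
    int len, i, cp;

    if (p[0] == 0) return NULL;                        /* end of string */
    if (p[0] < 0x80)                { cp = p[0];        len = 1; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; len = 2; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; len = 3; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; len = 4; }
    else return NULL;                                  /* invalid lead byte */

    for (i = 1; i < len; i++) {
        /* A NUL byte also fails this test, so we never read past the end. */
        if ((p[i] & 0xC0) != 0x80) return NULL;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *out = cp;
    return (const char *)(p + len);
}

/* Returns 0 and stores the constant's value in *value, or returns -1 if
   the text is empty, malformed, longer than four characters, or mixes a
   non-ASCII character with anything else. */
int char_constant_value(const char *s, int *value)
{
    int v = 0, count = 0, saw_non_ascii = 0, uc;

    while (*s != '\0') {
        s = utf8_decode_sketch(s, &uc);
        if (s == NULL) return -1;                      /* malformed UTF-8 */
        if (uc > 0x7F) saw_non_ascii = 1;
        count++;
        if (saw_non_ascii && count > 1) return -1;     /* e.g. '¼¼' or 'a¼' */
        if (count > 4) return -1;                      /* keep v inside int */
        v = (uc > 0x7F) ? uc : v * 256 + uc;           /* single vs multichar */
    }
    if (count == 0) return -1;
    *value = v;
    return 0;
}
```

With this sketch, "ab" yields 97*256+98 = 24930, "×" yields its code
point 215 (U+00D7), and "¼¼" is rejected instead of silently returning
the first character.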

> gcc will combine BYTE values together (by shifting by 8 bits or 
> multiplying by 256), including the individual bytes that represent UTF8.
> 
> You are combining ONLY ASCII bytes, and comparing the results with 
> 21-bit Unicode values.
> 
> That is meaningless. I'm not surprised you get a clash between A*256+B, 
> and some arbitrary Unicode index.
> 
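
To make the difference concrete, here is a rough sketch of the byte-wise
combination as Bart describes it (value*256 + byte over every byte,
including the bytes of a UTF-8 sequence). The function name combine_bytes
is hypothetical and this is not gcc's actual code, only the same
arithmetic.

```c
/* Combine the raw bytes of s the way the quoted text describes:
   each byte, ASCII or not, is shifted in with value * 256 + byte. */
unsigned long combine_bytes(const char *s)
{
    const unsigned char *p;
    unsigned long value = 0;

    for (p = (const unsigned char *)s; *p != 0; p++)
        value = value * 256 + *p;
    return value;
}
```

Under this arithmetic "ab" gives 0x6162 (24930) and "×", stored as the
two UTF-8 bytes C3 97, gives 0xC397 rather than 215. The clash appears
only in my scheme: 0x6162 is also a valid code point (U+6162 lies in the
CJK Unified Ideographs block), so 'ab' and that single character would
get the same value, whereas the byte-wise result for that character's
UTF-8 bytes E6 85 A2 is 0xE685A2.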

In any case, my suggestion looks dangerous. But as things stand, this is
not well specified in the standard.