Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Bart <bc@freeuk.com>
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 20:32:31 +0100
Organization: A noiseless patient Spider
Lines: 78
Message-ID: <v9j0oe$in82$1@dont-email.me>
References: <v9frim$3u7qi$1@dont-email.me> <v9grjd$4cjd$1@dont-email.me>
 <87sev8eydx.fsf@nosuchdomain.example.com> <v9i54d$e3c6$1@dont-email.me>
 <v9ia2i$f3p2$1@dont-email.me> <v9ibkf$e3c6$2@dont-email.me>
 <v9iipe$gl5i$1@dont-email.me> <v9iksr$gvc8$1@dont-email.me>
 <v9io8c$h8v8$1@dont-email.me> <v9iq5k$hhhs$1@dont-email.me>
 <v9is2h$i0sd$1@dont-email.me> <v9isvq$i0fs$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Aug 2024 21:32:31 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="876cd8a29adc81823b5bb07945ef8107";
	logging-data="613634"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18+HE9ORPqz/aoXLzFEIaA/"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:3EVwt2qEEl5fEKKxLwfLWeDL8D4=
Content-Language: en-GB
In-Reply-To: <v9isvq$i0fs$1@dont-email.me>
Bytes: 3929

On 14/08/2024 19:28, Thiago Adams wrote:
> On 14/08/2024 15:12, Bart wrote:
>> On 14/08/2024 18:40, Thiago Adams wrote:
>>> On 14/08/2024 14:07, Bart wrote:
>>
>>>> That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to 
>>>> clash with some other Unicode character.
>>>>
>>>>
>>>
>>> My suggestion again. I am using string but imagine this working with 
>>> bytes from file.
>>>
>>>
>>> #include <stdio.h>
>>> #include <assert.h>
>>
>> ...
>>> int get_value(const char* s0)
>>> {
>>>     const char * s = s0;
>>>     int value = 0;
>>>     int  uc;
>>>     s = utf8_decode(s, &uc);
>>>     while (s)
>>>     {
>>>       if (uc < 0x007F)
>>>       {
>>>          //multichar formula
>>>          value = value*256+uc;
>>>       }
>>>       else
>>>       {
>>>          //single char
>>>          value = uc;
>>>          break; //check if there is more then error..
>>>       }
>>>       s = utf8_decode(s, &uc);
>>>     }
>>>     return value;
>>> }
>>>
>>> int main(){
>>>    printf("%d\n", get_value(u8"×"));
>>>    printf("%d\n", get_value(u8"ab"));
>>> }
>>
>> I see your problem. You're mixing things up.
> 
> 
> The objective is :
>   - make single characters have the Unicode value without  having to use 
> U''
>   - allow more than one chars like 'ab' in some cases where each 
> character is less than 0x007F. This can break code for instance '¼¼'.
> but I am suspecting people are not using in this way (I hope)

Obviously that can't work, for example because two printable ASCII 
characters, with codes 32 to 126, will have values from 8224 (0x2020) to 
32382 (0x7E7E) when combined in a character literal using your 
value*256+uc formula. Those are going to clash with Unicode characters 
with those values.

It won't work either at compile-time or runtime.

You need to choose between Unicode representation and UTF8. Either that 
or use some prefix to disambiguate in source code, but you still need to 
decide whether '€' in source code is represented as the Unicode bytes 20 
AC (or maybe 00 20 AC) or the UTF8 sequence E2 82 AC, and further decide 
which end of those sequences will be the least significant byte.


> In any case..my suggestion looks dangerous. But meanwhile this is not 
> well specified in the standard.

It wasn't well-specified even when dealing with 100% ASCII. For example, 
'AB' might have the hex value 0x4142 on one compiler, 0x4241 on another, 
maybe just 0x41 or 0x42 on a third, or even 0x41420000.