Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Thiago Adams <thiago.adams@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 15:28:10 -0300
Organization: A noiseless patient Spider
Lines: 70
Message-ID: <v9isvq$i0fs$1@dont-email.me>
References: <v9frim$3u7qi$1@dont-email.me> <v9grjd$4cjd$1@dont-email.me> <87sev8eydx.fsf@nosuchdomain.example.com> <v9i54d$e3c6$1@dont-email.me> <v9ia2i$f3p2$1@dont-email.me> <v9ibkf$e3c6$2@dont-email.me> <v9iipe$gl5i$1@dont-email.me> <v9iksr$gvc8$1@dont-email.me> <v9io8c$h8v8$1@dont-email.me> <v9iq5k$hhhs$1@dont-email.me> <v9is2h$i0sd$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Aug 2024 20:28:11 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="f8b9718c855d7d90766802f0edc42262"; logging-data="590332"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/fFkbiKD7t4Yp7k4Bkk3aQeU55DB3rXls="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:8GR52a0Mp9Nfa2F27eXRQXrnsi4=
Content-Language: en-US
In-Reply-To: <v9is2h$i0sd$1@dont-email.me>
Bytes: 3309

On 14/08/2024 15:12, Bart wrote:
> On 14/08/2024 18:40, Thiago Adams wrote:
>> On 14/08/2024 14:07, Bart wrote:
>
>>> That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
>>> clash with some other Unicode character.
>>>
>>>
>>
>> My suggestion again. I am using string but imagine this working with
>> bytes from file.
>>
>>
>> #include <stdio.h>
>> #include <assert.h>
>
> ...
>> int get_value(const char* s0)
>> {
>>     const char * s = s0;
>>     int value = 0;
>>     int uc;
>>     s = utf8_decode(s, &uc);
>>     while (s)
>>     {
>>         if (uc < 0x007F)
>>         {
>>             //multichar formula
>>             value = value*256+uc;
>>         }
>>         else
>>         {
>>             //single char
>>             value = uc;
>>             break; //check if there is more then error..
>>         }
>>         s = utf8_decode(s, &uc);
>>     }
>>     return value;
>> }
>>
>> int main(){
>>     printf("%d\n", get_value(u8"×"));
>>     printf("%d\n", get_value(u8"ab"));
>> }
>
> I see your problem. You're mixing things up.

The objective is:
 - make a single character have its Unicode value without having to
   use U''
 - allow more than one character, like 'ab', in the cases where each
   character is less than 0x007F.

This can break code, for instance '¼¼', but I suspect people are not
using it this way (I hope).

> gcc will combine BYTE values together (by shifting by 8 bits or
> multiplying by 256), including the individual bytes that represent UTF8.
>
> You are combining ONLY ASCII bytes, and comparing the results with
> 21-bit Unicode values.
>
> That is meaningless. I'm not surprised you get a clash between A*256+B,
> and some arbitrary Unicode index.
>

In any case, my suggestion looks dangerous. Meanwhile, though, this is
not well specified in the standard.
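
To make the clash Bart describes concrete, here is a minimal sketch, not the code from the thread. The utf8_decode() below is a hypothetical stand-in for the function elided by the "..." in the quoted post (it does no overlong or surrogate checking), byte_value() is an assumed model of the byte-by-byte combining that Bart attributes to gcc, and show() is an illustrative helper. Strings are written with hex escapes so the result does not depend on the source file encoding.

```c
#include <assert.h>
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for the utf8_decode() elided from the post.
   Decodes one code point from s into *out and returns a pointer to the
   next byte, or NULL at the end of the string or on a malformed sequence.
   Overlong forms and surrogates are NOT rejected. */
const char *utf8_decode(const char *s, int *out)
{
    assert(s != NULL && out != NULL);
    const unsigned char *p = (const unsigned char *)s;
    int extra, cp;

    if (p[0] == 0) return NULL;                          /* end of string */
    if (p[0] < 0x80)                { cp = p[0];        extra = 0; }
    else if ((p[0] & 0xE0) == 0xC0) { cp = p[0] & 0x1F; extra = 1; }
    else if ((p[0] & 0xF0) == 0xE0) { cp = p[0] & 0x0F; extra = 2; }
    else if ((p[0] & 0xF8) == 0xF0) { cp = p[0] & 0x07; extra = 3; }
    else return NULL;                                    /* stray byte */

    for (int i = 1; i <= extra; i++) {
        /* A NUL terminator also fails this test, so we never read past it. */
        if ((p[i] & 0xC0) != 0x80) return NULL;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    *out = cp;
    return (const char *)(p + 1 + extra);
}

/* The rule proposed in the post: code points below 0x7F are combined with
   value*256+uc; any other code point becomes the value by itself. */
int get_value(const char *s)
{
    int value = 0, uc;
    while ((s = utf8_decode(s, &uc)) != NULL) {
        if (uc < 0x7F) {
            value = value * 256 + uc;   /* multichar formula */
        } else {
            value = uc;                 /* single char */
            break;
        }
    }
    return value;
}

/* Assumed model of the behaviour Bart describes: every BYTE is combined,
   including the individual bytes of a UTF-8 sequence. */
int byte_value(const char *s)
{
    unsigned value = 0;
    for (; *s; s++)
        value = value * 256 + (unsigned char)*s;
    return (int)value;
}

/* Illustrative helper: print both interpretations of the same bytes. */
void show(const char *s)
{
    printf("get_value=%d byte_value=%d\n", get_value(s), byte_value(s));
}
```

Under the proposed rule, get_value("ab") is 97*256+98 = 24930 = 0x6162, and get_value("\xE6\x85\xA2") (the UTF-8 bytes of code point U+6162) is also 0x6162, so two different constants end up with the same value. With the byte-combining model the same inputs give 0x6162 and 0xE685A2, which do not collide. Calling show() on each string prints both interpretations side by side.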