Path: ...!3.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Thiago Adams <thiago.adams@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 14:40:04 -0300
Organization: A noiseless patient Spider
Lines: 217
Message-ID: <v9iq5k$hhhs$1@dont-email.me>
References: <v9frim$3u7qi$1@dont-email.me> <v9grjd$4cjd$1@dont-email.me> <87sev8eydx.fsf@nosuchdomain.example.com> <v9i54d$e3c6$1@dont-email.me> <v9ia2i$f3p2$1@dont-email.me> <v9ibkf$e3c6$2@dont-email.me> <v9iipe$gl5i$1@dont-email.me> <v9iksr$gvc8$1@dont-email.me> <v9io8c$h8v8$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Aug 2024 19:40:05 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="f8b9718c855d7d90766802f0edc42262"; logging-data="575036"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+eQjkSNshF1oxkA//pN1DYDM4WZEsHztY="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:lH/hFcJi6yVfCl5qZx2UjSq35+8=
Content-Language: en-US
In-Reply-To: <v9io8c$h8v8$1@dont-email.me>
Bytes: 7015

On 14/08/2024 14:07, Bart wrote:
> On 14/08/2024 17:10, Thiago Adams wrote:
>> On 14/08/2024 12:34, Bart wrote:
>
>>> In that case I don't understand what you are testing for here. Is it
>>> an error for '×' to be 215, or an error for it not to be?
>>
>> GCC handles this as multibyte, without decoding.
>>
>> The result of GCC is 50071:
>> static_assert('×' == 50071);
>>
>> The explanation is that GCC is doing:
>>
>> 256*195 + 151 = 50071
>
> So the 50071 is the 2-byte UTF8 sequence.

50071 is the result of multiplying the first byte, 195, by 256 and adding the second byte, 151.
(This is NOT UTF8-related; this is the way C compilers generate the value.)

On the other hand, DECODING the bytes 195 and 151 as UTF8 gives the result 215, which is the Unicode value.

>
>> (Remember the utf8 bytes were 195 151)
>>
>> The way 'ab' is handled is the same as '×' on GCC.
>
> I don't understand. 'a' and 'b' each occupy one byte. Together they need
> two bytes.
> Where's the problem? Are you perhaps confused as to what UTF8 is?

I am not confused. The problem is that the value of 'ab' is not defined in C, so I want to use this, but it produces a warning.

>
> The 50071 above is much better expressed as hex: C397, which is two
> bytes. Since both values are in 128..255, they are UTF8 codes, here
> expressing a single Unicode character.

I am using '==' etc. to represent token numbers.

> Given any two bytes in UTF8, it is easy to see whether they are two
> ASCII characters, or one (or part of) a Unicode character, or one ASCII
> character followed by the first byte of a UTF8 sequence, or if they are
> malformed (eg. the middle of a UTF8 sequence).
>
> There is no confusion.
>
>>> And what is the test for, to ensure encoding is UTF8 in this ...
>>> source file? ... compiler?
>>
>> MSVC has some checks; I don't know what the logic is.
>>
>>> Where would the 'decoded 215' come into it?
>>
>> 215 is the value after decoding utf8 and producing the unicode value.
>
> Who or what does that, and for what purpose? From what I've seen, only
> you have introduced it.

Any modern language will make '×' 215 (the Unicode value), but these languages don't allow multi-character constants like 'ab'. New languages behave like U'×' in C.

>> So my suggestion is decode first.
>
> Why? What are you comparing? Both sides of == must use UTF8 or Unicode,
> but why introduce Unicode at all if apparently everything in source code
> and at compile time, as you yourself have stated, is UTF8?
>
>> The bad part of my suggestion is that we may have two different ways of
>> producing the same value.
>>
>> For instance the number generated by 'ab' is the same as that of '𤤰':
>>
>> 'ab' == '𤤰'
>
> I don't think so. If I run this program:
>
> #include <stdio.h>
> #include <string.h>
>
> int main() {
>     printf("%u\n", '×');
>     printf("%04X\n", '×');
>     printf("%u\n", 'ab');
>     printf("%04X\n", 'ab');
>     printf("%u\n", '𤤰');
>     printf("%04X\n", '𤤰');
> }

This is not running the algorithm I am suggesting! This 'ab' == '𤤰' happens only in the way I am suggesting; no compiler is doing that today. (I never imagined this would cause such confusion.)

>
> I get this output (I've left out the decimal versions for clarity):
>
> C397     ×
> 6162     ab
> F0A4A4B0 𤤰
>
> That Chinese ideogram occupies 4 bytes. It is impossible for 'ab' to
> clash with some other Unicode character.

My suggestion again. I am using a string here, but imagine this working with bytes from a file.

#include <stdio.h>
#include <assert.h>

const unsigned char* utf8_decode(const unsigned char* s, int* c)
{
    if (s[0] == '\0')
    {
        *c = 0;
        return NULL; /* end */
    }
    const unsigned char* next = NULL;
    if (s[0] < 0x80)
    {
        *c = s[0];
        assert(*c >= 0x0000 && *c <= 0x007F);
        next = s + 1;
    }
    else if ((s[0] & 0xe0) == 0xc0)
    {
        *c = ((int)(s[0] & 0x1f) << 6) |
             ((int)(s[1] & 0x3f) << 0);
        assert(*c >= 0x0080 && *c <= 0x07FF);
        next = s + 2;
    }
    else if ((s[0] & 0xf0) == 0xe0)
    {
        *c = ((int)(s[0] & 0x0f) << 12) |
             ((int)(s[1] & 0x3f) << 6) |
             ((int)(s[2] & 0x3f) << 0);
        assert(*c >= 0x0800 && *c <= 0xFFFF);
        next = s + 3;
    }
    else if ((s[0] & 0xf8) == 0xf0 && (s[0] <= 0xf4))
    {
        *c = ((int)(s[0] & 0x07) << 18) |
             ((int)(s[1] & 0x3f) << 12) |
             ((int)(s[2] & 0x3f) << 6) |
             ((int)(s[3] & 0x3f) << 0);
        assert(*c >= 0x10000 && *c <= 0x10FFFF);
        next = s + 4;
    }
    else
    {
        *c = -1; // invalid

========== REMAINDER OF ARTICLE TRUNCATED ==========