<v9ilte$gvc8$2@dont-email.me>
Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Thiago Adams <thiago.adams@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 13:27:26 -0300
Organization: A noiseless patient Spider
Lines: 143
Message-ID: <v9ilte$gvc8$2@dont-email.me>
References: <v9frim$3u7qi$1@dont-email.me> <v9grjd$4cjd$1@dont-email.me> <87sev8eydx.fsf@nosuchdomain.example.com> <v9i54d$e3c6$1@dont-email.me> <v9ia2i$f3p2$1@dont-email.me> <v9ibkf$e3c6$2@dont-email.me> <v9iipe$gl5i$1@dont-email.me> <v9iksr$gvc8$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Aug 2024 18:27:26 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="f8b9718c855d7d90766802f0edc42262"; logging-data="556424"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19KFyZ+GGikcmEjd2AEkHcy5XUZzDVoavM="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:S96yiDNfP6mvE6WC/oHpSUxJwNI=
In-Reply-To: <v9iksr$gvc8$1@dont-email.me>
Content-Language: en-US
Bytes: 5838

On 14/08/2024 13:10, Thiago Adams wrote:
> On 14/08/2024 12:34, Bart wrote:
>> On 14/08/2024 14:31, Thiago Adams wrote:
>>> On 14/08/2024 10:05, Bart wrote:
>>>> On 14/08/2024 12:41, Thiago Adams wrote:
>>>>> On 13/08/2024 21:33, Keith Thompson wrote:
>>>>>> Bart<bc@freeuk.com> writes:
>>>>>> [...]
>>>>>>> What exactly do you mean by multi-byte characters? Is it a literal
>>>>>>> such as 'ABCD'?
>>>>>>>
>>>>>>> I've no idea what C makes of that,
>>>>>> It's a character constant of type int with an implementation-defined
>>>>>> value. Read the section on "Character constants" in the C standard
>>>>>> (6.4.4.4 in C17).
>>>>>>
>>>>>> (With gcc, its value is 0x41424344, but other compilers can and do
>>>>>> behave differently.)
>>>>>>
>>>>>> We discussed this at some length several years ago.
>>>>>>
>>>>>> [...]
>>>>>
>>>>>
>>>>> "An integer character constant has type int. The value of an
>>>>> integer character constant containing a single character that maps
>>>>> to a single value in the literal encoding (6.2.9) is the numerical
>>>>> value of the representation of the mapped character in the literal
>>>>> encoding interpreted as an integer. The value of an integer
>>>>> character constant containing more than one character (e.g. 'ab'),
>>>>> or containing a character or escape sequence that does not map to a
>>>>> single value in the literal encoding, is implementation-defined. If
>>>>> an integer character constant contains a single character or escape
>>>>> sequence, its value is the one that results when an object with
>>>>> type char whose value is that of the single character or escape
>>>>> sequence is converted to type int."
>>>>>
>>>>>
>>>>> I am suggesting the define this:
>>>>>
>>>>> "The value of an integer character constant containing more than
>>>>> one character (e.g. 'ab'), or containing a character or escape
>>>>> sequence that does not map to a single value in the literal
>>>>> encoding, is implementation-defined."
>>>>>
>>>>> How?
>>>>>
>>>>> First, all source code should be utf8.
>>>>>
>>>>> Then I am suggesting we first decode the bytes.
>>>>>
>>>>> For instance, '×' is encoded with 195 and 151. We consume these 2
>>>>> bytes and the utf8 decoded value is 215.
>>>>
>>>> By that you mean the Unicode index. But you say elsewhere that
>>>> everything in your source code is UTF8.
>>>
>>>
>>> 215 is the unicode number of the character '×'.
>>>
>>>> Where then does the 215 appear? Do your char* strings use 215 for ×,
>>>> or do they use 195 and 215?
>>>
>>> 215 is the result of decoding two utf8 encoded bytes. (195 and 151)
>>>
>>>> I think this is why C requires those prefixes like u8'...'.
>>>
>>>>>
>>>>> Then this is the defined behavior
>>>>>
>>>>> static_assert('×' == 215)
>>>>
>>>> This is where you need to decide whether the integer value within
>>>> '...', AT RUNTIME, represents the Unicode index or the UTF8 sequence.
>>>
>>> why runtime? It is compile time. This is why source code must be
>>> universally encoded (utf8)
>>
>>
>> In that case I don't understand what you are testing for here. Is it
>> an error for '×' to be 215, or an error for it not to be?
>
>
> GCC handles this as multibyte. Without decoding.
>
> The result of GCC is 50071
> static_assert('×' == 50071);
>
> The explanation is that GCC is doing:
>
> 256*195 + 151 = 50071
>
> (Remember the utf8 bytes were 195 151)
>
> The way 'ab' is handled is the same of '×' on GCC. Clang have a error
> for that. The standard just says the value is implementation defined.
>
>> And what is the test for, to ensure encoding is UTF8 in this ...
>> source file? ... compiler?
>
> MSVC has some checks, I don't know that is the logic.
>
>
>> Where would the 'decoded 215' come into it?
>
> 215 is the value after decoding utf8 and producing the unicode value.
>
> So my suggestion is decode first.
>
> The bad part of my suggestion we may have two different ways of
> producing the same value.
>
> For instance the number generated by ab is the same of
>
> 'ab' == '𤤰'
>
> The advantage is to converge to utf8 unicode and make it specified.
>
>

I use multi-character constants in my code. For instance:

enum token {TK_EQUAL = '=='};

I prefer to write and read

token.type == '=='

rather than

token.type == TK_EQUAL

An alternative for me could also be a macro:

if (token.type == MC('=', '=')) {...}

but then it is worse than token.type == TK_EQUAL.
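
To make the numbers in this thread concrete, here is a minimal sketch in C, under stated assumptions. MC is only one possible definition of the macro mentioned above (the post does not show one), and pack_bytes, decode_utf8, UTF8_INVALID and the enum tag token_type are hypothetical names introduced for illustration. pack_bytes mimics the "256*195 + 151" evaluation attributed to GCC above, and decode_utf8 is the "decode first" step being proposed. It does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One possible MC: pack two bytes as first*256 + second, i.e. the
   value GCC is described above as giving a two-character constant.
   Unlike '==', this does not depend on implementation-defined behavior. */
#define MC(a, b) ((int)(((unsigned char)(a) << 8) | (unsigned char)(b)))

/* Hypothetical token enum built on the macro. */
enum token_type { TK_EQUAL = MC('=', '=') };

/* '=' is 61 in ASCII, so the packed value is 61*256 + 61. */
static_assert(MC('=', '=') == 15677, "MC packs big-endian");

#define UTF8_INVALID 0xFFFFFFFFu

/* Pack raw bytes without decoding (the GCC-style result). */
static uint32_t pack_bytes(const unsigned char *s, size_t n)
{
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = (v << 8) | s[i];
    return v;
}

/* Decode exactly one UTF-8 sequence of n bytes into a code point.
   Returns UTF8_INVALID if the bytes are not one well-formed sequence.
   Sketch only: overlong forms and surrogates are not rejected. */
static uint32_t decode_utf8(const unsigned char *s, size_t n)
{
    uint32_t cp;
    size_t need; /* number of continuation bytes expected */

    if (n == 0)
        return UTF8_INVALID;
    if (s[0] < 0x80)                { cp = s[0];        need = 0; }
    else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 1; }
    else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 2; }
    else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 3; }
    else
        return UTF8_INVALID;

    if (n != need + 1)
        return UTF8_INVALID;
    for (size_t i = 1; i <= need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return UTF8_INVALID;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    return cp;
}
```

For the bytes 195 151, pack_bytes gives 50071 and decode_utf8 gives 215, which are the two candidate values for '×' discussed above. For the bytes of 'ab', pack_bytes gives 97*256 + 98 = 24930 (0x6162).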