From: Ben Bacarisse
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 01:32:14 +0100
Message-ID: <874j7ot04x.fsf@bsb.me.uk>

Thiago Adams writes:

> static_assert('×' == 50071);

  static_assert(U'×' == 215);

works, but then I don't know what you were trying to do.

> GCC - warning multi byte
> CLANG - error character too large
>
> I think instead of "multi bytes" we need "multi characters" - not
> bytes.
>
> We decode utf8 then we have the character to decide if it is multi char or
> not.

These terms can be confusing and I don't know exactly how you are using
them.  Basically I simply don't know what that second sentence is
saying.

> decoding '×' would consume bytes 195 and 151 the result is the decoded
> Unicode value of 215.

Yes, Unicode 215 is UTF-8 encoded as two bytes with values 195 and 151.

> It is not multi byte : 256*195 + 151 = 50071

If that × is UTF-8 encoded then it might look, to the compiler, like an
old-fashioned multi-character character constant, just as 'ab' does.
Then again, it might not.  gcc and clang take different views on the
matter.
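
To make the arithmetic above concrete, here is a minimal sketch in plain C.
The function names utf8_encode_small and pack_bytes are made up for
illustration, and the encoder only handles code points below 0x800 (one or
two UTF-8 bytes), which is enough for U+00D7.  The base-256 packing mirrors
the calculation quoted above; what value a compiler actually gives a
multi-character constant is implementation-defined, so treat this as an
illustration of the numbers and not as a statement of what any compiler must
do.

```c
/* Encode a code point below 0x800 as UTF-8 into out[].
   Returns the number of bytes written (1 or 2).
   Assumption: the caller never passes cp >= 0x800. */
int utf8_encode_small(unsigned cp, unsigned char out[2])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;          /* plain ASCII, one byte */
        return 1;
    }
    out[0] = (unsigned char)(0xC0 | (cp >> 6));   /* lead byte: 110xxxxx */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F)); /* continuation: 10xxxxxx */
    return 2;
}

/* Combine n bytes base 256, first byte most significant.
   This is the "256*195 + 151" style of calculation. */
unsigned pack_bytes(const unsigned char *bytes, int n)
{
    unsigned value = 0;
    for (int i = 0; i < n; i++)
        value = value * 256u + bytes[i];
    return value;
}
```

Calling utf8_encode_small(215, buf) should fill buf with 195 and 151, and
pack_bytes(buf, 2) should then give 50071; packing the bytes of "ab" the same
way gives 24930.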

You can get clang to take the same view as gcc by writing

  static_assert('\xC3\x97' == 50071);

instead.  Now both gcc and clang see it for what it is: an old-fashioned
multi-character character constant.

> On the other hand 'ab' is "multi character" resulting

The term for these things used to be "multi-byte character constant" and
they were highly non-portable.  The trouble is that the term "multi-byte
character" now refers to highly portable encodings like UTF-8.  Maybe
that's why gcc seems to have changed its warning from what you gave to:

  warning: multi-character character constant [-Wmultichar]

> 256 * 'a' + 'b' = 256*97+98= 24930
>
> One consequence is that
>
> 'ab' == '𤤰'
>
> But I don't think this is a problem. At least everything is defined.
>

-- 
Ben.