From: Bart
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 00:52:13 +0100

On 13/08/2024 15:45, Thiago Adams wrote:
> static_assert('×' == 50071);
>
> GCC - warning multi byte
> CLANG - error character too large
>
> I think instead of "multi bytes" we need "multi characters" - not bytes.
>
> We decode utf8 then we have the character to decide if it is multi char
> or not.
>
> decoding '×' would consume bytes 195 and 151 the result is the decoded
> Unicode value of 215.
>
> It is not multi byte : 256*195 + 151 = 50071
>
> On the other hand 'ab' is "multi character" resulting
>
> 256 * 'a' + 'b' = 256*97+98= 24930
>
> One consequence is that
>
> 'ab' == '𤤰'
>
> But I don't think this is a problem. At least everything is defined.

What exactly do you mean by multi-byte characters? Is it a literal such
as 'ABCD'?

I've no idea what C makes of that, so you will first have to specify
what it might represent:

* Is it a single character represented by multiple bytes?
* If so, do those multiple bytes specify a Unicode number (2-3 bytes), or
  a UTF8 sequence (up to 4 bytes, maybe more)?
* If multiple such sequences are allowed, could one literal mix several
  ASCII/Unicode/UTF8 characters?

One problem with UTF8 in C character literals is that I believe those
are limited to an 'int' type, so typically 32 bits. You can't fit much
in there. And once you have such a value, how do you print it?

Some of this you can take care of in your 'cake' product, and superimpose
a particular spec on top of C (maybe they can be extended to 64 bits) but
you probably can't do much about 'printf'.

(In my language, I overhauled this part of it earlier this year. There
it works like this:

* Character literals can be 64 bits

* They can represent up to 8 ASCII characters: 'ABCDEFGH'

* They can include escape codes for both Unicode and UTF8, and multiple
  such characters can be specified:

     'A\u20ACB'          # All represent A€B; this is Unicode
     'A\h E2 82 AC\B'    # This is UTF8
     'A\xE2\x82\xACB'    # C-style escape

  Internally they are stored as UTF8, so the 20AC is converted to UTF8

* The ordering of the characters matches that of the equivalent
  "A\e20ACB" string when stored in memory; but this applies only to
  little-endian

* Print routines have options to print the first character (which can be
  a Unicode one), or the whole sequence)

Another aspect is when typing Unicode text directly via your text editor
instead of using escape codes; will the C source be UTF8, or some other
encoding? This will affect how the text is represented, and how much you
can fit into one 32/64-bit literal.
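
To make the two readings of '×' above concrete, and the "how do you print it?" question, here is a rough C sketch. It is only an illustration, not what any compiler or 'cake' actually does. The names utf8_decode, utf8_encode, pack_bytes and print_codepoint are made up for this post, the decoder is simplified (it does not reject overlong forms or surrogates), and printing assumes the terminal expects UTF8.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: decode one UTF8 sequence from s (n bytes available).
   Returns the number of bytes consumed and stores the code point in *cp,
   or returns 0 if the sequence is malformed or truncated.
   Simplified: overlong encodings and surrogates are not rejected. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    uint32_t v;

    assert(s != NULL && cp != NULL);
    if (n == 0) return 0;

    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else return 0;                       /* stray continuation or invalid lead byte */

    if (n < len) return 0;
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *cp = v;
    return len;
}

/* Hypothetical helper: encode cp as UTF8 into out[4].
   Returns the number of bytes written, or 0 if cp is above 0x10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* The "multi byte" reading: pack the raw bytes, first byte most significant,
   i.e. the 256*195 + 151 arithmetic from the quoted post. */
uint32_t pack_bytes(const unsigned char *s, size_t n)
{
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = v * 256 + s[i];
    return v;
}

/* Print one code point by re-encoding it as UTF8 bytes.
   Assumption: whatever reads the stream interprets it as UTF8. */
int print_codepoint(uint32_t cp, FILE *f)
{
    unsigned char buf[4];
    size_t len = utf8_encode(cp, buf);
    if (len == 0) return -1;
    return fwrite(buf, 1, len, f) == len ? 0 : -1;
}
```

With the bytes 195 and 151, utf8_decode gives 215 (the Unicode value of '×'), while pack_bytes gives 50071; pack_bytes on "ab" gives 24930. So the value of the literal depends entirely on which of the two rules is specified, and printing needs the re-encoding step because 'printf' has no portable conversion for a bare code point.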