Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Thiago Adams <thiago.adams@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: multi bytes character - how to make it defined behavior?
Date: Wed, 14 Aug 2024 10:31:59 -0300
Organization: A noiseless patient Spider
Lines: 93
Message-ID: <v9ibkf$e3c6$2@dont-email.me>
References: <v9frim$3u7qi$1@dont-email.me> <v9grjd$4cjd$1@dont-email.me>
 <87sev8eydx.fsf@nosuchdomain.example.com> <v9i54d$e3c6$1@dont-email.me>
 <v9ia2i$f3p2$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 14 Aug 2024 15:32:00 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="f8b9718c855d7d90766802f0edc42262";
 logging-data="462214"; mail-complaints-to="abuse@eternal-september.org";
 posting-account="U2FsdGVkX19nlG8fOeh86azFllu2mOdse9NLVn6OnmY="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:beJ2Eyqv4AUy9L1E9E+cDRYBtsg=
In-Reply-To: <v9ia2i$f3p2$1@dont-email.me>
Content-Language: en-US
Bytes: 4484

On 14/08/2024 10:05, Bart wrote:
> On 14/08/2024 12:41, Thiago Adams wrote:
>> On 13/08/2024 21:33, Keith Thompson wrote:
>>> Bart <bc@freeuk.com> writes:
>>> [...]
>>>> What exactly do you mean by multi-byte characters? Is it a literal
>>>> such as 'ABCD'?
>>>>
>>>> I've no idea what C makes of that,
>>> It's a character constant of type int with an implementation-defined
>>> value. Read the section on "Character constants" in the C standard
>>> (6.4.4.4 in C17).
>>>
>>> (With gcc, its value is 0x41424344, but other compilers can and do
>>> behave differently.)
>>>
>>> We discussed this at some length several years ago.
>>>
>>> [...]
>>
>>
>> "An integer character constant has type int.
>> The value of an integer character constant containing
>> a single character that maps to a single value in the literal encoding
>> (6.2.9) is the numerical value of the representation of the mapped
>> character in the literal encoding interpreted as an integer.
>> The value of an integer character constant containing more than one
>> character (e.g. 'ab'), or containing a character or escape sequence
>> that does not map to a single value in the literal encoding,
>> is implementation-defined. If an integer character constant contains a
>> single character or escape sequence, its value is the one that results
>> when an object with type char whose value is that of the
>> single character or escape sequence is converted to type int."
>>
>>
>> I am suggesting we define this:
>>
>> "The value of an integer character constant containing more than one
>> character (e.g. 'ab'), or containing a character or escape sequence
>> that does not map to a single value in the literal encoding, is
>> implementation-defined."
>>
>> How?
>>
>> First, all source code should be utf8.
>>
>> Then I am suggesting we first decode the bytes.
>>
>> For instance, '×' is encoded with 195 and 151. We consume these 2
>> bytes and the utf8 decoded value is 215.
>
> By that you mean the Unicode index. But you say elsewhere that
> everything in your source code is UTF8.

215 is the Unicode code point of the character '×'.

> Where then does the 215 appear? Do your char* strings use 215 for ×, or
> do they use 195 and 215?

215 is the result of decoding the two UTF-8 encoded bytes (195 and 151).

> I think this is why C requires those prefixes like u8'...'.
>>
>> Then this is the defined behavior
>>
>> static_assert('×' == 215)
>
> This is where you need to decide whether the integer value within '...',
> AT RUNTIME, represents the Unicode index or the UTF8 sequence.

Why runtime? It is compile time.
This is why source code must be universally encoded (UTF-8).

> (In my language, though I do very little with Unicode ATM, I decided
> that everything is UTF8 both at compile time and runtime. Unless I
> explicitly expand a UTF8 u8[] string to a u32[] or i32[] array (either
> will work), which contains 21-bit Unicode index values.)
>
> I get the impression that C's wide characters are intended for those
> Unicode indices, but that's not going to work well on Windows with its
> 16-bit wide character type.
>

Nowadays wide characters are just for Windows API compatibility.
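
For anyone who wants to check the 195, 151 -> 215 arithmetic, here is a minimal sketch of the decode step a compiler could apply to the bytes between the quotes. The name utf8_decode and its signature are made up for illustration; they are not part of the C standard library or of any proposal text, and real compilers may structure this differently.

```c
#include <assert.h>  /* only needed for the usage check described below */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (an assumption for illustration, not a standard API).
   Decodes one UTF-8 sequence starting at s, with n bytes available.
   On success stores the code point in *cp and returns the number of
   bytes consumed; returns 0 on truncated or malformed input. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;    /* total length of the sequence in bytes */
    uint32_t v;    /* code point being assembled */
    uint32_t min;  /* smallest value allowed for this length */

    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                  /* 0xxxxxxx: plain ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) { /* 110xxxxx: 2-byte sequence */
        len = 2; v = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) { /* 1110xxxx: 3-byte sequence */
        len = 3; v = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) { /* 11110xxx: 4-byte sequence */
        len = 4; v = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                       /* stray continuation or invalid lead byte */
    }

    if (n < len)
        return 0;                       /* sequence is truncated */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;                   /* not a 10xxxxxx continuation byte */
        v = (v << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (v < min || v > 0x10FFFF || (v >= 0xD800 && v <= 0xDFFF))
        return 0;

    *cp = v;
    return len;
}
```

Called on the two bytes 195 and 151 with n = 2, this returns 2 and stores 215 in *cp: the lead byte 195 (0xC3) contributes the bits 00011, the continuation byte 151 (0x97) contributes 010111, and (3 << 6) | 23 is 215. That is the value the suggested rule would give '×'.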