From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.lang.c
Subject: Re: "undefined behavior"?
Date: Thu, 13 Jun 2024 14:42:22 +0200
Message-ID: <v4epff$2912j$1@dont-email.me>
References: <666a095a$0$952$882e4bbb@reader.netnews.com> <v4d4h5$1rc9e$1@dont-email.me> <877cet7qkl.fsf@nosuchdomain.example.com>
In-Reply-To: <877cet7qkl.fsf@nosuchdomain.example.com>

On 13/06/2024 00:18, Keith Thompson wrote:
> David Brown <david.brown@hesbynett.no> writes:
> [...]
>> I recommend never using "char" as a type unless you really mean a
>> character, limited to 7-bit ASCII.  So if your "outliers" array really
>> is an array of such characters, "char" is fine.  If it is intended to
>> be numbers and for some reason you specifically want 8-bit values, use
>> "uint8_t" or "int8_t", and initialise with { 0 }.
> [...]
>
> The implementation-definedness of plain char is awkward, but char
> arrays generally work just fine for UTF-8 strings.

Yes, but "generally work" is not quite as strong as I would like.  My
preference for UTF-8 strings is a const unsigned char type (with C23, it
will be char8_t, which is defined to be the same type as "unsigned char").
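
As a rough illustration of why the storage type matters, here is a minimal
sketch that reads the same UTF-8 code units once as "unsigned char" and once
through a plain "char" pointer.  The names e_acute, read_as_unsigned_char and
read_as_plain_char are hypothetical, made up for this example only.

```c
#include <limits.h>
#include <stddef.h>

/* "é" encoded in UTF-8: the two code units 0xC3 0xA9, plus a terminator. */
static const unsigned char e_acute[] = { 0xC3, 0xA9, 0x00 };

/* Reads code unit i as an unsigned char: always a value in 0..255. */
int read_as_unsigned_char(const unsigned char *s, size_t i)
{
    return s[i];
}

/* Reads code unit i through a plain char pointer, the way code that keeps
   UTF-8 in char arrays would see it.  For bytes above 127 the result is
   implementation-defined: negative where plain char is signed. */
int read_as_plain_char(const unsigned char *s, size_t i)
{
    const char *p = (const char *)s;
    return p[i];
}
```

Read as "unsigned char", the two code units are 195 and 169 everywhere.  Read
as plain "char", they come out as -61 and -87 on a typical two's complement
implementation where CHAR_MIN is negative, and as 195 and 169 where plain
char is unsigned.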

But u8"Hello, world" UTF-8 string literals (since C11) are considered to
be like an array of type "char" in C (until C23), so I guess UTF-8
strings will be safe in plain char arrays.  Still, the bytes in a UTF-8
string are code units with values between 0 and 255, so I prefer to
store these in a type that can hold that range of values.

(What happens if you have a platform that uses ones' complement
arithmetic, with "char" being signed and a range of -127 to +127, and
you have a u8"..." string which has a code unit of 0x80 that cannot be
represented in "char"?  It's just a hypothetical question, of course.)

> If char is
> signed, byte values greater than 127 will be stored as negative
> values, but it will almost certainly just work (if your system
> is configured to handle UTF-8).  Likewise for Latin-1 and similar
> 8-bit character sets.
>
> The standard string functions operate on arrays of plain char, so
> storing UTF-8 strings in arrays of uint8_t or unsigned char will
> seriously restrict what you can do with them.
>
> (I'd like to see a future standard require plain char to be unsigned,
> but I don't know how likely that is.)
>

I would also prefer that, but too much existing code relies on plain
char being signed on the platforms it runs on.  I personally think that
calling character types "signed" or "unsigned" is a very poor choice of
names, but it's way too late to change that!

C23 has "char8_t" which is always unsigned.  (In C23, "char8_t" is
defined in <uchar.h> and is the same type as "unsigned char".  In C++20,
in contrast, "char8_t" is a keyword and a distinct type with identical
size and range to "unsigned char".)
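
One common compromise between the two positions above can be sketched as
follows: keep the string in a plain char array so that the <string.h>
functions still apply, and convert each byte to "unsigned char" only at the
point where its numeric value is inspected.  The helper names
utf8_code_units and utf8_code_points are hypothetical, and the code-point
count assumes the input is already valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Number of code units (bytes) before the terminator.  strlen works
   directly because the string is stored as plain char. */
size_t utf8_code_units(const char *s)
{
    return strlen(s);
}

/* Number of code points, assuming valid UTF-8: count every byte that is
   not a continuation byte (bit pattern 10xxxxxx). */
size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; ++s) {
        /* Convert before testing, so the comparison sees 0..255 even
           where plain char is signed. */
        unsigned char byte = (unsigned char)*s;

        if ((byte & 0xC0u) != 0x80u) {
            ++count;
        }
    }
    return count;
}
```

For the string "caf\xC3\xA9" ("café" written with explicit UTF-8 bytes),
utf8_code_units gives 5 and utf8_code_points gives 4.  Storing the same
bytes in an unsigned char array would need a cast at every call to strlen
and its relatives, which is the restriction described in the quoted text.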