Path: ...!3.eu.feeder.erje.net!feeder.erje.net!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.lang.c
Subject: Re: "undefined behavior"?
Date: Thu, 13 Jun 2024 14:42:22 +0200
Organization: A noiseless patient Spider
Lines: 54
Message-ID: <v4epff$2912j$1@dont-email.me>
References: <666a095a$0$952$882e4bbb@reader.netnews.com>
<v4d4h5$1rc9e$1@dont-email.me> <877cet7qkl.fsf@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 13 Jun 2024 14:42:23 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="a4019a0b35be2744ae8acc392e7d37ca";
logging-data="2393171"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+F4LwJlHhY9xMis5xj425b/RIw/O8asfE="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
Thunderbird/102.11.0
Cancel-Lock: sha1:DBCK0oXl03YZf8u3cJeYXJh2do0=
Content-Language: en-GB
In-Reply-To: <877cet7qkl.fsf@nosuchdomain.example.com>
Bytes: 3754

On 13/06/2024 00:18, Keith Thompson wrote:
> David Brown <david.brown@hesbynett.no> writes:
> [...]
>> I recommend never using "char" as a type unless you really mean a
>> character, limited to 7-bit ASCII. So if your "outliers" array really
>> is an array of such characters, "char" is fine. If it is intended to
>> be numbers and for some reason you specifically want 8-bit values, use
>> "uint8_t" or "int8_t", and initialise with { 0 }.
> [...]
>
> The implementation-definedness of plain char is awkward, but char
> arrays generally work just fine for UTF-8 strings.

Yes, but "generally work" is not quite as strong as I would like. My
preference for UTF-8 strings is a const unsigned char type (in C23 that
is char8_t, which is defined to be the same type as "unsigned char").
But u8"Hello, world" UTF-8 string literals (available since C11) have
type array of "char" in C until C23, so I guess UTF-8 strings will be
safe in plain char arrays. Still, the bytes in a UTF-8 string are code
units with values between 0 and 255, so I prefer to store them in a
type that can hold that range of values.

(What happens if you have a platform that uses ones' complement
arithmetic, with "char" being signed and a range of -127 to +127, and
you have a u8"..." string which has a code unit of 0x80 that cannot be
represented in "char" ? It's just a hypothetical question, of course.)
> If char is
> signed, byte values greater than 127 will be stored as negative
> values, but it will almost certainly just work (if your system
> is configured to handle UTF-8). Likewise for Latin-1 and similar
> 8-bit character sets.
>
> The standard string functions operate on arrays of plain char, so
> storing UTF-8 strings in arrays of uint8_t or unsigned char will
> seriously restrict what you can do with them.
>
> (I'd like to see a future standard require plain char to be unsigned,
> but I don't know how likely that is.)
>
I would also prefer that, but too much existing code relies on plain
char being signed on the platforms it runs on. I personally think that
calling character types "signed" or "unsigned" is a very poor choice of
terms, but it's way too late to change that! C23 has "char8_t", which
is always unsigned.

(In C23, "char8_t" is defined in <uchar.h> and is the same type as
"unsigned char". In C++20, in contrast, "char8_t" is a keyword and a
distinct type with identical size and range to "unsigned char".)