Article <v4epff$2912j$1@dont-email.me>

Path: ...!3.eu.feeder.erje.net!feeder.erje.net!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.lang.c
Subject: Re: "undefined behavior"?
Date: Thu, 13 Jun 2024 14:42:22 +0200
Organization: A noiseless patient Spider
Lines: 54
Message-ID: <v4epff$2912j$1@dont-email.me>
References: <666a095a$0$952$882e4bbb@reader.netnews.com>
 <v4d4h5$1rc9e$1@dont-email.me> <877cet7qkl.fsf@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 13 Jun 2024 14:42:23 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="a4019a0b35be2744ae8acc392e7d37ca";
	logging-data="2393171"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+F4LwJlHhY9xMis5xj425b/RIw/O8asfE="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.11.0
Cancel-Lock: sha1:DBCK0oXl03YZf8u3cJeYXJh2do0=
Content-Language: en-GB
In-Reply-To: <877cet7qkl.fsf@nosuchdomain.example.com>
Bytes: 3754

On 13/06/2024 00:18, Keith Thompson wrote:
> David Brown <david.brown@hesbynett.no> writes:
> [...]
>> I recommend never using "char" as a type unless you really mean a
>> character, limited to 7-bit ASCII.  So if your "outliers" array really
>> is an array of such characters, "char" is fine.  If it is intended to
>> be numbers and for some reason you specifically want 8-bit values, use
>> "uint8_t" or "int8_t", and initialise with { 0 }.
> [...]
> 
> The implementation-definedness of plain char is awkward, but char
> arrays generally work just fine for UTF-8 strings.

Yes, but "generally work" is not quite as strong as I would like.  My 
preference for UTF-8 strings is a const unsigned char type (with C23, it 
will be char8_t, which is defined to be the same type as "unsigned 
char").  But u8"Hello, world" UTF-8 string literals (since C11) are 
considered to be like an array of type "char" in C (until C23), so I 
guess UTF-8 strings will be safe in plain char arrays.  Still, the bytes 
in a UTF-8 string are code units with values between 0 and 255, so I 
prefer to store these in a type that can hold that range of values.
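
As a minimal sketch of the point about code unit values (the function 
name "utf8_count_code_points" is hypothetical, not from any library, and 
the input is assumed to be valid UTF-8), each byte can be read through 
"unsigned char" so that the bit test sees a value in 0..255 whether 
plain char is signed or not:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the code points in a NUL-terminated
   UTF-8 string.  Assumes valid UTF-8; no validation is done. */
size_t utf8_count_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        /* Convert the code unit to unsigned char so it lies in 0..255
           even on platforms where plain char is signed. */
        unsigned char unit = (unsigned char)*s;
        /* Continuation bytes have the form 10xxxxxx; any other byte
           starts a new code point. */
        if ((unit & 0xC0u) != 0x80u)
            count++;
    }
    return count;
}
```

The string can stay in a plain char array; only the inspection of 
individual code units goes through the unsigned type.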

(What happens if you have a platform that uses ones' complement 
arithmetic, with "char" being signed and a range of -127 to +127, and 
you have a u8"..." string which has a code unit of 0x80 that cannot be 
represented in "char" ?  It's just a hypothetical question, of course.)


>  If char is
> signed, byte values greater than 127 will be stored as negative
> values, but it will almost certainly just work (if your system
> is configured to handle UTF-8).  Likewise for Latin-1 and similar
> 8-bit character sets.
> 
> The standard string functions operate on arrays of plain char, so
> storing UTF-8 strings in arrays of uint8_t or unsigned char will
> seriously restrict what you can do with them.
> 
> (I'd like to see a future standard require plain char to be unsigned,
> but I don't know how likely that is.)
> 

I would also prefer that, but too much existing code relies on plain 
char being signed on the platforms it runs on.  I personally think 
"signed" and "unsigned" are very poor names to attach to character 
types, but it's way too late to change that!  C23 has "char8_t", which 
is always unsigned.

(In C23, "char8_t" is defined in <uchar.h> and is the same type as 
"unsigned char".  In C++20, in contrast, "char8_t" is a keyword and a 
distinct type with identical size and range to "unsigned char".)
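
On the quoted point about the standard string functions: a rough sketch 
of what using them with unsigned code units looks like in practice.  The 
typedef "utf8_unit" and both function names are made up for 
illustration; "utf8_unit" stands in for C23's "char8_t" so that the 
sketch does not depend on a C23 <uchar.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for C23's char8_t (same underlying type),
   so the sketch also builds with pre-C23 compilers. */
typedef unsigned char utf8_unit;

/* Length in code units (bytes), not in code points. */
size_t utf8_unit_length(const utf8_unit *s)
{
    assert(s != NULL);
    /* strlen takes const char *, so a cast is needed.  Accessing the
       bytes through a character type is always permitted. */
    return strlen((const char *)s);
}

/* Byte-wise ordering of two NUL-terminated UTF-8 strings.  strcmp
   compares the bytes as unsigned char, whatever the signedness of
   plain char. */
int utf8_unit_compare(const utf8_unit *a, const utf8_unit *b)
{
    assert(a != NULL && b != NULL);
    return strcmp((const char *)a, (const char *)b);
}
```

The casts are the cost of keeping the data in an unsigned type; small 
wrappers like these keep them in one place.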