From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.arch
Subject: Re: Byte Addressability And Beyond
Date: Sat, 11 May 2024 18:49:12 +0200
Organization: A noiseless patient Spider
Lines: 67
Message-ID: <v1o7i8$24m7i$1@dont-email.me>
References: <v0s17o$2okf4$2@dont-email.me>
 <4e0557bec2acda4df76f1ed01ebcbdf6@www.novabbs.org>
 <v1ep4i$1ptf$1@gal.iecc.com> <v1eruj$3o1r8$1@dont-email.me>
 <v1h8l6$1ttd$1@gal.iecc.com> <v1kifk$17qh0$1@dont-email.me>
 <2024May10.182047@mips.complang.tuwien.ac.at> <v1ns43$2260p$1@dont-email.me>
 <2024May11.173149@mips.complang.tuwien.ac.at>
In-Reply-To: <2024May11.173149@mips.complang.tuwien.ac.at>

On 11/05/2024 17:31, Anton Ertl wrote:
> David Brown <david.brown@hesbynett.no> writes:
>> On 10/05/2024 18:20, Anton Ertl wrote:
>>> 1) I only came up with the following use cases where you need to deal
>>> with individual non-ASCII characters: Palindrome checkers and anagram
>>> programs; I remember somebody mentioning a third use (which I promptly
>>> forgot), but anyway, there are few cases.
>>>
>>> 2) But even for those few cases, UTF-32 is not good enough, because a
>>> code point is not a character.
>>>
>>
>> I agree that it is usually unnecessary to convert to UTF-32 - I am
>> merely saying that /if/ you feel you want to expand the code points,
>> then UTF-32 is fine for the purpose and you should not have to worry
>> about endianness because you should not be moving it off your computer,
>> thus native endianness is all you need.
> 
> Yes.  The point I wanted to make is that there is the frequent
> misconception that dealing with individual arbitrary characters is
> something that is relatively common, and that one can do that by using
> UTF-32 (or UTF-16); it isn't, and one cannot.  If you stick with UTF-8
> and use byte lengths and byte indexes, you can do almost everything as
> well or better (with less complication and more efficiently) as by
> converting to UTF-32 and back.
> 

Agreed.
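
To make the byte-index point concrete, here is a minimal sketch in Python (chosen only for illustration). The helper name find_bytes and the sample text are my own assumptions, not anything from the thread. It locates a substring in UTF-8 data and describes it as a start index and length in bytes, with no decoding to code points along the way.

```python
def find_bytes(haystack: bytes, needle: bytes):
    """Return (start, length) in bytes of the first match, or None."""
    start = haystack.find(needle)
    if start < 0:
        return None
    return start, len(needle)


# Sample text with several two-byte UTF-8 sequences (escapes keep this ASCII).
text = "na\u00efve caf\u00e9, se\u00f1or".encode("utf-8")
needle = "caf\u00e9".encode("utf-8")

span = find_bytes(text, needle)
start, length = span

# Slicing by byte offsets yields valid UTF-8 again.
substring = text[start:start + length]
```

Because UTF-8 is self-synchronising, a match of valid UTF-8 inside valid UTF-8 always starts and ends on a code point boundary, so the slice can be decoded or passed on as-is.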

>> People sometimes say they want to expand to code points to be able to
>> see the length of the string in characters, or to index characters, or
>> for easier splicing or joining strings.  I don't think these are
>> particularly useful in practice, but UTF-32 is fine for those that want it.
> 
> Looking up "splicing strings", I find that this is a term used in
> connection with Python for specifying substrings.  Python3 is a
> language that lives the codepoint mistake to the extreme (and from
> what I read, this was one of the major pain points in the
> Python2->Python3 transition), but anyway, with UTF-8 one way to
> represent a substring is to use the start index and length in bytes
> (aka code units) rather than code points.
> 

I was not thinking of Python in particular, and I don't think the term
"splicing" is Python-specific.  But Python is a popular language, and
generally a good one, when you need to do lots of text manipulation, so
maybe that's where the association comes from (at least for search
engines).

People often think it is easier to do string manipulation - joining,
splitting, replacing, etc. - when you have fixed-size units per
character.  I agree with you that this is not actually true, especially
if you want to support arbitrary Unicode text, where a character (such
as a base letter with combining marks) may not fit in a single code
point.  But it is not uncommon to think it is, and if you can restrict
the text you support (specifically, to characters that are each a
single code point) then UTF-32 can be helpful.  (I think everyone will
at least agree that it's better than UTF-16!)
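
As a rough illustration of why one code point per unit is not one character per unit, here is a small Python sketch. The example strings are my own. It uses "e" followed by U+0301 (COMBINING ACUTE ACCENT), which displays as a single accented letter but is two code points, and therefore two UTF-32 units.

```python
import unicodedata

composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT

# Python's len() counts code points, so the two spellings differ in length.
lengths = (len(composed), len(decomposed))

# UTF-32 uses 4 bytes per code point, so the decomposed form takes 8 bytes.
utf32_size = len(decomposed.encode("utf-32-le"))

# Normalising to NFC happens to merge this particular pair ...
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# ... but a naive reversal by code point detaches the accent from its "e".
word = "cafe\u0301"
reversed_by_code_point = word[::-1]
```

After the reversal the combining accent is the first code point of the string, with no base letter in front of it. That is the kind of breakage a palindrome checker working on UTF-32 units would run into.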

> Looking up "joining strings" brings up the Python join() method, which
> is a variant of string concatenation.  There is certainly no need to
> convert UTF-8 to UTF-32 and back for concatenating strings; just
> concatenate the UTF-8 strings.
> 

Sure.
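
For completeness, a tiny Python sketch of that (the strings are made up for the example): concatenating two valid UTF-8 byte strings gives valid UTF-8, and the byte length of the result is simply the sum of the parts.

```python
# Two UTF-8 encoded pieces (escapes keep the source ASCII).
a = "Gr\u00fc\u00dfe, ".encode("utf-8")
b = "\u4e16\u754c".encode("utf-8")

# Plain byte concatenation; no decoding to code points is involved.
joined = a + b

# The same applies to joining with a separator.
joined_with_sep = b" | ".join([a, b])
```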