Article <2024May12.110053@mips.complang.tuwien.ac.at>

Path: ...!weretis.net!feeder9.news.weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Unicode in strings (was: Byte Addressability And Beyond)
Date: Sun, 12 May 2024 09:00:53 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 116
Message-ID: <2024May12.110053@mips.complang.tuwien.ac.at>
References: <v0s17o$2okf4$2@dont-email.me> <4e0557bec2acda4df76f1ed01ebcbdf6@www.novabbs.org> <v1ep4i$1ptf$1@gal.iecc.com> <v1eruj$3o1r8$1@dont-email.me> <v1h8l6$1ttd$1@gal.iecc.com> <v1kifk$17qh0$1@dont-email.me> <2024May10.182047@mips.complang.tuwien.ac.at> <v1ns43$2260p$1@dont-email.me> <2024May11.173149@mips.complang.tuwien.ac.at> <v1preb$2jn47$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 12 May 2024 12:36:39 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="fa63f79be1b668e15e3900e1b4d19fc8";
	logging-data="2823554"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/0O3GxUC7zEg6b1IZc+m1D"
Cancel-Lock: sha1:OER+WQrcqH8PXcWHPhj0IwiQgRc=
X-newsreader: xrn 10.11
Bytes: 6281

Thomas Koenig <tkoenig@netcologne.de> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> schrieb:
>> The point I wanted to make is that there is the frequent
>> misconception that dealing with individual arbitrary characters is
>> something that is relatively common, and that one can do that by using
>> UTF-32 (or UTF-16); it isn't, and one cannot.
>
>Do you really mean one cannot change an individual character
>using UTF-32?

Correct.  That's the "one cannot" part.  A Unicode code point is not
a character, and what UTF-32 gives you is one code point per code unit
(a code unit is a fixed-size container, 32 bits for UTF-32, 8 bits for
UTF-8), not one character per code unit.  But Unicode supports
characters that consist of a sequence of several code points, see
<https://en.wikipedia.org/wiki/Combining_character>, so if you just
store one Unicode code point at the address where a different code
point currently is, you have not overwritten a character, just a code
point; admittedly, the result is that you have changed one or two
characters, but that's probably not what the user wanted.
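
A minimal Python sketch of the distinction between code units, code
points and characters, using the same string as the Gforth example
further on.  The character count here is only an approximation (every
non-combining code point is assumed to start a character); it is not
the full Unicode grapheme-cluster algorithm.

```python
import unicodedata

s = "Ko\u0308nig"  # K, o, U+0308 COMBINING DIAERESIS, n, i, g

code_points = len(s)                            # Python 3 counts code points
utf32_units = len(s.encode("utf-32-le")) // 4   # one 32-bit unit per code point
utf8_units = len(s.encode("utf-8"))             # one 8-bit unit per byte

# Approximation: a code point with combining class 0 starts a new character.
characters = sum(1 for c in s if unicodedata.combining(c) == 0)

print(code_points, utf32_units, utf8_units, characters)
```

Neither the UTF-32 nor the UTF-8 code-unit count equals the number of
characters.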

E.g., consider the following Gforth code (others can tell you how to
do it in Python):

"Ko\u0308nig" cr type

The output is:

König

That is, the second character consists of two Unicode code points, the
"o" and the "\u0308" (Combining Diaeresis).

(I think that somewhere along the way from the Forth system to the
xterm through copying and pasting into Emacs the second character has
become precomposed, but that's probably just as well, so you can see
what I see).

If I replace the third code point with an e, I get "Koenig".  So by
overwriting one code point, I insert a character into the string.

If instead I replace the second code point with a "\u0316" (Combining
Grave Accent Below):

"K\u0316\u0308nig" cr type

I get this (which looks as expected in my xterm, but not in Emacs):

K̖̈nig

The first character is now a K with a diaeresis above and a grave
accent below, and there are now a total of 4 characters, but still 6
code points in the string; the second character has been deleted by
this code-point replacement.
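
The two replacements can be reproduced in Python with a rough sketch.
The helper names are hypothetical, and the character count again
assumes that every non-combining code point starts a character.

```python
import unicodedata

def count_characters(s):
    """Approximate character count: non-combining code points start one."""
    return sum(1 for c in s if unicodedata.combining(c) == 0)

def overwrite_code_point(s, index, new):
    """Return a copy of s with the code point at index replaced by new."""
    return s[:index] + new + s[index + 1:]

original = "Ko\u0308nig"
inserted = overwrite_code_point(original, 2, "e")       # third code point
deleted = overwrite_code_point(original, 1, "\u0316")   # second code point

for s in (original, inserted, deleted):
    # Code-point count stays the same; character count does not.
    print(len(s), count_characters(s))
```

All three strings have 6 code points, but 5, 6 and 4 characters
respectively.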

Back to replacing characters instead of overwriting code points: If
you want to replace the second character, you would need to replace
two code points; if the replacement of the character has only one code
point or more than two, you need to move the remaining three
characters.  You have this problem whether the string is represented
as UTF-32 or UTF-8.
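
A sketch of such a character replacement in Python, under the same
approximation of a character as a base code point plus following
combining marks.  The function names are made up for illustration.  It
replaces the second character with a precomposed U+00F6 and reports
where the tail "nig" starts before and after, in both encodings.

```python
import unicodedata

def character_spans(s):
    """Return (start, end) code-point index pairs, one per character."""
    spans = []
    for i, c in enumerate(s):
        if unicodedata.combining(c) == 0 or not spans:
            spans.append([i, i + 1])   # base code point starts a character
        else:
            spans[-1][1] = i + 1       # combining mark extends the last one
    return [tuple(span) for span in spans]

def replace_character(s, n, replacement):
    """Return a copy of s with its n-th character replaced."""
    start, end = character_spans(s)[n]
    return s[:start] + replacement + s[end:]

s = "Ko\u0308nig"
t = replace_character(s, 1, "\u00f6")  # one code point instead of two

tail_offsets = {}
for encoding in ("utf-8", "utf-32-le"):
    tail = "nig".encode(encoding)
    # Byte offset of the remaining three characters before and after.
    tail_offsets[encoding] = (s.encode(encoding).index(tail),
                              t.encode(encoding).index(tail))
print(tail_offsets)
```

The tail moves in both representations (from byte 4 to 3 in UTF-8,
from byte 12 to 8 in UTF-32).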

>I assume you mean "there is no need to do it"..

That, too.  That is the "it isn't" part of the statement.

>> If you stick with UTF-8
>> and use byte lengths and byte indexes, you can do almost everything as
>> well or better (with less complication and more efficiently) as by
>> converting to UTF-32 and back.
>
>Assume you're implementing a language which has a function of
>setting an individual character in a string.

That's a design mistake in the language, and I know no language that
has this misfeature.

Instead, what we see is one language (Python3) that has an even worse
misfeature: Strings are indexed by individual code point (Python3
strings are immutable, so a code point is "set" by building a new
string around the index); see above for the things you get when you
overwrite code points.

But why would one want to set individual code points?  What about
setting individual code units (in the case of UTF-8, the code unit is
a byte) or bits?  If you think that replacing parts of a character is
a feature, why not go all the way?

>How would you implement it?  Run through the string?

You have to do that anyway, because of combining characters.
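
A rough sketch of that scan over UTF-8 data with byte offsets, in
Python.  The function name is hypothetical, and a character is again
approximated as a non-combining code point plus the combining marks
that follow it.

```python
import unicodedata

def char_byte_span(data, n):
    """Return (start, end) byte offsets of the n-th character in UTF-8 data."""
    offset = 0
    start = None
    count = -1
    for c in data.decode("utf-8"):
        # A non-combining code point (or the very first one) starts a character.
        if unicodedata.combining(c) == 0 or count < 0:
            count += 1
            if count == n:
                start = offset
            elif count == n + 1:
                return start, offset
        offset += len(c.encode("utf-8"))
    if start is None:
        raise IndexError("character index out of range")
    return start, offset   # the requested character is the last one

data = "Ko\u0308nig".encode("utf-8")
print(char_byte_span(data, 1))
```

The scan is linear in the string length whatever the encoding, because
the number of code points per character is not fixed.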

>Would you then also
>store additional information somewhere so that the next character
>that the user sets does not need to do it again?

Probably not.  I would discourage the users from using this misfeature
and steer them to better alternatives.

Alternatively, if it's a really important misfeature, I would use an
editing-friendly string representation (maybe a piece table or rope)
and/or maybe do some Python3-style craziness and have the string be
represented by an array of characters, with every character
represented by a pointer into a UTF-8 sequence.
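
A minimal Python sketch of the second idea, with byte offsets standing
in for pointers.  The class and method names are hypothetical, the
character segmentation is the same approximation as before, and the
offset table is simply rebuilt after each edit rather than patched.

```python
import unicodedata

class CharIndexedString:
    """Hypothetical sketch: UTF-8 bytes plus a table of character offsets."""

    def __init__(self, text):
        self._rebuild(text.encode("utf-8"))

    def _rebuild(self, data):
        self.data = data
        self.starts = []            # byte offset where each character begins
        offset = 0
        for c in data.decode("utf-8"):
            # Approximation: a non-combining code point starts a character.
            if unicodedata.combining(c) == 0 or not self.starts:
                self.starts.append(offset)
            offset += len(c.encode("utf-8"))
        self.starts.append(offset)  # sentinel: end of the last character

    def __len__(self):
        return len(self.starts) - 1

    def __getitem__(self, n):
        if not 0 <= n < len(self):
            raise IndexError("character index out of range")
        return self.data[self.starts[n]:self.starts[n + 1]].decode("utf-8")

    def set_char(self, n, replacement):
        if not 0 <= n < len(self):
            raise IndexError("character index out of range")
        head = self.data[:self.starts[n]]
        tail = self.data[self.starts[n + 1]:]
        # Rebuilding the table is linear in the string length.
        self._rebuild(head + replacement.encode("utf-8") + tail)

s = CharIndexedString("Ko\u0308nig")
second = s[1]           # "o" followed by U+0308: two code points
s.set_char(1, "oe")     # the replacement is itself two characters
print(len(s), s.data.decode("utf-8"))
```

Indexing becomes constant-time, at the cost of an extra offset per
character and of keeping the table consistent on every edit.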

In the case of Python3, the sequence seems to have been that they
started out with the bad idea that indexing a string by code point is
the way to go, and then designed a first implementation catering to
that premise, and published it without reconsidering the premise,
despite the efficiency cost.  And of course it was too inefficient for
some use cases, but it was too late to switch to a more sensible
design, so they invented the more complex, but more efficient (than
the first implementation) PEP 393 implementation.
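
One way to observe the PEP 393 representation is the following sketch,
which assumes CPython 3.3 or later (other Python implementations may
store strings differently).  The helper name is made up.  It measures
how much a string object grows per added code point, depending on the
widest code point in the string.

```python
import sys

def bytes_per_code_point(ch, n=1000):
    """Growth in object size per added code point, for ch repeated."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

# ASCII, a code point above U+00FF, and a code point above U+FFFF.
widths = [bytes_per_code_point(ch) for ch in ("a", "\u0308", "\U0001F600")]
print(widths)
```

On CPython this should report 1, 2 and 4 bytes per code point: the
string is stored with a fixed width chosen from its widest code point,
which keeps indexing by code point constant-time.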

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>