Path: ...!weretis.net!feeder9.news.weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Unicode in strings (was: Byte Addressability And Beyond)
Date: Sun, 12 May 2024 09:00:53 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 116
Message-ID: <2024May12.110053@mips.complang.tuwien.ac.at>
References: <v0s17o$2okf4$2@dont-email.me> <4e0557bec2acda4df76f1ed01ebcbdf6@www.novabbs.org> <v1ep4i$1ptf$1@gal.iecc.com> <v1eruj$3o1r8$1@dont-email.me> <v1h8l6$1ttd$1@gal.iecc.com> <v1kifk$17qh0$1@dont-email.me> <2024May10.182047@mips.complang.tuwien.ac.at> <v1ns43$2260p$1@dont-email.me> <2024May11.173149@mips.complang.tuwien.ac.at> <v1preb$2jn47$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 12 May 2024 12:36:39 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="fa63f79be1b668e15e3900e1b4d19fc8"; logging-data="2823554"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/0O3GxUC7zEg6b1IZc+m1D"
Cancel-Lock: sha1:OER+WQrcqH8PXcWHPhj0IwiQgRc=
X-newsreader: xrn 10.11
Bytes: 6281

Thomas Koenig <tkoenig@netcologne.de> writes:
>Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>> The point I wanted to make is that there is the frequent
>> misconception that dealing with individual arbitrary characters is
>> something that is relatively common, and that one can do that by using
>> UTF-32 (or UTF-16); it isn't, and one cannot.
>
>Do you really mean one cannot change an individual character
>using UTF-32?

Correct. That's the "one cannot" part. A Unicode code point is not a
character, and what UTF-32 gives you is one code point per code unit
(a code unit is a fixed-size container: 32 bits for UTF-32, 8 bits for
UTF-8), not one character per code unit.
But Unicode supports characters that consist of a sequence of several
code points, see
<https://en.wikipedia.org/wiki/Combining_character>, so if you just
store one Unicode code point at the address where a different code
point currently resides, you have not overwritten a character, just a
code point; admittedly, the result is that you have changed one or two
characters, but that's probably not what the user wanted.

E.g., consider the following Gforth code (others can tell you how to
do it in Python):

"Ko\u0308nig" cr type

The output is:

König

That is, the second character consists of two Unicode code points, the
"o" and the "\u0308" (Combining Diaeresis). (I think that somewhere
along the way from the Forth system to the xterm through copying and
pasting into Emacs the second character has become precomposed, but
that's probably just as well, so you can see what I see.)

If I replace the third code point with an "e", I get "Koenig". So by
overwriting one code point, I insert a character into the string.

If instead I replace the second code point with a "\u0316" (Combining
Grave Accent Below):

"K\u0316\u0308nig" cr type

I get this (which looks as expected in my xterm, but not in Emacs):

K̖̈nig

The first character is now a K with a diaeresis above and a grave
accent below, and there are now a total of 4 characters, but still 6
code points in the string; the second character has been deleted by
this code-point replacement.

Back to replacing characters instead of overwriting code points: If
you want to replace the second character, you need to replace two
code points; if the replacement character has only one code point or
more than two, you need to move the remaining three characters. You
have this problem whether the string is represented as UTF-32 or
UTF-8.

>I assume you mean "there is no need to do it"..

That, too. That is the "it isn't" part of the statement.
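
The two replacements described above can be reproduced with a rough
Python sketch. The helper names count_characters and
replace_code_point are hypothetical, and counting "characters" as the
code points with canonical combining class 0 is only an approximation
of Unicode grapheme-cluster segmentation, assumed here because it is
sufficient for these strings.

```python
import unicodedata

def count_characters(s):
    """Approximate user-perceived characters: count the code points that
    are not combining marks (canonical combining class 0). This is a
    simplification of real grapheme-cluster segmentation."""
    return sum(1 for cp in s if unicodedata.combining(cp) == 0)

def replace_code_point(s, index, new_cp):
    """Overwrite exactly one code point, as a store into a UTF-32 array would."""
    cps = list(s)          # one list element per code point
    cps[index] = new_cp
    return "".join(cps)

original = "Ko\u0308nig"   # K, o, U+0308 COMBINING DIAERESIS, n, i, g

# Third code point (the diaeresis) replaced by "e": a character is inserted.
inserted = replace_code_point(original, 2, "e")

# Second code point ("o") replaced by U+0316: a character is deleted.
deleted = replace_code_point(original, 1, "\u0316")

for s in (original, inserted, deleted):
    print(len(s), "code points,",
          count_characters(s), "characters,",
          len(s.encode("utf-8")), "UTF-8 bytes")
```

All three strings have 6 code points, but under this approximation
they have 5, 6 and 4 characters respectively, and occupy 7, 6 and 8
bytes in UTF-8.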

>>If you stick with UTF-8
>> and use byte lengths and byte indexes, you can do almost everything as
>> well or better (with less complication and more efficiently) as by
>> converting to UTF-32 and back.
>
>Assume you're implementing a language which has a function of
>setting an individual character in a string.

That's a design mistake in the language, and I know of no language
that has this misfeature.

Instead, what we see is one language (Python3) that has an even worse
misfeature: You can set an individual code point in a string; see
above for the things you get when you overwrite code points.

But why would one want to set individual code points? What about
setting individual code units (in the case of UTF-8, the code unit is
a byte) or bits? If you think that replacing parts of a character is
a feature, why not go all the way?

>How would you implement it? Run through the string?

You have to do that anyway, because of combining characters.

>Would you then also
>store additional information somewhere so that the next character
>that the user sets does not need to do it again?

Probably not. I would discourage the users from using this misfeature
and steer them to better alternatives.

Alternatively, if it's a really important misfeature, I would use an
editing-friendly string representation (maybe a piece table or rope)
and/or maybe do some Python3-style craziness and have the string be
represented by an array of characters, where every character is
represented by a pointer into a UTF-8 sequence.

In the case of Python3, the sequence seems to have been that they
started out with the bad idea that indexing a string by code point is
the way to go, and then designed a first implementation catering to
that premise, and published it without reconsidering the premise,
despite the efficiency cost.
And of course it was too inefficient for some use cases, but it was
too late to switch to a more sensible design, so they invented the
more complex, but more efficient (than the first implementation)
PEP 393 implementation.

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>