From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: python text, Byte Addressability And Beyond
Date: Sun, 12 May 2024 16:12:26 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024May12.181226@mips.complang.tuwien.ac.at>
References: <2024May10.182047@mips.complang.tuwien.ac.at> <2024May11.173149@mips.complang.tuwien.ac.at> <2024May12.074045@mips.complang.tuwien.ac.at>

David Brown writes:
>On 12/05/2024 07:40, Anton Ertl wrote:
>> John Levine writes:
>>> Python3 has a complex internal string format that stores each string
>>> as 1, 2, or 4 byte values, depending on what the contents of the
>>> string are, so ASCII is one byte, UCS-2 is two bytes, and strings that
>>> contain code points beyond UCS-2 are four bytes.  It's not clear how
>>> hard they try to shrink stuff down when taking substrings.
>>>
>>> https://peps.python.org/pep-0393/
>>
>> This is a nice demonstration of the unnecessary complexity that the
>> codepoint mistake leads to.
>
>A lot of this is, I suspect, for historical reasons.  When Python was
>young, most software and languages used either plain ASCII or a mess of
>code pages for 8-bit encodings (or an even bigger mess of 16-bit
>encodings for CJK languages).  Unicode was the new hope for a unifying
>16-bit system that would work for all characters in all languages.
>So Python - like Java, Windows NT, QT, and some other systems of that era,
>chose UCS-2 as the modern, international and future-proof solution to
>strings and characters.
>
>It turns out that UCS-2 was not enough, and these have all been
>suffering from mixed APIs ever since.

That's certainly true for Java (first released 1995), Windows NT (first
released 1993) and Qt (first released 1995).  At that time Unicode 1.x
(released 1991) was supposed to be the wave of the future, and it
offered the (to Westerners) familiar environment of character = code
unit (= 16 bits), ignoring the experience of the East Asians with
ASCII-compatible variable-width encodings.  For new systems the 16-bit
code unit seemed to be the way to go, and the mixed APIs stem directly
from that: the designers imagined that legacy software using 8-bit code
units would be rewritten to use 16-bit code units after a while.  But
of course the new system had to run legacy software, so it also
provided a legacy API.

It did not work out.  Software using 8-bit code units was (for the
most part) not converted to use 16-bit code units, and 16 bits turned
out not to be enough for a universal character set.

In the meantime, the Silicon Valley based Unicode effort was merged
with the ISO-based Universal Coded Character Set (UCS) effort (the
name Unicode was kept) and we got Unicode 2.0 in 1996.  Now if code
unit = character had been as important as was thought in Silicon
Valley, the logical step would have been to go for 32-bit characters.
But the UCS effort had brought in the experience with ASCII-compatible
variable-width encodings, and so we got not just fixed-width UTF-32,
but also variable-width ASCII-compatible UTF-8 and variable-width
UTF-16 (to be backwards compatible with the systems/interfaces that
were designed for 16-bit code units in the early 1990s).
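
A minimal sketch of the difference between these three encoding forms,
assuming Python 3.  The helper name code_units and the sample
characters are illustrative choices, not taken from any standard; the
sketch only counts how many code units a single code point needs in
each form:

```python
def code_units(text):
    """Count the code units needed for text in UTF-8, UTF-16 and UTF-32.

    code_units is a hypothetical helper name used only for this sketch.
    The "-le" codec variants are used so that no byte order mark is
    emitted and counted.
    """
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit code units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# Sample code points, chosen for illustration only.
samples = {
    "LATIN CAPITAL LETTER A (U+0041)": "\u0041",
    "LATIN SMALL LETTER E WITH ACUTE (U+00E9)": "\u00e9",
    "EURO SIGN (U+20AC)": "\u20ac",
    "MUSICAL SYMBOL G CLEF (U+1D11E)": "\U0001d11e",
}

for name, ch in samples.items():
    print(name, code_units(ch))
```

Only UTF-32 stays at one code unit per code point.  UTF-16 needs two
code units (a surrogate pair) for anything beyond U+FFFF, and UTF-8
needs between one and four.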

And, lo and behold, the systems that had adopted 16-bit code units
kept the 16-bit code units and accepted that characters were now
variable-width, because variable width is obviously easier to add to
an existing code base than switching the code unit size.

Plus at some point (not sure when) they decided that characters have
to be composable, so even an encoding like UTF-32 with 32-bit code
units would not be enough for a character.  A 32-bit code unit would
only be a code point.

At that point, all encodings are variable-width, so why not just use
UTF-8?  And that's what everyone who had not introduced a new platform
between 1991 and 1996 did.  E.g., that's what we see in Unix (from
around 1970) and in Rust (started 2006, first release 2015).

Except Python3.  I am not familiar with Python, but from the
discussions I have read my impression is: Python2 (released 2000)
supported strings of bytes, and people put UTF-8 in there and worked
with that.  Python3 (released 2008) was supposed to be a cleanup, and
instead of refining the code-unit-based approach of Python2 they
introduced a code-point-based approach, which supported fast indexing
of code points, a worthless feature.  And they found out how hard it
is to migrate a code base.

So whatever the reason for the code point mistake in Python3 was, that
mistake was made long after Unicode 2.0 was introduced in 1996 and the
success of UTF-8 made it clear that variable-width encodings work out
fine.

For comparison: The 1994 Forth standard was designed to support 16-bit
characters, and one implementation, JaxForth, actually demonstrated
that.  Most Forth implementations kept 8-bit characters for the time
being, many assuming that they would have to do something like mixed
APIs at some point.
But when we actually thought about and worked on the issue in
2004/2005, we were delighted to discover that UTF-8 works very well in
the existing code base (of our Forth system and others) and there are
only a few places that need changes; the additional words proposed in
have mostly been standardized in Forth-2012, but are actually rarely
used, because ordinary string words don't care whether a string is
ASCII or UTF-8.

Anyway, this demonstrates that by 2005 it was clear that
variable-width encodings are very workable, so the Python3 mistake
cannot be explained with its 2008 release date.

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup,