Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: python text, Byte Addressability And Beyond
Date: Sun, 12 May 2024 16:12:26 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 104
Message-ID: <2024May12.181226@mips.complang.tuwien.ac.at>
References: <v0s17o$2okf4$2@dont-email.me> <2024May10.182047@mips.complang.tuwien.ac.at> <v1ns43$2260p$1@dont-email.me> <2024May11.173149@mips.complang.tuwien.ac.at> <v1ossl$1ps0$1@gal.iecc.com> <2024May12.074045@mips.complang.tuwien.ac.at> <v1q840$2mk58$1@dont-email.me>
Injection-Date: Sun, 12 May 2024 19:48:25 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="fa63f79be1b668e15e3900e1b4d19fc8";
	logging-data="3011829"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX194lEXGDE4QeHdhBHZm+pfE"
Cancel-Lock: sha1:6Mm2rLi2TDRf/mEy/XGLfWlz+/U=
X-newsreader: xrn 10.11
Bytes: 6616

David Brown <david.brown@hesbynett.no> writes:
>On 12/05/2024 07:40, Anton Ertl wrote:
>> John Levine <johnl@taugh.com> writes:
>>> Python3 has a complex internal string format that stores each string
>>> as 1, 2, or 4 byte values, depending on what the contents of the
>>> string are, so ASCII is one byte, UCS-2 is two bytes, and strings that
>>> contain code points beyond UCS-2 are four bytes. It's not clear how
>>> hard they try to shrink stuff down when taking substrings.
>>>
>>> https://peps.python.org/pep-0393/
>> 
>> This is a nice demonstration of the unnecessary complexity that the
>> codepoint mistake leads to.
>
>A lot of this is, I suspect, for historical reasons.  When Python was 
>young, most software and languages used either plain ASCII or a mess of 
>code pages for 8-bit encodings (or an even bigger mess of 16-bit 
>encodings for CJK languages).  Unicode was the new hope for a unifying 
>16-bit system that would work for all characters in all languages.  So 
>Python - like Java, Windows NT, QT, and some other systems of that era -
>chose UCS-2 as the modern, international and future-proof solution to
>strings and characters.
>
>It turns out that UCS-2 was not enough, and these have all been 
>suffering from mixed APIs ever since.

That's certainly true for Java (first released 1995), Windows NT
(first released 1993), and QT (first released 1995).

At that time Unicode 1.x (released 1991) was supposed to be the wave
of the future, and it offered the (to Westerners) familiar environment
of character = code unit (= 16 bits), ignoring the experience of the
East Asians with ASCII-compatible variable-width encodings.  For new
systems the 16-bit code unit seemed to be the way to go, and the mixed
APIs stem directly from that: the designers imagined that legacy
software using 8-bit code units would be rewritten to use 16-bit code
units after a while, but of course the new systems had to run legacy
software in the meantime, so they also provided a legacy API.

It did not work out.  Software using 8-bit code units was (for the
most part) not converted to use 16-bit code units, and 16 bits was
found to be not enough for a universal character set.

In the meantime, the Silicon Valley based Unicode effort was merged
with the ISO-based Universal Coded Character Set (UCS) effort (the
name Unicode was kept) and we got Unicode 2.0 in 1996.  Now if code
unit = character had been as important as was thought in
Silicon Valley, the logical step would have been to go for 32-bit
characters.  But the UCS effort had brought in the experience with
ASCII-compatible variable-width encodings, and so we got not just
fixed-width UTF-32, but also variable-width ASCII-compatible UTF-8 and
variable-width UTF-16 (to be backwards compatible with the
systems/interfaces that were designed for 16-bit code units in the
early 1990s).
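
To put concrete numbers on that, here is a small sketch (in Python,
simply because that is the language discussed below; the sample
string is an arbitrary choice) counting the code units that the same
text needs in the three encodings:

  # Code-unit counts for the same text in the three encodings
  s = "A\u00e9\u4e2d\U0001f600"  # ASCII, Latin, CJK, non-BMP
  print(len(s.encode("utf-8")))           # 10 bytes: 1+2+3+4
  print(len(s.encode("utf-16-le")) // 2)  # 5 units: surrogate pair
  print(len(s.encode("utf-32-le")) // 4)  # 4 units: one per code point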

And, lo and behold, the systems that had adopted 16-bit code units
kept the 16-bit code units and accepted that characters were now
variable-width, because variable width is obviously easier to add to
an existing code base than switching the code unit size.


Plus at some point (not sure when) they decided that characters have
to be composable, so even an encoding like UTF-32 with 32-bit code
units would not be enough for a character.  A 32-bit code unit would
only be a code point.
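
To make that concrete, a small Python sketch (the character is an
arbitrary choice): the same user-perceived character can be either
one precomposed code point or a base letter plus a combining mark,
and the two spellings only compare equal after normalization:

  import unicodedata
  e1 = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
  e2 = "e\u0301"   # U+0065 plus U+0301 COMBINING ACUTE ACCENT
  print(len(e1), len(e2))  # 1 2: code points, not characters
  print(e1 == e2)          # False
  print(unicodedata.normalize("NFC", e2) == e1)  # True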

At that point, all encodings are variable-width, so why not just use
UTF-8?  And that's what everyone who had not introduced a new platform
between 1991 and 1996 did.  E.g., that's what we see in Unix (from
around 1970) and in Rust (started 2006, first release 2015).

Except Python3.  I am not familiar with Python, but from the
discussions I have read my impression is: Python2 (released 2000)
supported strings of bytes, and people put UTF-8 in there and worked
with that.  Python3 (released 2008) was supposed to be a cleanup, and
instead of refining the code-unit-based approach of Python2 they
introduced a code-point-based approach, which supported fast indexing
of code points, a worthless feature.  And they found out how hard it
is to migrate a code base.
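
For what it's worth, the difference is easy to show in a few lines of
Python3 itself (with a bytes object standing in for a Python2 string;
the sample text is an arbitrary choice):

  s = "a\u00e9\U0001f600"         # Python3 str: sequence of code points
  b = s.encode("utf-8")           # Python2-style byte string holding UTF-8
  print(len(s), s[1])             # 3 é   (O(1) code-point indexing)
  print(len(b), b[1:3].decode())  # 7 é   (byte indexing, decode as needed)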

So whatever the reason for the code point mistake in Python3 was, that
mistake was made long after Unicode 2.0 was introduced in 1996 and the
success of UTF-8 made it clear that variable-width encodings work out
fine.

For comparison: The 1994 Forth standard was designed to support 16-bit
characters, and one implementation, JaxForth, actually demonstrated
that.  Most Forth implementations kept 8-bit characters for the time
being, many assuming that they would have to do something like mixed
APIs at some point.  But when we actually thought and worked on the
issue in 2004/2005, we were delighted to discover that UTF-8 works
very well in the existing code base (of our Forth system and others),
and that only a few places need changes; the additional
words proposed in <http://www.euroforth.org/ef05/ertl-paysan05.pdf>
have mostly been standardized in Forth-2012, but are actually rarely
used, because ordinary string words don't care whether a string is
ASCII or UTF-8.  Anyway, this demonstrates that by 2005 it was clear
that variable-width encodings are very workable, so the Python3
mistake cannot be explained by its 2008 release date.
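
The reason is easy to demonstrate outside Forth as well; a few lines
of Python working on byte strings (an arbitrary example, not our
Forth code) show that searching, splitting, and concatenating behave
on UTF-8 exactly as on ASCII, because no byte of a multi-byte
sequence coincides with an ASCII byte:

  # Byte-oriented string operations work unchanged on UTF-8 data
  haystack = "Grüße aus Wien, servus".encode("utf-8")
  needle = "Wien".encode("utf-8")
  print(haystack.find(needle))            # byte offset of "Wien"
  print(haystack.split(b", "))            # split at an ASCII separator
  print((needle + b"!").decode("utf-8"))  # concatenation stays valid UTF-8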

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>