From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: python text, Byte Addressability And Beyond
Date: Mon, 27 May 2024 06:20:33 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024May27.082033@mips.complang.tuwien.ac.at>
References: <2024May10.182047@mips.complang.tuwien.ac.at> <2024May11.173149@mips.complang.tuwien.ac.at> <2024May12.074045@mips.complang.tuwien.ac.at>

Lawrence D'Oliveiro writes:
>On Sun, 12 May 2024 05:40:45 GMT, Anton Ertl wrote:
>
>> This is a nice demonstration of the unnecessary complexity that the
>> codepoint mistake leads to. ...
>>
>> But if they had decided to just store the data as UTF-8 and use byte
>> indexes and lengths in their API, and adjusted the rest of their API
>> accordingly, they could have avoided this complexity and
>> inefficiency ...
>
>But UTF-8 is just a representation of code points, not characters. So I
>don’t understand why one way leads to “unnecessary complexity” and the
>other way does not.

In UTF-32 a character is a sequence of code points. In UTF-8 it is a
sequence of code units. In either case, if you have to deal with
characters, you have to deal with sequences (and most of the code does
not have to deal with characters and even less code has to deal with
code points). So converting to UTF-32 buys you nothing and is
unnecessary complexity.

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup,
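
To illustrate the point about sequences, here is a minimal Python sketch.
The sample strings and variable names are my own, picked only for
illustration. It shows one user-perceived character ("e" plus a combining
acute accent) taking more than one unit in both UTF-32 and UTF-8, and a
search-and-slice done purely with byte indexes on UTF-8 data, without
looking at code points at all.

```python
# One user-perceived character: "e" followed by U+0301 COMBINING ACUTE ACCENT.
char = "e\u0301"

# In UTF-32 the character is a sequence of code points (here: two).
code_points = [ord(c) for c in char]
utf32_unit_count = len(char.encode("utf-32-le")) // 4

# In UTF-8 the same character is a sequence of code units (here: three bytes).
utf8_units = list(char.encode("utf-8"))

# Byte-index API on UTF-8 data: search and slice without decoding.
text = "cafe\u0301 au lait".encode("utf-8")
start = text.find(b" au")   # a byte index, not a code point index
head = text[:start]         # cuts at a boundary found by byte search

print(code_points, utf32_unit_count)
print(utf8_units)
print(start, head.decode("utf-8"))
```

Slicing at the byte index returned by the search is safe here because
UTF-8 is self-synchronizing: a match for a valid UTF-8 needle inside valid
UTF-8 data always starts and ends on a code point boundary. Whether the cut
also falls on a character boundary is a separate question, and it is
equally open in UTF-32.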