Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: python text, Byte Addressability And Beyond
Date: Mon, 27 May 2024 06:20:33 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 26
Message-ID: <2024May27.082033@mips.complang.tuwien.ac.at>
References: <v0s17o$2okf4$2@dont-email.me> <2024May10.182047@mips.complang.tuwien.ac.at> <v1ns43$2260p$1@dont-email.me> <2024May11.173149@mips.complang.tuwien.ac.at> <v1ossl$1ps0$1@gal.iecc.com> <2024May12.074045@mips.complang.tuwien.ac.at> <v30mgo$3min8$3@dont-email.me>
Injection-Date: Mon, 27 May 2024 08:24:42 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="ea08719dbf87fa9ea1796ef43ff1a7b7";
	logging-data="4110587"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18TsQSCzvEVtbLLBSED7S/O"
Cancel-Lock: sha1:noIww+MgFmojSAN1XhyKrlgGtFg=
X-newsreader: xrn 10.11
Bytes: 2283

Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>On Sun, 12 May 2024 05:40:45 GMT, Anton Ertl wrote:
>
>> This is a nice demonstration of the unnecessary complexity that the
>> codepoint mistake leads to. ...
>> 
>> But if they had decided to just store the data as UTF-8 and use byte
>> indexes and lengths in their API, and adjusted the rest of their API
>> accordingly, they could have avoided this complexity and
>> inefficiency ...
>
>But UTF-8 is just a representation of code points, not characters. So I 
>don’t understand why one way leads to “unnecessary complexity” and the 
>other way does not.

In UTF-32, a character is a sequence of code points; in UTF-8, it is a
sequence of code units.  Either way, code that deals with characters
has to deal with sequences (and most code does not have to deal with
characters at all, and even less code has to deal with code points).
So converting to UTF-32 buys you nothing and only adds unnecessary
complexity.
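
A minimal Python sketch of this point, assuming a combining sequence
("e" followed by U+0301) as the example character.  The variable names
and sample strings are illustrative and not taken from the thread; the
second half shows one way byte indexes and lengths could be used
directly on UTF-8 data.

```python
# One user-perceived character: "e" + U+0301 COMBINING ACUTE ACCENT.
s = "e\u0301"

utf8 = s.encode("utf-8")        # code units are bytes
utf32 = s.encode("utf-32-le")   # code units are 4-byte words, no BOM

code_points = len(s)            # Python str is indexed by code point
utf8_units = len(utf8)
utf32_units = len(utf32) // 4

# In both encodings the character is a sequence longer than one unit,
# so character-level code has to handle sequences either way.
print(code_points, utf8_units, utf32_units)

# Working with byte indexes on UTF-8 data (assumed example text):
# searching and slicing need no decoding to code points.
text = "na\u00efve caf\u00e9".encode("utf-8")
needle = "caf\u00e9".encode("utf-8")

byte_index = text.find(needle)                  # offset in bytes
match = text[byte_index:byte_index + len(needle)]
print(byte_index, match.decode("utf-8"))
```

The byte offset differs from the code point offset of the same match
(the precomposed "ï" occupies two bytes), but as long as the API
consistently uses byte indexes and lengths, that difference never has
to be computed.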

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>