Article <jwved9t656u.fsf-monnier+comp.arch@gnu.org>

Path: ...!2.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Stefan Monnier <monnier@iro.umontreal.ca>
Newsgroups: comp.arch
Subject: Re: Unicode in strings
Date: Wed, 22 May 2024 15:38:51 -0400
Organization: A noiseless patient Spider
Lines: 66
Message-ID: <jwved9t656u.fsf-monnier+comp.arch@gnu.org>
References: <v0s17o$2okf4$2@dont-email.me>
	<4e0557bec2acda4df76f1ed01ebcbdf6@www.novabbs.org>
	<v1ep4i$1ptf$1@gal.iecc.com> <v1eruj$3o1r8$1@dont-email.me>
	<v1h8l6$1ttd$1@gal.iecc.com> <v1kifk$17qh0$1@dont-email.me>
	<2024May10.182047@mips.complang.tuwien.ac.at>
	<v1ns43$2260p$1@dont-email.me>
	<2024May11.173149@mips.complang.tuwien.ac.at>
	<v1preb$2jn47$1@dont-email.me>
	<2024May12.110053@mips.complang.tuwien.ac.at>
	<jwvjzjwid50.fsf-monnier+comp.arch@gnu.org>
	<2024May18.072920@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 22 May 2024 21:38:53 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="3d9aa23582dfc5c5d2d96c771ad735d2";
	logging-data="1408206"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18ZxXq3ZpwNYkNsX7Y/K2jLc+UhAq+b9Jo="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:ELFcq/WqkD8fd1/LD3Ur8DURAy4=
	sha1:6LaYL8pGJZxTZwei+G3kUnbnEKQ=
Bytes: 4499

>>>> Assume you're implementing a language which has a function of setting
>>>> an individual character in a string.
>>> That's a design mistake in the language, and I know no language that
>>> has this misfeature.
>> I suspect "individual character" meant "code point" above.
> I meant character, not code point, as should have become clear from
> the following.  I think that Thomas Koenig meant "character", too, but
> he may have been unaware of the difference between "character" and
> "Unicode code point".

I don't know of any language (or even library) that supports the notion
of "character" for Unicode strings.  🙁
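
To make the distinction concrete, here is a rough Python sketch of how
code units, code points, and "characters" can differ for the same
string.  The helper `rough_clusters` is a hypothetical name of mine, and
it only attaches combining marks to the preceding base code point; it is
an approximation, not the full grapheme-cluster segmentation of UAX #29
(it ignores ZWJ sequences, Hangul jamo, regional indicators, and so on).

```python
import unicodedata


def rough_clusters(s):
    """Group each base code point with the combining marks that follow it.

    Hypothetical helper for illustration: an approximation only,
    not the UAX #29 grapheme cluster algorithm.
    """
    clusters = []
    for ch in s:
        # A nonzero canonical combining class marks a combining character.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# 'o' followed by U+0308 COMBINING DIAERESIS: one "character", two code points.
word = "Ko\u0308nig"

code_units = len(word.encode("utf-8"))   # UTF-8 bytes
code_points = len(word)                  # Python str is indexed by code point
characters = len(rough_clusters(word))   # approximate user-perceived characters

print(code_units, code_points, characters)
```

The three counts differ, which is why "set the N-th character" is
ambiguous until you say which unit N is counted in.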

> OTOH, most code can be implemented fine as working on strings, without
> knowing how many characters there are in the string (and it then does
> not need to know about code points, either).

Indeed, most operations on strings are conversion of things to strings,
concatenation of strings, search (typically for a substring or a regexp),
extraction of substring where the boundaries result from an earlier
search, and parsing (which at bottom often relies on some sort of
regexp or equivalent system).

All of those work just fine on a UTF-8 sequence of bytes.
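
As a small sketch of that claim, the following Python snippet does
concatenation, substring search, extraction, and a regexp match directly
on `bytes` holding UTF-8, without ever counting code points.  The sample
strings are made up for illustration.  It relies on UTF-8 being
self-synchronizing: a valid UTF-8 needle can only match at code point
boundaries of a valid UTF-8 haystack.

```python
import re

text = "naïve café: 42 €".encode("utf-8")

# Concatenating two valid UTF-8 sequences yields valid UTF-8.
joined = text + " und Köln".encode("utf-8")

# Substring search returns a byte offset, not a code point index.
needle = "café".encode("utf-8")
start = text.find(needle)
end = start + len(needle)

# Extraction uses the boundaries from the search, so it never
# cuts through the middle of a multi-byte sequence.
piece = text[start:end]

# A bytes regexp with an ASCII-only pattern works unchanged.
match = re.search(rb"\d+", text)

print(piece.decode("utf-8"), match.group().decode("ascii"))
```

The caveat is that a byte-level regexp only behaves like this when the
pattern itself is made of whole UTF-8 sequences; something like `.` or a
character class with non-ASCII members would match bytes, not code
points.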

>> Emacs Lisp has this misfeature as well (and so does Common Lisp).  🙁
>> It's really hard to get rid of it, even though it's used *very* rarely.
>> In ELisp, strings are represented internally as utf-8 (tho it pretends
>> to be an array of code points), so an assignment that replaces a single
>> char can require reallocating the array!
> One way forward might be to also provide a string-oriented API with
> byte (code unit) indices, and recommend that people use that instead
> of the inefficient code-point-indexed API.

I think the long term solution for ELisp will be to declare strings as
basically immutable.
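
To illustrate why in-place assignment is awkward on a UTF-8
representation, here is a Python sketch (not how Emacs actually
implements it).  The functions `code_point_span` and `set_code_point`
are hypothetical names; the point is only that replacing one code point
can change the byte length of the buffer, and that finding the N-th code
point needs a scan.

```python
def code_point_span(buf, index):
    """Return (start, end) byte offsets of the index-th code point.

    Assumes buf holds valid UTF-8.  Needs a linear scan, since
    code points have variable width.
    """
    # Every byte that is not a continuation byte (0b10xxxxxx) starts a code point.
    starts = [i for i, b in enumerate(buf) if b & 0xC0 != 0x80]
    starts.append(len(buf))
    return starts[index], starts[index + 1]


def set_code_point(buf, index, new_char):
    """Replace the index-th code point of a UTF-8 bytearray in place."""
    start, end = code_point_span(buf, index)
    # Slice assignment resizes the bytearray when the widths differ,
    # which is the reallocation mentioned above.
    buf[start:end] = new_char.encode("utf-8")


buf = bytearray("a+b".encode("utf-8"))
before = len(buf)
set_code_point(buf, 1, "\u253c")  # BOX DRAWINGS LIGHT VERTICAL AND HORIZONTAL
after = len(buf)

print(before, after, buf.decode("utf-8"))
```

Replacing the one-byte `+` with a three-byte box-drawing code point
grows the buffer, so every byte after it has to move.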

>> Because you know your string only contains "characters" made of a single
>> code point?
>
> This incorrect "knowledge" may be the reason why Emacs 27.1 displays
>
> K̖̈nig
>
> as if the first three-code-point character actually was three characters.

No, the above seems like a problem in the redisplay code, and that code
is quite aware of combining characters and stuff.  You're probably
seeing simply a missing rule to allow composition/shaping of your word.
(the composition/shaping library operates on whole strings at a time,
but Emacs tends to be quite conservative about the string-chunks it
sends to that library).

I recommend you `M-x report-emacs-bug`.  The fix should be fairly simple.

>> E.g. your string contains the representation of the border of a table
>> (to be displayed in a tty), and you want to "move" the `+` of a column
>> separator (or a prettier version that takes advantage of the wider
>> choice offered by Unicode).
> These kinds of things involve additional complications.

Very much so, indeed.  It usually breaks down in many different ways
because of the common-but-not-guaranteed assumptions.


        Stefan