Path: ...!2.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Stefan Monnier <monnier@iro.umontreal.ca>
Newsgroups: comp.arch
Subject: Re: Unicode in strings
Date: Wed, 22 May 2024 15:38:51 -0400
Organization: A noiseless patient Spider
Lines: 66
Message-ID: <jwved9t656u.fsf-monnier+comp.arch@gnu.org>
References: <v0s17o$2okf4$2@dont-email.me>
 <4e0557bec2acda4df76f1ed01ebcbdf6@www.novabbs.org>
 <v1ep4i$1ptf$1@gal.iecc.com> <v1eruj$3o1r8$1@dont-email.me>
 <v1h8l6$1ttd$1@gal.iecc.com> <v1kifk$17qh0$1@dont-email.me>
 <2024May10.182047@mips.complang.tuwien.ac.at>
 <v1ns43$2260p$1@dont-email.me>
 <2024May11.173149@mips.complang.tuwien.ac.at>
 <v1preb$2jn47$1@dont-email.me>
 <2024May12.110053@mips.complang.tuwien.ac.at>
 <jwvjzjwid50.fsf-monnier+comp.arch@gnu.org>
 <2024May18.072920@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 22 May 2024 21:38:53 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="3d9aa23582dfc5c5d2d96c771ad735d2";
 logging-data="1408206"; mail-complaints-to="abuse@eternal-september.org";
 posting-account="U2FsdGVkX18ZxXq3ZpwNYkNsX7Y/K2jLc+UhAq+b9Jo="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:ELFcq/WqkD8fd1/LD3Ur8DURAy4= sha1:6LaYL8pGJZxTZwei+G3kUnbnEKQ=
Bytes: 4499

>>>> Assume you're implementing a language which has a function of setting
>>>> an individual character in a string.
>>> That's a design mistake in the language, and I know no language that
>>> has this misfeature.
>> I suspect "individual character" meant "code point" above.
> I meant character, not code point, as should have become clear from
> the following. I think that Thomas Koenig meant "character", too, but
> he may have been unaware of the difference between "character" and
> "Unicode code point".

I don't know of any language (or even library) that supports the notion
of "character" for Unicode strings. 🙁

> OTOH, most code can be implemented fine as working on strings, without
> knowing how many characters there are in the string (and it then does
> not need to know about code points, either).

Indeed, most operations on strings are conversion of things to strings,
concatenation of strings, search (typically for a substring or a regexp),
extraction of substring where the boundaries result from an earlier
search, and parsing (which at the bottom relies often on some sort of
regexp or equivalent system).
All of those work just fine on a UTF-8 sequence of bytes.

>> Emacs Lisp has this misfeature as well (and so does Common Lisp). 🙁
>> It's really hard to get rid of it, even though it's used *very* rarely.
>> In ELisp, strings are represented internally as utf-8 (tho it pretends
>> to be an array of code points), so an assignment that replaces a single
>> char can require reallocating the array!
> One way forward might be to also provide a string-oriented API with
> byte (code unit) indices, and recommend that people use that instead
> of the inefficient code-point-indexed API.

I think the long term solution for ELisp will be to declare strings as
basically immutable.

>> Because you know your string only contains "characters" made of a single
>> code point?
>
> This incorrect "knowledge" may be the reason why Emacs 27.1 displays
>
> K̖̈nig
>
> as if the first three-code-point character actually was three characters.

No, the above seems like a problem in the redisplay code, and that code
is quite aware of combining characters and stuff. You're probably
seeing simply a missing rule to allow composition/shaping of your word.
(the composition/shaping library operates on whole strings at a time,
but Emacs tends to be quite conservative about the string-chunks it
sends to that library).

I recommend you `M-x report-emacs-bug`. The fix should be fairly simple.

>> E.g.
>> your string contains the representation of the border of a table
>> (to be displayed in a tty), and you want to "move" the `+` of a column
>> separator (or a prettier version that takes advantage of the wider
>> choice offered by Unicode).
> These kinds of things involve additional complications.

Very much so, indeed. It usually breaks down in many different ways
because of the common-but-not-guaranteed assumptions.


        Stefan
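
For readers who want to poke at the distinctions discussed above, here is a minimal Python sketch (not from the original post, and not how Emacs does any of this). It shows three things: a string whose code point count, UTF-8 byte count, and user-perceived character count all differ; substring search and extraction done directly on UTF-8 bytes; and how replacing one code point can change the encoded length, which is why an in-place assignment may force a reallocation. The helper names `utf8_len` and `approx_char_count` are made up for illustration, and counting non-combining code points is only a rough stand-in for real grapheme-cluster segmentation.

```python
import unicodedata


def utf8_len(s: str) -> int:
    """Number of UTF-8 code units (bytes) needed to encode s."""
    return len(s.encode("utf-8"))


def approx_char_count(s: str) -> int:
    """Rough count of user-perceived characters: skip combining marks.

    Only an approximation (assumption for this sketch); full
    grapheme-cluster segmentation has more rules than this.
    """
    return sum(1 for c in s if unicodedata.combining(c) == 0)


# One base letter followed by two combining marks, then "nig".
word = "K\u0316\u0308nig"
print(len(word), utf8_len(word), approx_char_count(word))

# Search and substring extraction work directly on the UTF-8 bytes:
# the byte offset returned by the search is a valid slice boundary.
haystack = "na\u00efve caf\u00e9".encode("utf-8")
start = haystack.find("caf\u00e9".encode("utf-8"))
print(start, haystack[start:].decode("utf-8"))

# Replacing a single code point can change the encoded length, so a
# "set character at index" operation on a UTF-8 buffer may need to
# move or reallocate the rest of the string.
border = "a+b"
fancy = border.replace("+", "\u253c")  # a box-drawing cross
print(utf8_len(border), utf8_len(fancy))
```

Note that the byte offset found by the search differs from the code point index of the same match, yet the code never needs either count to do its job.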