Path: ...!weretis.net!feeder9.news.weretis.net!i2pn.org!i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Unicode in strings
Date: Fri, 31 May 2024 17:21:53 +0000
Organization: Rocksolid Light
Message-ID: <5db8e3e2060c479d61d05cfad35d7701@www.novabbs.org>
References: <v0s17o$2okf4$2@dont-email.me> <2024May11.173149@mips.complang.tuwien.ac.at> <v1preb$2jn47$1@dont-email.me> <2024May12.110053@mips.complang.tuwien.ac.at> <jwvjzjwid50.fsf-monnier+comp.arch@gnu.org> <2024May18.072920@mips.complang.tuwien.ac.at> <jwved9t656u.fsf-monnier+comp.arch@gnu.org> <2024May25.174807@mips.complang.tuwien.ac.at> <jwvy17ty8v7.fsf-monnier+comp.arch@gnu.org> <2024May29.085955@mips.complang.tuwien.ac.at> <jwv5xuwwuqe.fsf-monnier+comp.arch@gnu.org> <2024May30.182546@mips.complang.tuwien.ac.at> <v3d0hj$2amga$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit

BGB wrote:

> On 5/30/2024 11:25 AM, Anton Ertl wrote:
>> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>>> I'm not sure the codepoint-oriented API is the best option, but it's
>>> not completely clear what *is* the best option.  You mention a
>>> byte-oriented API and you might be right that it's a better option,
>>> but in the case of Emacs that's what we used in Emacs-20.1 but it
>>> worked really poorly because of backward compatibility issues.  I
>>> think if we started from scratch now (i.e. without having to contend
>>> with backward compatibility, and with a better understanding of
>>> Unicode (which barely existed back then)) it might work better,
>>> indeed, but that's not been an option
>> 
>> Plus, editors are among the very few uses where you have to deal with
>> individual characters, so the "treat it as opaque string" approach
>> that works so well for most other code is not good enough there.  The
>> command-line editor of Gforth is one case where we use the xchar words
>> (those for dealing with code points of UTF-8).
>> 

> Yeah.

> For text editors, this is one of the few cases where it makes sense to
> use 32- or 64-bit characters (say, combining the 'character' with some
> additional metadata such as formatting).

> Though, one thing that makes sense for text editors is if only the
> "currently being edited" lines are fully unpacked, whereas the others
> can remain in a more compact form (such as UTF-8), and are then
> unpacked as they come into view (say, treating the editor window as a
> 32-entry modulo cache or similar).
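
A rough sketch of the unpack-on-view idea quoted above, in Python for
brevity. Everything named here (pack_cell, UnpackCache, the choice of
putting the code point in the low 32 bits and the formatting bits above
it) is an assumption made for illustration, not something BGB specified.
Lines stay as UTF-8 bytes until requested; a request unpacks the line
into 64-bit cells and stores it in slot (line number mod 32).

```python
CACHE_SLOTS = 32  # the "32-entry modulo cache" from the quoted text


def pack_cell(codepoint, attrs=0):
    """Combine a code point (low 32 bits) with metadata (bits above that)."""
    return (attrs << 32) | codepoint


def cell_codepoint(cell):
    return cell & 0xFFFFFFFF


def cell_attrs(cell):
    return cell >> 32


class UnpackCache:
    """Keeps lines as UTF-8 bytes; unpacks a line only when it is requested."""

    def __init__(self, packed_lines):
        self.packed_lines = packed_lines    # list of bytes objects (UTF-8)
        self.slots = [None] * CACHE_SLOTS   # each entry: (line_no, cells)
        self.misses = 0

    def get(self, line_no):
        slot = line_no % CACHE_SLOTS        # direct-mapped: one candidate slot
        entry = self.slots[slot]
        if entry is not None and entry[0] == line_no:
            return entry[1]                 # hit: already unpacked
        self.misses += 1
        text = self.packed_lines[line_no].decode("utf-8")
        cells = [pack_cell(ord(ch)) for ch in text]
        self.slots[slot] = (line_no, cells)  # evicts whatever was there
        return cells
```

With a direct-mapped cache like this, two lines whose numbers differ by a
multiple of 32 evict each other, which is harmless as long as the window
shows at most 32 lines. A real editor would also have to write a modified
line back to its packed form before evicting it; the sketch omits that.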

> For the rest, one can have, say, a big buffer, with an array of 
> lines giving the location and size of the line's text in the buffer.

In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif,
...} along with text in different fonts and with different backgrounds
on a per-character basis.

> If a line is modified, it can be reallocated at the end of the buffer, 
> and if the buffer gets full, it can be "repacked" and/or expanded as 
> needed. When written back to a file, the buffer lines can be emitted 
> in-order to the text file.
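
For concreteness, the quoted buffer scheme could look roughly like the
following Python sketch. The class and method names (LineBuffer, replace,
repack, to_bytes) are hypothetical, and since a bytearray grows on demand
the "buffer gets full" condition is left to the caller: repack() is just
an explicit call here. It handles plain text lines only.

```python
class LineBuffer:
    """One big byte buffer plus an array of (offset, size) per line."""

    def __init__(self, data):
        self.buf = bytearray()
        self.lines = []                      # (offset, size), in file order
        for line in data.split(b"\n"):
            self.lines.append(self._append(line))

    def _append(self, line):
        """Store the bytes at the end of the buffer; return their location."""
        offset = len(self.buf)
        self.buf += line
        return (offset, len(line))

    def get(self, n):
        offset, size = self.lines[n]
        return bytes(self.buf[offset:offset + size])

    def replace(self, n, new_line):
        # A modified line is reallocated at the end of the buffer;
        # the old bytes stay behind as garbage until the next repack.
        self.lines[n] = self._append(new_line)

    def repack(self):
        """Copy live lines into a fresh buffer, dropping the garbage."""
        old_lines = [self.get(i) for i in range(len(self.lines))]
        self.buf = bytearray()
        self.lines = [self._append(line) for line in old_lines]

    def to_bytes(self):
        """Emit the lines in file order, as when writing back to disk."""
        return b"\n".join(self.get(i) for i in range(len(self.lines)))
```

Usage would be along the lines of lb = LineBuffer(file_contents),
lb.replace(1, b"new text"), then lb.to_bytes() at save time. Inserting or
deleting whole lines only touches the line array, not the byte buffer.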

> Not entirely sure how other text editors manage things here, not really
> looked into it.

If you think about it with the above features, you quickly realize it
is not just text anymore.


>> - anton