From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Unicode in strings
Date: Fri, 31 May 2024 17:21:53 +0000
Organization: Rocksolid Light
Message-ID: <5db8e3e2060c479d61d05cfad35d7701@www.novabbs.org>

BGB wrote:

> On 5/30/2024 11:25 AM, Anton Ertl wrote:
>> Stefan Monnier writes:
>>> I'm not sure the codepoint-oriented API is the best option, but it's not
>>> completely clear what *is* the best option. You mention a byte-oriented
>>> API and you might be right that it's a better option, but in the case of
>>> Emacs that's what we used in Emacs-20.1 but it worked really poorly
>>> because of backward compatibility issues. I think if we started from
>>> scratch now (i.e. without having to contend with backward compatibility,
>>> and with a better understanding of Unicode (which barely existed back
>>> then)) it might work better, indeed, but that's not been an option
>>
>> Plus, editors are among the very few uses where you have to deal with
>> individual characters, so the "treat it as opaque string" approach
>> that works so well for most other code is not good enough there.
>> The command-line editor of Gforth is one case where we use the xchar
>> words (those for dealing with code points of UTF-8).
>>
> Yeah.
>
> For text editors, this is one of the few cases it makes sense to use 32
> or 64 bit characters (say, combining the 'character' with some
> additional metadata such as formatting).
>
> Though, one thing that makes sense for text editors is if only the
> "currently being edited" lines are fully unpacked, whereas the others
> can remain in a more compact form (such as UTF-8), and are then unpacked
> as they come into view (say, treating the editor window as a 32-entry
> modulo cache or similar).
>
> For the rest, say, one can have, say, a big buffer, with an array of
> lines giving the location and size of the line's text in the buffer.

In a modern text editor, one can paste in {*.xls tables, *.jpg, *.gif,
...} along with text from different fonts and different backgrounds on
a per character basis.

> If a line is modified, it can be reallocated at the end of the buffer,
> and if the buffer gets full, it can be "repacked" and/or expanded as
> needed. When written back to a file, the buffer lines can be emitted
> in-order to the text file.
>
> Not entirely sure how other text editors manage things here, not really
> looked into it.

If you think about it with the above features, you quickly realize it
is not just text anymore.

>> - anton
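
For concreteness, the line-table scheme BGB describes above (one big
UTF-8 buffer, an array of per-line offset/size entries, modified lines
reallocated at the end, repack or expand when full) might be sketched
roughly as below. This is only an illustration under stated
assumptions, not how any particular editor does it: the names
LineBuffer, replace, repack, unpack and to_text are made up, "full" is
modelled by a hypothetical capacity counter (a Python bytearray would
otherwise just grow), and the "fully unpacked" form is taken to be a
plain list of code points with no formatting metadata.

```python
class LineBuffer:
    """Line table over one UTF-8 byte buffer (all names are hypothetical)."""

    def __init__(self, text: str = "", capacity: int = 64) -> None:
        self.capacity = max(1, capacity)  # assumed notion of "buffer full"
        self.buf = bytearray()            # UTF-8 bytes of live and stale lines
        self.lines = []                   # per line: (offset, size) into buf
        for line in text.split("\n"):
            self.lines.append(self._store(line.encode("utf-8")))

    def _store(self, data: bytes):
        """Append data at the end of the buffer, repacking/expanding if full."""
        if len(self.buf) + len(data) > self.capacity:
            self.repack()                 # first try to drop stale bytes
            while len(self.buf) + len(data) > self.capacity:
                self.capacity *= 2        # still too small: expand
        offset = len(self.buf)
        self.buf += data
        return (offset, len(data))

    def repack(self) -> None:
        """Copy only the live lines, in order, into a fresh buffer."""
        packed = bytearray()
        for i, (offset, size) in enumerate(self.lines):
            self.lines[i] = (len(packed), size)
            packed += self.buf[offset:offset + size]
        self.buf = packed

    def get(self, n: int) -> str:
        """Return line n decoded from its compact UTF-8 form."""
        offset, size = self.lines[n]
        return self.buf[offset:offset + size].decode("utf-8")

    def unpack(self, n: int):
        """Return line n as a list of code points (one int per character)."""
        return [ord(ch) for ch in self.get(n)]

    def replace(self, n: int, new_text: str) -> None:
        """Modify line n by reallocating it at the end of the buffer."""
        self.lines[n] = (0, 0)            # old bytes become stale
        self.lines[n] = self._store(new_text.encode("utf-8"))

    def to_text(self) -> str:
        """Emit the lines in order, as when writing back to a file."""
        return "\n".join(self.get(i) for i in range(len(self.lines)))
```

With this sketch, replacing a line only appends to the buffer; the
stale bytes are reclaimed the next time repack() runs. None of this
addresses pasted images or per-character fonts, which would need more
than a byte buffer and a line table.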