Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Unicode in strings
Date: Fri, 31 May 2024 12:14:19 -0500
Organization: A noiseless patient Spider
Lines: 44
Message-ID: <v3d0hj$2amga$1@dont-email.me>
References: <v0s17o$2okf4$2@dont-email.me>
 <2024May11.173149@mips.complang.tuwien.ac.at>
 <v1preb$2jn47$1@dont-email.me>
 <2024May12.110053@mips.complang.tuwien.ac.at>
 <jwvjzjwid50.fsf-monnier+comp.arch@gnu.org>
 <2024May18.072920@mips.complang.tuwien.ac.at>
 <jwved9t656u.fsf-monnier+comp.arch@gnu.org>
 <2024May25.174807@mips.complang.tuwien.ac.at>
 <jwvy17ty8v7.fsf-monnier+comp.arch@gnu.org>
 <2024May29.085955@mips.complang.tuwien.ac.at>
 <jwv5xuwwuqe.fsf-monnier+comp.arch@gnu.org>
 <2024May30.182546@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 31 May 2024 19:14:28 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="6ea1dc31a293772695e7d714cf6f6549";
 logging-data="2447882"; mail-complaints-to="abuse@eternal-september.org";
 posting-account="U2FsdGVkX1/2TyZY+mi7+gaizhLojwaSKND4leB7b6E="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:WixLjhBA7QpMOtGvxA+0eMkUczg=
In-Reply-To: <2024May30.182546@mips.complang.tuwien.ac.at>
Content-Language: en-US
Bytes: 3525

On 5/30/2024 11:25 AM, Anton Ertl wrote:
> Stefan Monnier <monnier@iro.umontreal.ca> writes:
>> I'm not sure the codepoint-oriented API is the best option, but it's not
>> completely clear what *is* the best option. You mention a byte-oriented
>> API and you might be right that it's a better option, but in the case of
>> Emacs that's what we used in Emacs-20.1 but it worked really poorly
>> because of backward compatibility issues. I think if we started from
>> scratch now (i.e. without having to contend with backward compatibility,
>> and with a better understanding of Unicode (which barely existed back
>> then)) it might work better, indeed, but that's not been an option
>
> Plus, editors are among the very few uses where you have to deal with
> individual characters, so the "treat it as opaque string" approach
> that works so well for most other code is not good enough there. The
> command-line editor of Gforth is one case where we use the xchar words
> (those for dealing with code points of UTF-8).
>

Yeah.

Text editors are one of the few cases where it makes sense to use 32- or
64-bit characters (say, combining the 'character' with some additional
metadata such as formatting).

Though, one thing that makes sense for text editors is to keep only the
"currently being edited" lines fully unpacked, while the others remain
in a more compact form (such as UTF-8) and are unpacked as they come
into view (say, treating the editor window as a 32-entry modulo cache or
similar).

For the rest, one can have, say, a big buffer, with an array of lines
giving the location and size of each line's text in the buffer. If a
line is modified, it can be reallocated at the end of the buffer, and if
the buffer gets full, it can be "repacked" and/or expanded as needed.

When written back to a file, the buffer lines can be emitted in order to
the text file.

Not entirely sure how other text editors manage things here; I have not
really looked into it.

> - anton
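
The buffer scheme described in the post can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the post or from any real editor: the names `LineStore`, `CACHE_SLOTS`, `unpack`, `set_line` and `to_text` are hypothetical, the "unpacked" form is simply a list of code points (the formatting metadata mentioned in the post is left out), and "the buffer gets full" is modeled with a soft `capacity` limit that is doubled when repacking alone does not free enough room.

```python
CACHE_SLOTS = 32  # hypothetical size of the modulo cache of unpacked lines


class LineStore:
    """Lines kept as UTF-8 in one big buffer, located through a line table."""

    def __init__(self, text="", capacity=64):
        self.capacity = max(capacity, 1)     # soft limit standing in for "buffer full"
        self.buf = bytearray()               # packed UTF-8 text, live and dead
        self.lines = []                      # (offset, size) per line, in file order
        self.cache = [None] * CACHE_SLOTS    # slot -> (line_no, code points)
        for line in text.split("\n"):
            self.lines.append(self._store(line.encode("utf-8")))

    def _store(self, data):
        """Append data at the end of the buffer, repacking or growing if full."""
        if len(self.buf) + len(data) > self.capacity:
            self._repack()
            while len(self.buf) + len(data) > self.capacity:
                self.capacity *= 2
        offset = len(self.buf)
        self.buf += data
        return (offset, len(data))

    def _repack(self):
        """Copy live lines, in order, into a fresh buffer, dropping dead text."""
        packed = bytearray()
        table = []
        for offset, size in self.lines:
            table.append((len(packed), size))
            packed += self.buf[offset:offset + size]
        self.buf, self.lines = packed, table

    def get_packed(self, n):
        """Return the compact (UTF-8) form of line n."""
        offset, size = self.lines[n]
        return bytes(self.buf[offset:offset + size])

    def unpack(self, n):
        """Return line n as a list of code points, via the modulo cache."""
        slot = n % CACHE_SLOTS
        entry = self.cache[slot]
        if entry is None or entry[0] != n:   # miss: decode and evict the old entry
            entry = (n, [ord(c) for c in self.get_packed(n).decode("utf-8")])
            self.cache[slot] = entry
        return entry[1]

    def set_line(self, n, codepoints):
        """Replace line n; the new text is reallocated at the end of the buffer."""
        data = "".join(map(chr, codepoints)).encode("utf-8")
        self.lines[n] = (0, 0)               # the old text becomes dead space
        self.lines[n] = self._store(data)
        self.cache[n % CACHE_SLOTS] = (n, list(codepoints))

    def to_text(self):
        """Emit the lines in order, as when writing back to a file."""
        return "\n".join(self.get_packed(n).decode("utf-8")
                         for n in range(len(self.lines)))


store = LineStore("héllo\nworld\nlast", capacity=16)
edited = store.unpack(1) + [ord("!")]        # edit a copy of the unpacked line
store.set_line(1, edited)                    # buffer is full, so this repacks first
print(store.to_text().split("\n"))           # ['héllo', 'world!', 'last']
```

In this sketch a modified line always lands at the end of the buffer, so the line table is the only thing that preserves file order; repacking walks that table and leaves the dead copies behind.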