From: Stefan Monnier <monnier@iro.umontreal.ca>
Newsgroups: comp.arch
Subject: Re: Byte Addressability And Beyond
Date: Tue, 04 Jun 2024 16:28:00 -0400
Organization: A noiseless patient Spider
Message-ID: <jwv7cf4mpug.fsf-monnier+comp.arch@gnu.org>
References: <v0s17o$2okf4$2@dont-email.me> <v31c4r$3u28v$1@dont-email.me>
 <v327n3$1use$1@gal.iecc.com> <BM25O.40665$HBac.4762@fx15.iad>
 <v32lpv$1u25$1@gal.iecc.com> <v33bqg$9cst$11@dont-email.me>
 <v34v62$ln01$1@dont-email.me> <v36bva$10k3v$2@dont-email.me>
 <2024May29.090435@mips.complang.tuwien.ac.at> <cIG5O.25483$gKW1.4042@fx13.iad>
 <jwvcyp4veqj.fsf-monnier+comp.arch@gnu.org> <I5I5O.9419$czG6.9020@fx02.iad>
 <jwv1q5kvcnm.fsf-monnier+comp.arch@gnu.org> <1uJ5O.2$gn%7.1@fx12.iad>
 <2024May30.173537@mips.complang.tuwien.ac.at> <pbI6O.19524$61Y8.11175@fx15.iad>

> If I want to validate combiner codes or normalize characters I need
> UTF-32 because I have to work with the whole character as a unit.

You can read the code points directly from the UTF-8 sequence almost as
easily as you can from a UTF-32 sequence.  Most of the cost will be in
the memory accesses and then in looking up the various tables to decide
how to normalize or whether it's valid, so the difference between
reading the info from UTF-32 or UTF-8 should be lost in the noise.

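As a rough illustration of reading code points straight out of UTF-8 (a
sketch added for clarity, not code from the thread), the Python below
decodes by hand.  The name `utf8_code_points` is made up for this example,
and the validity checks are the basic ones I am sure of (lead and
continuation byte patterns, overlong forms, surrogates, range); it does
not attempt normalization or combining-sequence checks.

```python
def utf8_code_points(data: bytes):
    """Yield Unicode code points decoded by hand from UTF-8 bytes.

    Hypothetical helper for illustration; raises ValueError on malformed input.
    """
    # Smallest code point that legitimately needs 1, 2, 3 or 4 bytes;
    # anything below that is an "overlong" encoding and must be rejected.
    min_for_len = (0x0, 0x80, 0x800, 0x10000)
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                 # 0xxxxxxx: ASCII, no continuation bytes
            cp, extra = b0, 0
        elif b0 >> 5 == 0b110:        # 110xxxxx + 1 continuation byte
            cp, extra = b0 & 0x1F, 1
        elif b0 >> 4 == 0b1110:       # 1110xxxx + 2 continuation bytes
            cp, extra = b0 & 0x0F, 2
        elif b0 >> 3 == 0b11110:      # 11110xxx + 3 continuation bytes
            cp, extra = b0 & 0x07, 3
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + extra >= n:
            raise ValueError(f"truncated sequence at offset {i}")
        for b in data[i + 1 : i + 1 + extra]:
            if (b & 0xC0) != 0x80:    # continuation bytes look like 10xxxxxx
                raise ValueError(f"bad continuation byte after offset {i}")
            cp = (cp << 6) | (b & 0x3F)
        if cp < min_for_len[extra] or 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError(f"overlong, surrogate or out-of-range at offset {i}")
        yield cp
        i += 1 + extra


# The hand decoder agrees with Python's own view of the string.
text = "na\u00efve caf\u00e9"
assert list(utf8_code_points(text.encode("utf-8"))) == [ord(c) for c in text]
```

Per code point this is one branch on the lead byte plus a few shifts and
masks, which is the sense in which the extra work should be small next to
the table lookups that follow.
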
UTF-32 might be marginally faster at this specific operation in some
cases (definitely not if your text is mostly ASCII), but I'd be very
surprised if the difference is ever large enough to pay for a conversion
from UTF-8 to UTF-32.

> I was just trying to get people thinking of ways that malformed
> characters might be used to bypass other validation checks in
> their software.

Another issue with Unicode is the so-called "confusables": things that
may look identical (or close enough) on screen yet are different (and
not just because of normalization).  E.g. Β vs B, А vs A, or ∕ vs / vs ⁄.
Unicode comes with a 700kB `confusables.txt` listing such issues.


        Stefan
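
To make the confusables point concrete, here is a small Python sketch
(added for illustration, not part of the original post) using only the
standard `unicodedata` module.  The helper names `describe` and
`nfkc_equal` are invented for this example; the pairs are the ones
mentioned above.  It shows that the look-alikes are distinct code points
and that even NFKC normalization does not fold them together.

```python
import unicodedata

# Look-alike pairs from the text: Latin vs Greek, Latin vs Cyrillic,
# and solidus vs division slash vs fraction slash.
PAIRS = [
    ("B", "\u0392"),
    ("A", "\u0410"),
    ("/", "\u2215"),
    ("/", "\u2044"),
]


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (illustrative helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


def nfkc_equal(a: str, b: str) -> bool:
    """True if the two strings are identical after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)


for plain, lookalike in PAIRS:
    print(f"{describe(plain)}  vs  {describe(lookalike)}  "
          f"NFKC-equal: {nfkc_equal(plain, lookalike)}")
```

Catching these needs a dedicated mapping such as the one in
`confusables.txt`; normalization alone is not designed for it.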