Article <jwv7cf4mpug.fsf-monnier+comp.arch@gnu.org>

Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Stefan Monnier <monnier@iro.umontreal.ca>
Newsgroups: comp.arch
Subject: Re: Byte Addressability And Beyond
Date: Tue, 04 Jun 2024 16:28:00 -0400
Organization: A noiseless patient Spider
Lines: 25
Message-ID: <jwv7cf4mpug.fsf-monnier+comp.arch@gnu.org>
References: <v0s17o$2okf4$2@dont-email.me> <v31c4r$3u28v$1@dont-email.me>
	<v327n3$1use$1@gal.iecc.com> <BM25O.40665$HBac.4762@fx15.iad>
	<v32lpv$1u25$1@gal.iecc.com> <v33bqg$9cst$11@dont-email.me>
	<v34v62$ln01$1@dont-email.me> <v36bva$10k3v$2@dont-email.me>
	<2024May29.090435@mips.complang.tuwien.ac.at>
	<cIG5O.25483$gKW1.4042@fx13.iad>
	<jwvcyp4veqj.fsf-monnier+comp.arch@gnu.org>
	<I5I5O.9419$czG6.9020@fx02.iad>
	<jwv1q5kvcnm.fsf-monnier+comp.arch@gnu.org> <1uJ5O.2$gn%7.1@fx12.iad>
	<2024May30.173537@mips.complang.tuwien.ac.at>
	<pbI6O.19524$61Y8.11175@fx15.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 04 Jun 2024 22:28:07 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="a7a1d6ef33a3325082a534d5b628fdcc";
	logging-data="612847"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19NsOe9HkaPA+vzUX5j2iXWbJFHQbkmHvc="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:Gqurr7O2HM1CmKOkzjszHD79cys=
	sha1:lgtJiZwImSsb6NyazFpiR0udZFI=
Bytes: 2728

> If I want to validate combiner codes or normalize characters I need
> UTF-32 because I have to work with the whole character as a unit.

You can read the code points directly from the UTF-8 sequence almost
as easily as you can from a UTF-32 sequence.
Most of the cost will be in the memory accesses and then in looking up the
various tables to decide how to normalize or whether it's valid, so the
difference between reading the info from UTF-32 or UTF-8 should be lost in
the noise.
UTF-32 might be marginally faster at this specific operation in some
cases (definitely not if your text is mostly ASCII), but I'd be very
surprised if the difference is ever large enough to pay for a conversion
from UTF-8 to UTF-32.
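
To make the "almost as easily" concrete, here is a minimal sketch of
reading code points straight out of a UTF-8 byte sequence.  The function
name `utf8_code_points` is made up for illustration, and Python is used
only for brevity; it says nothing about relative speed.  The decoder
rejects overlong forms, surrogates, and values above U+10FFFF, which are
the kind of malformed sequences that could otherwise slip past a later
validation step.

```python
def utf8_code_points(data: bytes):
    """Yield code points from UTF-8 bytes; raise ValueError if malformed.

    Hypothetical helper for illustration, not taken from any library.
    """
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        # The lead byte gives the payload bits, the number of continuation
        # bytes, and the smallest code point this length may encode.
        if b0 < 0x80:
            cp, extra, minimum = b0, 0, 0x0
        elif 0xC2 <= b0 <= 0xDF:        # 0xC0 and 0xC1 are always overlong
            cp, extra, minimum = b0 & 0x1F, 1, 0x80
        elif 0xE0 <= b0 <= 0xEF:
            cp, extra, minimum = b0 & 0x0F, 2, 0x800
        elif 0xF0 <= b0 <= 0xF4:
            cp, extra, minimum = b0 & 0x07, 3, 0x10000
        else:
            raise ValueError(f"invalid lead byte at offset {i}")
        if i + extra >= n:
            raise ValueError(f"truncated sequence at offset {i}")
        for k in range(1, extra + 1):
            b = data[i + k]
            if b & 0xC0 != 0x80:
                raise ValueError(f"bad continuation byte at offset {i + k}")
            cp = (cp << 6) | (b & 0x3F)
        if cp < minimum:
            raise ValueError(f"overlong encoding at offset {i}")
        if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
            raise ValueError(f"invalid code point at offset {i}")
        yield cp
        i += extra + 1
```

For valid input, `list(utf8_code_points(s.encode("utf-8")))` should match
`[ord(c) for c in s]`, while the overlong slash `b"\xc0\xaf"` raises
`ValueError` instead of decoding to `/`.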

> I was just trying to get people thinking of ways that malformed
> characters might be used to bypass other validation checks in
> their software.

Another issue with Unicode is the so-called "confusables": things that
may look identical (or close enough) on screen yet are different (and
not just because of normalization).  E.g. Β vs B, А vs A, or ∕ vs / vs ⁄.
Unicode comes with a 700kB `confusables.txt` (part of the UTS #39 security
data) listing such look-alikes.
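
A small sketch of why normalization does not help here.  It assumes the
look-alikes above are U+0392 (Greek Beta), U+0410 (Cyrillic A), U+2215
(division slash) and U+2044 (fraction slash); the helper names are made
up for illustration.  It uses only Python's standard `unicodedata`
module and does not read `confusables.txt` itself.

```python
import unicodedata

# Assumed code points for the look-alike pairs (an assumption, see above).
PAIRS = [
    ("\u0392", "B"),   # Greek capital Beta vs Latin B
    ("\u0410", "A"),   # Cyrillic capital A vs Latin A
    ("\u2215", "/"),   # division slash vs solidus
    ("\u2044", "/"),   # fraction slash vs solidus
]

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

def distinct_after_nfkc(a: str, b: str) -> bool:
    """True if a and b still differ after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) != unicodedata.normalize("NFKC", b)

for lookalike, plain in PAIRS:
    print(describe(lookalike), "vs", describe(plain),
          "| distinct after NFKC:", distinct_after_nfkc(lookalike, plain))
```

Each pair stays distinct even under NFKC, so catching these needs a
separate confusables check rather than normalization alone.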

        Stefan