From: Terje Mathisen <terje.mathisen@tmsw.no>
Newsgroups: comp.arch
Subject: Re: Byte Addressability And Beyond
Date: Fri, 7 Jun 2024 17:05:42 +0200
Message-ID: <v3v7k7$24548$1@dont-email.me>
In-Reply-To: <cKE8O.2$bR_f.1@fx07.iad>

EricP wrote:
> Stefan Monnier wrote:
>>
>> Another issue with Unicode is the so-called "confusables": things that
>> may look identical (or close enough) on screen yet are different (and
>> not just because of normalization).  E.g. Β vs B, А vs A, or ∕ vs
>> / vs ⁄.
>> Unicode comes with a 700kB `confusables.txt` listing such issues.
>
> Eeewww... I didn't even think of that.
> What does one do about them? You can't treat them as equivalent in a
> string compare... the user might want the first B and not second B.
>
> I suppose one would want two compare equal functions,
> an exactly equal, and a visually approximately equal.
> Like using a soundex for words to catch misspellings.
>
> But then programmers need to decide when to use each compare.
>
> These character and code attribute lookup tables are looking awkward.
> With up to 2M codes, and some base character codes having multiple
> possible combiners, but very sparse. And links between entries
> for upper and lower case, and now links between confusables.
> And we don't want to roll over the L1 cache just to do a string compare.

Years ago I considered case-insensitive Boyer-Moore text search with a
wide alphabet and found that the only approach that made sense was to
maintain two copies of the string to be searched for, one lower and one
upper case, where each "character" was a length-encoded string. This was
required to handle things like the German double s which can uppercase
into a single letter.

The lookup table for skip lengths was still far shorter than the
alphabet size, effectively a very short and fast hash of the current
character/codepoint/combined letter.

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"
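
For illustration only, here is a minimal Python sketch of the hashed
skip table idea from the post above. It is not the original
implementation. It uses the Horspool simplification of Boyer-Moore, and
instead of keeping lower and upper case copies of the pattern as
length-encoded strings, it swaps in full case folding of both pattern
and text (Python's `str.casefold`, which turns "ß" into "ss"). The names
`TABLE_SIZE`, `_slot`, `fold_with_origins` and `find_caseless`, and the
table size of 64, are assumptions made up for this example.

```python
TABLE_SIZE = 64  # assumed size; far smaller than the Unicode code space


def _slot(ch):
    # Very short, fast hash of a code point into the small skip table.
    return ord(ch) % TABLE_SIZE


def fold_with_origins(s):
    # Case-fold one character at a time. A single character may fold to
    # several (e.g. "ß" -> "ss"), so remember which original index each
    # folded character came from.
    folded, origins = [], []
    for i, ch in enumerate(s):
        for f in ch.casefold():
            folded.append(f)
            origins.append(i)
    return folded, origins


def find_caseless(text, pattern):
    # Return the index in `text` of the first caseless match, or -1.
    pat, _ = fold_with_origins(pattern)
    txt, origins = fold_with_origins(text)
    m, n = len(pat), len(txt)
    if m == 0:
        return 0

    # Horspool skip table indexed by hash slot, not by code point.
    # Colliding characters share a slot and keep the smaller skip, which
    # can only make the search slower, never skip past a match.
    skip = [m] * TABLE_SIZE
    for i, ch in enumerate(pat[:-1]):
        slot = _slot(ch)
        skip[slot] = min(skip[slot], m - 1 - i)

    pos = 0
    while pos + m <= n:
        if txt[pos:pos + m] == pat:
            return origins[pos]
        # Shift by the skip for the last character of the current window.
        pos += skip[_slot(txt[pos + m - 1])]
    return -1
```

Usage would look like `find_caseless("Die STRASSE ist lang", "straße")`.
Folding the whole text up front costs an extra pass and extra memory,
which the two-copy pattern approach avoids; the sketch trades that away
for brevity.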