Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.arch
Subject: Re: Byte Addressability And Beyond
Date: Fri, 10 May 2024 09:31:00 +0200
Organization: A noiseless patient Spider
Lines: 33
Message-ID: <v1kifk$17qh0$1@dont-email.me>
References: <v0s17o$2okf4$2@dont-email.me> <4e0557bec2acda4df76f1ed01ebcbdf6@www.novabbs.org> <v1ep4i$1ptf$1@gal.iecc.com> <v1eruj$3o1r8$1@dont-email.me> <v1h8l6$1ttd$1@gal.iecc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 10 May 2024 09:31:01 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="633d7e186bd79b4c56fa5f1c4cde0101"; logging-data="1305120"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18X9s4/HXOOK9Nut5Ct0sSa1rSNIzmrFLQ="
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.11.0
Cancel-Lock: sha1:2r3fH6yOjv9iIilKZjNN8PM+VUU=
In-Reply-To: <v1h8l6$1ttd$1@gal.iecc.com>
Content-Language: en-GB
Bytes: 2506

On 09/05/2024 03:24, John Levine wrote:
> According to Lawrence D'Oliveiro <ldo@nz.invalid>:
>> On Wed, 8 May 2024 02:47:46 -0000 (UTC), John Levine wrote:
>>
>>> It doesn't make sense to say that character strings are big- or
>>> little-endian.
>>
>> Yes it does, for just about any encoding other than UTF-8. Thus, you have
>> UTF16BE, and UTF16LE.
>
> Not really, those are byte orders within a character, not order of characters.
>

Or rather, they are byte orders used by different encodings of code
points. ("Characters" in Unicode are more complicated - nothing is ever
simple in Unicode!) There are no endian issues between code points, and
a "string" as far as Unicode is concerned would be a sequence of code points.
You only have endian issues if you want to store these 21-bit integers
in a format that is encoded in smaller lumps (like byte-addressed memory).

> If you look at surrogates, you can UTF16 is big-endian. First there's the high
> surrogate, then the low one.
>
> There's a reason that every encoding other than UTF-8 is dead. Who needs the grief?

Indeed. UTF-32 is fine for internal use, however - using whatever
endianness your processor prefers. The trick is never to let it leave
the one computer in any encoding other than UTF-8.
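
To make the distinction concrete, here is a minimal Python sketch (not part of the original exchange). It uses only the standard codecs; the helper names `encodings` and `utf16_units` and the choice of U+1F600 as the sample code point are mine, purely for illustration. It shows that the BE/LE variants differ only in the byte order inside each code unit, while the order of the code units themselves (high surrogate first, then low) is the same in both.

```python
import sys

def encodings(ch):
    """Return the bytes of one code point under several encoding schemes."""
    names = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
    return {name: ch.encode(name) for name in names}

def utf16_units(data, byteorder):
    """Split UTF-16 bytes into 16-bit code units using the given byte order."""
    return [int.from_bytes(data[i:i + 2], byteorder)
            for i in range(0, len(data), 2)]

# U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it.
enc = encodings("\U0001F600")
for name, data in enc.items():
    print(f"{name:10} {data.hex()}")

# Byte order differs inside each code unit, but the sequence of code units
# is identical: high surrogate 0xD83D first, low surrogate 0xDE00 second.
be_units = utf16_units(enc["utf-16-be"], "big")
le_units = utf16_units(enc["utf-16-le"], "little")
print([hex(u) for u in be_units], [hex(u) for u in le_units])

# "Internal use" sketch: pick the UTF-32 variant matching this machine's
# byte order, but always convert to UTF-8 before data leaves the process.
native_utf32 = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
internal = "\U0001F600".encode(native_utf32)
on_the_wire = internal.decode(native_utf32).encode("utf-8")
```

UTF-8 has no BE/LE variants because its code units are single bytes, so there is nothing inside a unit to reorder.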