Path: news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Keith Thompson
Newsgroups: comp.lang.c
Subject: Re: Rationale for aligning data on even bytes in a Unix shell file?
Date: Thu, 08 May 2025 17:19:37 -0700
Organization: None to speak of
Lines: 85
Message-ID: <87wmaqd49y.fsf@nosuchdomain.example.com>
References: <20250507202430.00005bb9@yahoo.com> <87v7qaerg8.fsf@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 09 May 2025 02:19:38 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="302a6dd640940106301f9e87fdade96e"; logging-data="2375298"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/c4bJwNK8cRhpyLYSZWjjo"
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:9bQiStzSvmgPFicKWUgW6kEwZg0= sha1:pka/Xe0OY3OHH5o1HyQMNfloKNk=

BGB writes:
> On 5/8/2025 4:13 PM, Keith Thompson wrote:
>> BGB writes:
>>> On 5/8/2025 6:13 AM, Janis Papanagnou wrote:
>>>> On 08.05.2025 05:30, BGB wrote:
>>>>> [...]
>>>>>
>>>>> Though, even for the Latin alphabet, once one goes much outside of ASCII
>>>>> and Latin-1, it gets messy.
>>>> I noticed that in several places you were referring to
>>>> Latin-1. Since
>>>> decades that has been replaced by the Latin-9 (ISO 8859-15) character
>>>> set[*] for practical reasons ('€' sign, for example).
>>>> Why is your focus still on the old Latin-1 (ISO 8859-1) character
>>>> set?
>>>> Janis, just curious
>>>> [*] Unless Unicode and its encodings are used.
>>>
>>> U+00A0..U+00FF are designated as Latin-1 in Unicode.
>> I don't think that's accurate. Do you have a reference for that?
>
> https://en.wikipedia.org/wiki/Latin-1_Supplement
>
> Would seem to somewhat imply that this range of codepoints is known as
> Latin-1...

The article says that range is called the "Latin-1 Supplement" (I didn't
know that). But it's a supplement derived from a *subset* of the Latin-1
character set.
Latin-1 itself is an 8-bit character set representing 256 distinct
characters (and matching ASCII for 0x00 to 0x7f). The "Latin-1
Supplement" is just U+0080 to U+00FF, 128 characters. The supplement
consists of 32 obscure control characters (U+0080 to U+009F) and 96
printable characters (U+00A0 to U+00FF).

The Latin-1 8-bit character set is largely obsolete. Whatever point
you're making, I suspect you could make it much more clearly without
any reference to Latin-1 or Windows-1252.

I think the issue that led to this discussion was how to define case
mapping for case-insensitive file systems. My personal preference is
to use case-sensitive file systems, but that's not always an option.
NTFS is case-insensitive by default, which means it has to have rules
for mapping lowercase to uppercase and vice versa, and for determining
whether two distinct character values are "the same" ('a' and 'A', for
example). We could discuss at great length how NTFS *should* do this,
but surely that determination has already been made in the definition
of NTFS. (I don't know what the rules are.)

>> It's true that those characters have the same names in Unicode
>> as in Latin-1. Though the Wikipedia article says that the ranges
>> 0x00..0x1F and 0x7F..0x9F are *undefined*. (That doesn't match my
>> recollection; I thought they were defined as control characters.)
>>
>
> 0000..001F, usually understood as C0 control codes.
>
> 0080..009F, usually understood as C1 control codes.
>
> But, I don't bother with C1 control codes, as they are unused, and
> interpreting them as aliases for the other characters that appear in
> 1252 is more useful, and seemingly not entirely unorhodox.

The Windows-1252 character set, yet another 8-bit extension to 7-bit
ASCII, assigns printable characters to the range 0x80 to 0x9f (with
some gaps), where both Latin-1 and Unicode have obscure control
characters. But all those printable characters have Unicode code
points.
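
To make the point about the 0x80 to 0x9f range concrete, here is a
minimal C sketch of that mapping. The function names cp1252_to_unicode
and utf8_encode are made up for illustration and are not taken from
any library. The table follows the commonly published Windows-1252
mapping as I understand it, with 0 marking the five unassigned bytes
(0x81, 0x8d, 0x8f, 0x90, 0x9d); check it against an authoritative
table before relying on it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Code points for Windows-1252 bytes 0x80..0x9F.
   A 0 entry marks a byte that Windows-1252 leaves unassigned. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178
};

/* Hypothetical helper: map one Windows-1252 byte to a Unicode code
   point.  Returns -1 for the unassigned bytes. */
long cp1252_to_unicode(unsigned char b)
{
    if (b < 0x80 || b > 0x9F)
        return b;  /* outside 0x80..0x9F the value equals the Latin-1
                      (and Unicode) code point */
    uint16_t cp = cp1252_high[b - 0x80];
    return cp ? (long)cp : -1L;
}

/* Hypothetical helper: encode a code point up to U+FFFF as UTF-8.
   That limit is enough for every entry in the table above; surrogates
   and code points above U+FFFF are deliberately not handled.
   Returns the number of bytes written to out (1 to 3). */
size_t utf8_encode(unsigned long cp, unsigned char out[3])
{
    assert(cp <= 0xFFFF);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (cp >> 12));
    out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
}
```

For example, cp1252_to_unicode(0x80) yields 0x20AC (the euro sign),
which utf8_encode turns into the three bytes E2 82 AC, while
cp1252_to_unicode(0x81) returns -1 because that byte is one of the
gaps.
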
If you want to use Windows-1252 or Latin-1, you can do that, but surely
just using Unicode (preferably with a UTF-8 encoding) is going to cause
fewer problems.

[...]

> It is 8-bit and byte-based, and informally I think, most
> extended-ASCII codepages were collectively known as ASCII even if only
> the low 7-bit range is ASCII proper (and I think more for sake of
> contrast with "Not Unicode", eg, UTF-8 / UTF-16 / UCS-2 / ...).

No, 8-bit character sets are not ASCII. Calling them "extended ASCII"
is reasonable.

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */