Path: news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Keith Thompson <Keith.S.Thompson+u@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: Rationale for aligning data on even bytes in a Unix shell file?
Date: Thu, 08 May 2025 17:19:37 -0700
Organization: None to speak of
Lines: 85
Message-ID: <87wmaqd49y.fsf@nosuchdomain.example.com>
References: <vuih43$2agfa$1@dont-email.me> <vund1f$2rh3j$1@dont-email.me>
	<vungko$2uoa2$1@raubtier-asyl.eternal-september.org>
	<X9MPP.1383458$f81.819466@fx48.iad>
	<vuobri$3o38b$1@raubtier-asyl.eternal-september.org>
	<XtOPP.2986761$t84d.2537581@fx11.iad>
	<vuohq9$3tlhf$1@raubtier-asyl.eternal-september.org>
	<vuoig5$3ub4j$1@dont-email.me>
	<vuorpf$6tnn$1@raubtier-asyl.eternal-september.org>
	<vup2nt$bi1k$2@dont-email.me>
	<vupofl$13pg2$2@raubtier-asyl.eternal-september.org>
	<vuprce$15sqo$2@dont-email.me>
	<vvd6n5$353gs$1@raubtier-asyl.eternal-september.org>
	<vvfbnj$ulpc$1@dont-email.me> <vvflec$11b72$1@dont-email.me>
	<20250507202430.00005bb9@yahoo.com> <vvh8qg$1ha26$2@dont-email.me>
	<vvi3k6$1o09d$1@dont-email.me> <vvj3qe$246ff$1@dont-email.me>
	<87v7qaerg8.fsf@nosuchdomain.example.com>
	<vvjg9u$28sh0$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 09 May 2025 02:19:38 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="302a6dd640940106301f9e87fdade96e";
	logging-data="2375298"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/c4bJwNK8cRhpyLYSZWjjo"
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:9bQiStzSvmgPFicKWUgW6kEwZg0=
	sha1:pka/Xe0OY3OHH5o1HyQMNfloKNk=

BGB <cr88192@gmail.com> writes:
> On 5/8/2025 4:13 PM, Keith Thompson wrote:
>> BGB <cr88192@gmail.com> writes:
>>> On 5/8/2025 6:13 AM, Janis Papanagnou wrote:
>>>> On 08.05.2025 05:30, BGB wrote:
>>>>> [...]
>>>>>
>>>>> Though, even for the Latin alphabet, once one goes much outside of ASCII
>>>>> and Latin-1, it gets messy.
>>>> I noticed that in several places you were referring to Latin-1.
>>>> Since decades that has been replaced by the Latin-9 (ISO 8859-15)
>>>> character set[*] for practical reasons ('€' sign, for example).
>>>> Why is your focus still on the old Latin-1 (ISO 8859-1) character
>>>> set?
>>>> Janis, just curious
>>>> [*] Unless Unicode and its encodings are used.
>>>
>>> U+00A0..U+00FF are designated as Latin-1 in Unicode.
>> I don't think that's accurate.  Do you have a reference for that?
>
> https://en.wikipedia.org/wiki/Latin-1_Supplement
>
> Would seem to somewhat imply that this range of codepoints is known as
> Latin-1...

The article says that range is called the "Latin-1 Supplement" (I
didn't know that).  But it's a supplement derived from a *subset*
of the Latin-1 character set.  Latin-1 itself is an 8-bit character
set representing 256 distinct characters (and matching ASCII for
0x00 to 0x7f).  The "Latin-1 Supplement" is just U+0080 to U+00FF,
128 characters.  The supplement consists of 32 obscure control
characters (U+0080 to U+009F) and 96 printable characters (U+00A0
to U+00FF).
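
To make the layout of that block concrete, here's a minimal sketch in C
that partitions U+0080..U+00FF into the C1 control codes (U+0080 to
U+009F, 32 code points) and the graphic characters (U+00A0 to U+00FF,
96 code points).  The names l1s_class, l1s_classify, and l1s_count are
mine, made up for illustration; they don't come from any standard or
library.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
enum l1s_class {
    L1S_OUTSIDE,  /* not in U+0080..U+00FF */
    L1S_CONTROL,  /* U+0080..U+009F: C1 control codes */
    L1S_GRAPHIC   /* U+00A0..U+00FF: the ISO 8859-1 graphic characters */
};

enum l1s_class l1s_classify(uint32_t cp)
{
    if (cp < 0x80 || cp > 0xFF)
        return L1S_OUTSIDE;
    return cp <= 0x9F ? L1S_CONTROL : L1S_GRAPHIC;
}

/* Count the code points in [first, last] that fall into class `want`. */
unsigned l1s_count(uint32_t first, uint32_t last, enum l1s_class want)
{
    assert(first <= last);
    unsigned n = 0;
    /* Test for the end inside the loop so last == UINT32_MAX can't wrap. */
    for (uint32_t cp = first; ; cp++) {
        if (l1s_classify(cp) == want)
            n++;
        if (cp == last)
            break;
    }
    return n;
}
```

Calling l1s_count(0x80, 0xFF, L1S_CONTROL) and
l1s_count(0x80, 0xFF, L1S_GRAPHIC) gives the two counts for the block.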

The Latin-1 8-bit character set is largely obsolete.  Whatever point
you're making, I suspect you could make it much more clearly without
any reference to Latin-1 or Windows-1252.

I think the issue that led to this discussion was how to define case
mapping for case-insensitive file systems.  My personal preference is to
use case-sensitive file systems, but that's not always an option.

NTFS is case-insensitive by default, which means it has to have rules
for mapping lowercase to uppercase and vice versa, and for determining
whether two distinct character values are "the same" ('a' and 'A', for
example).  We could discuss at great length how NTFS *should* do this,
but surely that determination has already been made in the definition of
NTFS.  (I don't know what the rules are.)

>> It's true that those characters have the same names in Unicode
>> as in Latin-1.  Though the Wikipedia article says that the ranges
>> 0x00..0x1F and 0x7F..0x9F are *undefined*.  (That doesn't match my
>> recollection; I thought they were defined as control characters.)
>> 
>
> 0000..001F, usually understood as C0 control codes.
>
> 0080..009F, usually understood as C1 control codes.
>
> But, I don't bother with C1 control codes, as they are unused, and
> interpreting them as aliases for the other characters that appear in
> 1252 is more useful, and seemingly not entirely unorthodox.

The Windows-1252 character set, yet another 8-bit extension to 7-bit
ASCII, assigns printable characters to the range 0x80 to 0x9f (with some
gaps), where both Latin-1 and Unicode have obscure control characters.
But all those printable characters have Unicode code points.
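
As an illustration, here's a sketch of a Windows-1252 to Unicode lookup
in C.  The table follows the commonly published Windows-1252 mapping as
I understand it, with the five unassigned bytes (0x81, 0x8D, 0x8F, 0x90,
0x9D) reported as -1.  The function name cp1252_to_unicode and the
choice of -1 for the gaps are my own assumptions, not any library's API.

```c
#include <assert.h>
#include <stdint.h>

/* Unicode code points for Windows-1252 bytes 0x80..0x9F.
   0 marks a byte that Windows-1252 leaves unassigned. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0,      0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, /* 80..87 */
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0,      0x017D, 0,      /* 88..8F */
    0,      0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, /* 90..97 */
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0,      0x017E, 0x0178  /* 98..9F */
};

/* Hypothetical helper: map one Windows-1252 byte value to a Unicode
   code point, or return -1 for one of the five unassigned bytes. */
int32_t cp1252_to_unicode(unsigned b)
{
    assert(b <= 0xFF);  /* a byte value, even where CHAR_BIT > 8 */
    if (b < 0x80 || b > 0x9F)
        return (int32_t)b;  /* identical to Latin-1 and Unicode here */
    uint16_t cp = cp1252_high[b - 0x80];
    return cp != 0 ? (int32_t)cp : -1;
}
```

For example, cp1252_to_unicode(0x80) yields 0x20AC (the euro sign),
where Latin-1 and Unicode have a control character at 0x80.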

If you want to use Windows-1252 or Latin-1, you can do that, but surely
just using Unicode (preferably with a UTF-8 encoding) is going to cause
fewer problems.
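
For what it's worth, encoding a Unicode code point as UTF-8 takes only
a few lines of C.  This is a minimal sketch, not production code; the
name utf8_encode and the convention of returning 0 for an invalid code
point (a surrogate, or anything above U+10FFFF) are my own choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8 into out[0..3].
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate (U+D800..U+DFFF) or is greater than U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                    /* ASCII: one byte, unchanged */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* includes all of U+0080..U+00FF */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this, U+00E9 (a single byte 0xE9 in Latin-1) becomes the two bytes
0xC3 0xA9, and U+20AC (0x80 in Windows-1252) becomes 0xE2 0x82 0xAC.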

[...]

> It is 8-bit and byte-based, and informally I think, most
> extended-ASCII codepages were collectively known as ASCII even if only
> the low 7-bit range is ASCII proper (and I think more for sake of
> contrast with "Not Unicode", eg, UTF-8 / UTF-16 / UCS-2 / ...).

No, 8-bit character sets are not ASCII.  Calling them "extended ASCII"
is reasonable.

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */