Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: Rationale for aligning data on even bytes in a Unix shell file?
Date: Thu, 8 May 2025 18:50:33 -0500
Organization: A noiseless patient Spider
Lines: 80
Message-ID: <vvjg9u$28sh0$1@dont-email.me>
References: <vuih43$2agfa$1@dont-email.me> <vunbgo$2q5u8$1@dont-email.me>
 <vunbjg$2q72n$1@raubtier-asyl.eternal-september.org>
 <vund1f$2rh3j$1@dont-email.me>
 <vungko$2uoa2$1@raubtier-asyl.eternal-september.org>
 <X9MPP.1383458$f81.819466@fx48.iad>
 <vuobri$3o38b$1@raubtier-asyl.eternal-september.org>
 <XtOPP.2986761$t84d.2537581@fx11.iad>
 <vuohq9$3tlhf$1@raubtier-asyl.eternal-september.org>
 <vuoig5$3ub4j$1@dont-email.me>
 <vuorpf$6tnn$1@raubtier-asyl.eternal-september.org>
 <vup2nt$bi1k$2@dont-email.me>
 <vupofl$13pg2$2@raubtier-asyl.eternal-september.org>
 <vuprce$15sqo$2@dont-email.me>
 <vvd6n5$353gs$1@raubtier-asyl.eternal-september.org>
 <vvfbnj$ulpc$1@dont-email.me> <vvflec$11b72$1@dont-email.me>
 <20250507202430.00005bb9@yahoo.com> <vvh8qg$1ha26$2@dont-email.me>
 <vvi3k6$1o09d$1@dont-email.me> <vvj3qe$246ff$1@dont-email.me>
 <87v7qaerg8.fsf@nosuchdomain.example.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 09 May 2025 01:55:43 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="4eb7757fd2ee7830e59c36b37f7920e2";
	logging-data="2388512"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19FUtCwo1b9OkoDU4+ZLkEPZjdxC3B6I9M="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:sDst+VWaIMoSZSav606vkYQ1XY8=
Content-Language: en-US
In-Reply-To: <87v7qaerg8.fsf@nosuchdomain.example.com>
Bytes: 4810

On 5/8/2025 4:13 PM, Keith Thompson wrote:
> BGB <cr88192@gmail.com> writes:
>> On 5/8/2025 6:13 AM, Janis Papanagnou wrote:
>>> On 08.05.2025 05:30, BGB wrote:
>>>> [...]
>>>>
>>>> Though, even for the Latin alphabet, once one goes much outside of ASCII
>>>> and Latin-1, it gets messy.
>>> I noticed that in several places you were referring to Latin-1. Since
>>> decades that has been replaced by the Latin-9 (ISO 8859-15) character
>>> set[*] for practical reasons ('€' sign, for example).
>>> Why is your focus still on the old Latin-1 (ISO 8859-1) character set?
>>> Janis, just curious
>>> [*] Unless Unicode and its encodings are used.
>>>
>>
>> U+00A0..U+00FF are designated as Latin-1 in Unicode.
> 
> I don't think that's accurate.  Do you have a reference for that?

https://en.wikipedia.org/wiki/Latin-1_Supplement

The Unicode block covering U+0080..U+00FF is named "Latin-1 Supplement", 
which would seem to imply that this range of codepoints is known as 
Latin-1...


> It's true that those characters have the same names in Unicode
> as in Latin-1.  Though the Wikipedia article says that the ranges
> 0x00..0x1F and 0x7F..0x9F are *undefined*.  (That doesn't match my
> recollection; I thought they were defined as control characters.)
> 

0000..001F, usually understood as C0 control codes.

0080..009F, usually understood as C1 control codes.

But, I don't bother with C1 control codes, as they are rarely used in 
practice, and interpreting them as aliases for the other characters that 
appear in CP-1252 is more useful, and seemingly not entirely unorthodox.
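
As a rough sketch of that interpretation (Python just for brevity; the 
function name c1_as_cp1252 is made up for illustration, and it leans on 
Python's built-in "cp1252" codec, which leaves 0x81, 0x8D, 0x8F, 0x90 and 
0x9D unassigned, so those stay as C1 controls here):

```python
def c1_as_cp1252(text: str) -> str:
    """Reinterpret C1 code points (U+0080..U+009F) as CP-1252 characters."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x80 <= cp <= 0x9F:
            try:
                # Treat the code point value as a raw CP-1252 byte.
                ch = bytes([cp]).decode("cp1252")
            except UnicodeDecodeError:
                # Unassigned in CP-1252: keep the C1 control unchanged.
                pass
        out.append(ch)
    return "".join(out)
```

So, for example, U+0093 and U+0094 would come out as the curly double 
quotes U+201C and U+201D, while anything outside 0080..009F passes through 
untouched.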

> In any case, Latin-1 and Latin-9 treat those ranges in the same way.
> Both can be seen as encodings for small subsets of Unicode.
> 

Latin-9 does not exactly match up with U+00A0..U+00FF though, whereas 
for Latin-1, it does match up.
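
A quick way to check this, assuming Python's bundled "latin-1" and 
"iso8859-15" codecs are faithful to the two standards (the variable names 
are just for illustration):

```python
# Compare how each byte in 0xA0..0xFF decodes under Latin-1 and Latin-9.
diffs = []
for b in range(0xA0, 0x100):
    raw = bytes([b])
    latin1 = raw.decode("latin-1")
    latin9 = raw.decode("iso8859-15")
    # Latin-1 maps every byte straight onto the same code point value.
    assert ord(latin1) == b
    if latin9 != latin1:
        diffs.append((b, latin1, latin9))

for b, l1, l9 in diffs:
    print(f"0x{b:02X}: Latin-1 U+{ord(l1):04X}  Latin-9 U+{ord(l9):04X}")
```

Latin-9 should differ in eight positions (0xA4, 0xA6, 0xA8, 0xB4, 0xB8, 
0xBC, 0xBD, 0xBE), with 0xA4 becoming the euro sign U+20AC.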



> [...]
> 
>> CP-1252, the dominant remaining ASCII character set in use, is
>> based on Latin-1, with a few characters from Latin-9 (ISO 8859-15)
>> shoved into the places where control codes previously went.
> 
> CP-1252 is not an ASCII character set.  ASCII is a 7-bit character set.
> CP-1252 is an 8-bit character set as are the Latin-* sets.  Most 8-bit
> sets are *based on* ASCII.
> 

It is 8-bit and byte-based, and informally, I think, most extended-ASCII 
codepages were collectively known as ASCII even though only the low 7-bit 
range is ASCII proper (more for the sake of contrast with "not Unicode", 
e.g., UTF-8 / UTF-16 / UCS-2 / ...).

But, say, we know it is CP-1252 and not, say, CP-437 or KOI8-R or 
similar; which is the main relevant part.
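
To illustrate why knowing the codepage matters, here is a small sketch 
(the helper decode_all and the choice of byte 0xE9 are mine, purely for 
illustration) that decodes the same byte under the three codepages, using 
Python's built-in codecs:

```python
CODEPAGES = ("cp1252", "cp437", "koi8_r")

def decode_all(raw: bytes) -> dict:
    """Decode the same bytes under each codepage in CODEPAGES."""
    return {name: raw.decode(name) for name in CODEPAGES}

# The 7-bit ASCII range is the same in all three codepages...
ascii_result = decode_all(b"abc")

# ...but a high byte such as 0xE9 means something different in each.
result = decode_all(b"\xe9")
for name, ch in result.items():
    print(f"{name}: U+{ord(ch):04X}")
```

Under CP-1252 the byte comes out as U+00E9 (e with acute); the other two 
give unrelated characters, so guessing wrong garbles everything outside 
the ASCII range.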


In some contexts, such text may also contain ANSI escape sequences, though 
text editors generally neither interpret nor make use of ANSI escapes.

ANSI escapes could almost make sense, though, if one wanted, say, a word 
processor whose format was a direct superset of normal text files rather 
than some sort of specialized or proprietary format.