From: BGB
Newsgroups: comp.lang.c
Subject: Re: Rationale for aligning data on even bytes in a Unix shell file?
Date: Thu, 8 May 2025 18:50:33 -0500

On 5/8/2025 4:13 PM, Keith Thompson wrote:
> BGB writes:
>> On 5/8/2025 6:13 AM, Janis Papanagnou wrote:
>>> On 08.05.2025 05:30, BGB wrote:
>>>> [...]
>>>>
>>>> Though, even for the Latin alphabet, once one goes much outside of ASCII
>>>> and Latin-1, it gets messy.
>>> I noticed that in several places you were referring to
>>> Latin-1. Since
>>> decades that has been replaced by the Latin-9 (ISO 8859-15) character
>>> set[*] for practical reasons ('€' sign, for example).
>>> Why is your focus still on the old Latin-1 (ISO 8859-1) character
>>> set?
>>> Janis, just curious
>>> [*] Unless Unicode and its encodings are used.
>>>
>>
>> U+00A0..U+00FF are designated as Latin-1 in Unicode.
>
> I don't think that's accurate. Do you have a reference for that?

https://en.wikipedia.org/wiki/Latin-1_Supplement

Would seem to somewhat imply that this range of codepoints is known as Latin-1...

> It's true that those characters have the same names in Unicode
> as in Latin-1.
> Though the Wikipedia article says that the ranges
> 0x00..0x1F and 0x7F..0x9F are *undefined*. (That doesn't match my
> recollection; I thought they were defined as control characters.)
>

0000..001F, usually understood as C0 control codes.
0080..009F, usually understood as C1 control codes.

But I don't bother with the C1 control codes, as they are unused; interpreting them as aliases for the other characters that appear in 1252 is more useful, and seemingly not entirely unorthodox.

> In any case, Latin-1 and Latin-9 treat those ranges in the same way.
> Both can be seen as encodings for small subsets of Unicode.
>

Latin-9 does not exactly match up with U+00A0..U+00FF, though, whereas Latin-1 does.

> [...]
>
>> CP-1252, is the dominant remaining ASCII character set in use, is
>> based on Latin-1, with a few characters from Latin-15 shoved into the
>> places where control codes previously went.
>
> CP-1252 is not an ASCII character set. ASCII is a 7-bit character set.
> CP-1252 is an 8-bit character set as are the Latin-* sets. Most 8-bit
> sets are *based on* ASCII.
>

It is 8-bit and byte-based, and informally, I think, most extended-ASCII codepages were collectively known as ASCII even if only the low 7-bit range is ASCII proper (and I think more for the sake of contrast with "not Unicode", e.g., UTF-8 / UTF-16 / UCS-2 / ...).

But, say, we know it is CP-1252 and not CP-437 or KOI8-R or similar, which is the main relevant part.

In some contexts, the text may or may not also contain ANSI escape sequences, though generally text editors do not deal with or make use of ANSI escapes. It could almost make sense, though, if one wanted, say, a word processor whose format was a direct superset of normal text files rather than some sort of specialized or proprietary format.
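
For what it's worth, the byte-to-codepoint mappings discussed above can be sketched in C roughly as follows. This is only an illustration, not code from any particular project: the function names are made up, and mapping the five bytes that CP-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D) onto the same-valued C1 codepoints is an assumption of the sketch rather than something the codepage itself specifies.

```c
#include <stdint.h>

/* CP-1252 assignments for bytes 0x80..0x9F, where Latin-1 has C1 controls.
   Assumption of this sketch: the five slots CP-1252 leaves unassigned
   (0x81, 0x8D, 0x8F, 0x90, 0x9D) map to the C1 codepoint of the same value. */
static const uint16_t cp1252_high[32] = {
    0x20AC, 0x0081, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0x008D, 0x017D, 0x008F,
    0x0090, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0x009D, 0x017E, 0x0178
};

/* Latin-1 (ISO 8859-1): each byte is the codepoint with the same value. */
uint32_t latin1_to_unicode(unsigned char b)
{
    return b;
}

/* CP-1252: same as Latin-1, except that 0x80..0x9F go through the table. */
uint32_t cp1252_to_unicode(unsigned char b)
{
    if (b >= 0x80 && b <= 0x9F)
        return cp1252_high[b - 0x80];
    return b;
}

/* Latin-9 (ISO 8859-15): same as Latin-1 except for eight positions
   in 0xA0..0xFF, which is why it does not line up with U+00A0..U+00FF. */
uint32_t latin9_to_unicode(unsigned char b)
{
    switch (b) {
    case 0xA4: return 0x20AC; /* euro sign */
    case 0xA6: return 0x0160; /* S with caron */
    case 0xA8: return 0x0161; /* s with caron */
    case 0xB4: return 0x017D; /* Z with caron */
    case 0xB8: return 0x017E; /* z with caron */
    case 0xBC: return 0x0152; /* OE ligature */
    case 0xBD: return 0x0153; /* oe ligature */
    case 0xBE: return 0x0178; /* Y with diaeresis */
    default:   return b;
    }
}
```

With this, cp1252_to_unicode(0x80) and latin9_to_unicode(0xA4) both give U+20AC (the euro sign), while latin1_to_unicode(0xA4) stays at U+00A4. The eight codepoints that Latin-9 swaps in also all appear in the CP-1252 table above, just at different byte values.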