Path: news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: Rationale for aligning data on even bytes in a Unix shell file?
Date: Thu, 8 May 2025 15:17:30 -0500
Organization: A noiseless patient Spider
Lines: 145
Message-ID: <vvj3qe$246ff$1@dont-email.me>
References: <vuih43$2agfa$1@dont-email.me> <vuml73$1riea$1@dont-email.me>
<vun04h$2fjrn$2@raubtier-asyl.eternal-september.org>
<vun1nh$22hc5$3@dont-email.me>
<vunak2$2p980$1@raubtier-asyl.eternal-september.org>
<vunbgo$2q5u8$1@dont-email.me>
<vunbjg$2q72n$1@raubtier-asyl.eternal-september.org>
<vund1f$2rh3j$1@dont-email.me>
<vungko$2uoa2$1@raubtier-asyl.eternal-september.org>
<X9MPP.1383458$f81.819466@fx48.iad>
<vuobri$3o38b$1@raubtier-asyl.eternal-september.org>
<XtOPP.2986761$t84d.2537581@fx11.iad>
<vuohq9$3tlhf$1@raubtier-asyl.eternal-september.org>
<vuoig5$3ub4j$1@dont-email.me>
<vuorpf$6tnn$1@raubtier-asyl.eternal-september.org>
<vup2nt$bi1k$2@dont-email.me>
<vupofl$13pg2$2@raubtier-asyl.eternal-september.org>
<vuprce$15sqo$2@dont-email.me>
<vvd6n5$353gs$1@raubtier-asyl.eternal-september.org>
<vvfbnj$ulpc$1@dont-email.me> <vvflec$11b72$1@dont-email.me>
<20250507202430.00005bb9@yahoo.com> <vvh8qg$1ha26$2@dont-email.me>
<vvi3k6$1o09d$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 08 May 2025 22:22:39 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="2ba0e0bff72c1f798f59c15520abdc28";
logging-data="2234863"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/rvYrwDlJZG3kBtWE+d5j0PMjhX487JT4="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:oicj0Xxgacl+v23iPKcuAjv/2cQ=
In-Reply-To: <vvi3k6$1o09d$1@dont-email.me>
Content-Language: en-US
On 5/8/2025 6:13 AM, Janis Papanagnou wrote:
> On 08.05.2025 05:30, BGB wrote:
>> [...]
>>
>> Though, even for the Latin alphabet, once one goes much outside of ASCII
>> and Latin-1, it gets messy.
>
> I noticed that in several places you were referring to Latin-1. Since
> decades that has been replaced by the Latin-9 (ISO 8859-15) character
> set[*] for practical reasons ('€' sign, for example).
>
> Why is your focus still on the old Latin-1 (ISO 8859-1) character set?
>
> Janis, just curious
>
> [*] Unless Unicode and its encodings are used.
>
U+00A0..U+00FF are designated as the (printable part of the) Latin-1 range in Unicode.
There are further Latin blocks in Unicode, but the characters are more
haphazard, so any rules defined are more likely to operate one character
at a time rather than moving whole blocks of characters (as in ASCII and
the Latin-1 range).
CP-1252, which is the dominant remaining extended-ASCII character set in use, is based on Latin-1, with a few characters (including some from Latin-9 / 8859-15) shoved into the places where the C1 control codes previously went.
Say, the euro sign (U+20AC) located at 80, Y with diaeresis (U+0178) located at 9F, ...
Apparently, in some online stats, only 0.02% of webpages use 8859-15 (vs
1.1% for 8859-1).
In my project, as noted, the Unicode mapping was tweaked in that 0080..009F are understood as the 1252 mappings, effectively leaving the C1 control codes as N/E (not encodable); but the C1 control codes are pretty much unused in practice.
And, of the C0 control codes, only a subset can be considered "actually" used:
\0, \a, \b, \t, \n, \r, \e (known used, also have escape notations)
\v, \f (have C escapes, pretty much never encountered though).
In text files, it is usually reduced to:
\t, \r, \n
In this case, it means that the conversion between UTF-8 and 1252 is
fairly straightforward.
1252 -> UTF-8, simply remap anything in 80..FF into a 2-byte encoding.
UTF-8 -> 1252, remap 0000..00FF to bytes;
Potentially detect/reject if characters outside the range are used;
Some canonical Unicode characters mapped to 1252 range if possible.
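The two conversion directions above can be sketched roughly as follows (a minimal sketch, assuming the byte<->codepoint identity scheme described, where bytes 80..FF correspond directly to U+0080..U+00FF; the optional step of folding canonical Unicode characters like U+20AC back into the 1252 range is omitted, and the function names are hypothetical):

```c
#include <stddef.h>

/* 1252 -> UTF-8: 00..7F pass through, 80..FF become a 2-byte encoding.
 * Returns the number of bytes written to 'dst' (caller ensures space). */
size_t conv_1252_to_utf8(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t j = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned c = src[i];
        if (c < 0x80) {
            dst[j++] = (unsigned char)c;
        } else {
            dst[j++] = (unsigned char)(0xC0 | (c >> 6));    /* C2 or C3 */
            dst[j++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return j;
}

/* UTF-8 -> 1252: remap U+0000..U+00FF to bytes, reject anything outside
 * that range. Returns bytes written, or (size_t)-1 on rejection. */
size_t conv_utf8_to_1252(const unsigned char *src, size_t n, unsigned char *dst)
{
    size_t j = 0;
    for (size_t i = 0; i < n; ) {
        unsigned c = src[i];
        if (c < 0x80) {
            dst[j++] = (unsigned char)c; i++;
        } else if ((c == 0xC2 || c == 0xC3) && (i + 1) < n &&
                   (src[i+1] & 0xC0) == 0x80) {
            dst[j++] = (unsigned char)(((c & 0x03) << 6) | (src[i+1] & 0x3F));
            i += 2;
        } else {
            return (size_t)-1;  /* character outside 0000..00FF */
        }
    }
    return j;
}
```

Since only lead bytes C2/C3 can encode U+0080..U+00FF, the reject check stays trivial.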
Can further note:
00..FF:
Can also be represented in 6x8 font cells
Experimental GUI uses 6x8 for the console.
So, needs 480x200 pixels.
In addition to the 8x8 cells.
80..FF needed some twiddling in some cases to fit in 8x8.
Would have been easier with 8x12 cells or similar, but...
The 2-digit hexadecimal can be represented effectively in 8x8, but not
so well in 6x8, as we generally need 4x5 pixels for each hex digit. At
6x8, one has to leave out the space pixels, so the digits collide,
negatively affecting readability.
It is "mostly" possible to represent the ASCII range in 3x5 pixels
(padded to 4x6 or 4x8), though some characters need to get "creative"
and legibility is poor.
So, say, can't effectively do an 80x25 console at 320x200 pixels (and
still have passable legibility), but 40x25 and 52x25 are possible.
For variable-size text rendering, was mostly using SDF's (signed
distance fields).
Can cover most of the Unicode range by having converted Unifont into SDF
cells (via an offline tool), but most of Unifont does not render
effectively at 8x8 or similar.
For best results at smaller sizes, for the 00..FF range, mostly still
using an SDF derived from my 8x8 font (which was generally a bit more
"robust" at the typical text sizes).
Had experimented with geometric "true type" style text rendering (1),
but drawing directly as small glyphs did not work effectively. Had
gotten best results by first drawing the glyphs at an "impractically
large" size (say, 64x64 pixels) and then using this to generate an SDF
image (usually represented as 16x16 pixels), then using the SDF to
generate other text sizes.
In this case, bitmap glyphs are used for the actual rendering, but the
SDF can be used to generate various size bitmap glyphs. Various stages
of caching are used here.
Rendering large text using SDF's is liable to look wonky, but large text
rendering is rare.
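The SDF-to-bitmap step can be sketched like so (a rough sketch under assumptions: one scalar 8-bit distance value per texel with 128 as the edge threshold, rather than the split 4b X / 4b Y form mentioned later; the post doesn't give the exact encoding, and the function names are hypothetical):

```c
#define SDF_W 16
#define SDF_H 16

/* Bilinear sample of the 16x16 SDF cell at normalized (u,v) in [0,1). */
static float sdf_sample(const unsigned char sdf[SDF_H][SDF_W], float u, float v)
{
    float x = u * (SDF_W - 1), y = v * (SDF_H - 1);
    int x0 = (int)x, y0 = (int)y;
    int x1 = (x0 + 1 < SDF_W) ? x0 + 1 : x0;
    int y1 = (y0 + 1 < SDF_H) ? y0 + 1 : y0;
    float fx = x - x0, fy = y - y0;
    float a = sdf[y0][x0] + (sdf[y0][x1] - sdf[y0][x0]) * fx;
    float b = sdf[y1][x0] + (sdf[y1][x1] - sdf[y1][x0]) * fx;
    return a + (b - a) * fy;
}

/* Generate a w*h glyph bitmap (one byte per pixel for simplicity) by
 * thresholding the interpolated distance field; the same SDF cell can
 * feed any target size. */
void sdf_to_bitmap(const unsigned char sdf[SDF_H][SDF_W],
                   unsigned char *out, int w, int h)
{
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++)
            out[y*w + x] =
                sdf_sample(sdf, (x + 0.5f) / w, (y + 0.5f) / h) >= 128.0f;
}
```

The interpolation is what lets a 16x16 cell regenerate glyphs at sizes the source bitmap never had; it is also why very large output looks wonky, since the threshold rounds off fine corner detail.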
1: Although TrueType style font rendering originated in the 1980s, not
sure how it would have been practical with 1980s level technology (say,
machines with 1MHz CPUs and kB of RAM).
Strategies I had found were either computationally intensive or required
first drawing the glyph at a large size (and then down-sampling in some
way to get to the target size). Bitmap fonts would presumably have been
the more practical option.
One downside of SDF's is that they are comparably bulky in terms of
memory use, generally requiring around 8 bits per pixel (4b X, 4b Y).
So, representing the entire Unicode BMP in uncompressed SDF form would
need roughly 16MB. For the font, generally the images are stored in
compressed (2) form (with each "plane" of 16x16 glyphs being
decompressed as needed).
Then say, one has a cache of several planes (each needing 64K), noting
that typically text rendering doesn't chaotically jump between planes.
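A cache along those lines might look like this (a hypothetical sketch, not the actual code: a few 64K slots holding decompressed 256x256 planes, evicted round-robin; the decompressor is stubbed out, and all names are assumptions):

```c
#include <string.h>

#define PLANE_BYTES (256 * 256)  /* one plane of 16x16 glyphs, 8bpp */
#define N_SLOTS 4                /* "cache of several planes" */

static unsigned char slot_data[N_SLOTS][PLANE_BYTES];
static int slot_plane[N_SLOTS] = { -1, -1, -1, -1 };
static int next_evict;

/* Stand-in for the real LZ decompressor (stub fills with a marker). */
static void decompress_plane(int plane, unsigned char *dst)
{
    memset(dst, (unsigned char)plane, PLANE_BYTES);
}

/* Return the decompressed plane, reusing a cached copy if present;
 * otherwise evict the next slot round-robin and decompress into it. */
unsigned char *get_plane(int plane)
{
    for (int i = 0; i < N_SLOTS; i++)
        if (slot_plane[i] == plane)
            return slot_data[i];
    int i = next_evict;
    next_evict = (next_evict + 1) % N_SLOTS;
    slot_plane[i] = plane;
    decompress_plane(plane, slot_data[i]);
    return slot_data[i];
}
```

Round-robin is adequate here precisely because, as noted, text rendering rarely jumps chaotically between planes.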
Though, can note it isn't "that" much worse than using a binary
conversion of Unifont (in 1bpp form), which, even stored at 1bpp, is
still around 1MB.
2: Decided to keep it shorter. Thus far, images are 256x256 and naively
compressed with a byte-oriented (no entropy coder) LZ77 variant. More
effective would be something with a pixel predictor and entropy coder
(say, like PNG), but PNG decoding is too expensive. Budget option might
be to simply subtract all the bytes from the previous bytes, and use an
AdRice+STF+LZ77 style compressor (arguably "not as good" in terms of
compression, but lower overheads vs something like Deflate).
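The "subtract from the previous byte" filter mentioned as the budget option is simple enough to show directly (a sketch of just the delta step; the AdRice+STF+LZ77 back end itself is not shown, and the function names are hypothetical):

```c
#include <stddef.h>

/* Forward filter: replace each byte with its difference from the
 * previous byte (mod 256), so smooth SDF gradients become runs of
 * small values that a byte-oriented LZ stage handles better. */
void delta_filter(unsigned char *buf, size_t n)
{
    unsigned char prev = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned char cur = buf[i];
        buf[i] = (unsigned char)(cur - prev);
        prev = cur;
    }
}

/* Inverse filter: a running sum restores the original bytes exactly. */
void delta_unfilter(unsigned char *buf, size_t n)
{
    unsigned char prev = 0;
    for (size_t i = 0; i < n; i++) {
        prev = (unsigned char)(prev + buf[i]);
        buf[i] = prev;
    }
}
```

This is essentially a degenerate one-tap version of PNG's "Sub" predictor, without the per-row filter selection or the entropy coder.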
>> [...]
>