From: Paul <nospam@needed.invalid>
Newsgroups: comp.lang.c
Subject: Re: ASCII to ASCII compression.
Date: Fri, 7 Jun 2024 11:22:12 -0400
Message-ID: <v3v8j5$249sg$1@dont-email.me>
References: <v3snu1$1io29$2@dont-email.me> <v3t2bn$1ksfn$1@dont-email.me>
 <v3t9hf$1m1oh$1@dont-email.me> <v3tate$1m83t$1@dont-email.me>

On 6/6/2024 5:49 PM, bart wrote:
> On 06/06/2024 22:26, Malcolm McLean wrote:
>> On 06/06/2024 20:23, Paul wrote:
>>> On 6/6/2024 12:25 PM, Malcolm McLean wrote:
>>>>
>>>> Not strictly a C programming question, but smart people will see
>>>> the relevance to the topic, which is portability.
>>>>
>>>> Is there a compression algorithm which converts human-language
>>>> ASCII text to compressed ASCII, preferably only "isgraph"
>>>> characters?
>>>>
>>>> So "Mary had a little lamb, its fleece was white as snow".
>>>>
>>>> Would become
>>>>
>>>> QWE£$543GtT£$"||x|VVBB?
>>>>
>>>
>>> The purpose of doing this is to satisfy transmission through a
>>> 7-bit channel. In the history of networking, not all channels were
>>> eight-bit transparent. (On the equipment in question, this was
>>> called "robbed-bit signaling".) For example, BASE64 is valued for
>>> its 7-bit channel properties, the ability to pass through a pipe
>>> which is not 8-bit transparent. Even to this day, your email
>>> attachments may traverse the network in BASE64 format.
>>>
>>> That is one reason that email and USENET clients, to this day,
>>> have both 7-bit and 8-bit content encoding methods. It's to handle
>>> the unlikely possibility that 7-bit transmission channels still
>>> exist. They likely do exist.
>>>
>> Yes. If you store data as 8-bit binaries then it's inherently
>> risky. There's usually no recovery from a single bit getting
>> corrupted.
>>
>> Whilst if you store as ASCII, the data can usually be recovered
>> very easily if something goes wrong with the physical storage.
>> "And God said" becomes "And G$d said", and even with this tiny
>> text, you can still read it perfectly well.
>
> But you are suggesting storing the compressed data as meaningless
> ASCII such as:
>
> QWE£$543GtT£$"||x|VVBB?
>
> If one bit gets flipped, then it will just be slightly different
> meaningless ASCII; there's no way to detect it except checksums,
> CRCs and the like.
>
> In any case, the error detection won't be done by a human, but by a
> machine.
>
> Possibly a human might detect, when back in plain text, that 'Mary
> hid a little lamb' should have been 'had'; but now this is getting
> silly, needing to rely on knowledge of nursery rhymes.
>
> Trillions of bytes of binary data must be transmitted every day
> (perhaps every minute; I've no idea); how often have you
> encountered a transmission error?
>
> Compression schemes tend to have error detection built in; I'm sure
> comms do as well, as do storage device controllers and drivers.
> People have this sort of thing in hand already!

ZIP (of WinZIP fame) has a CRC computed per file. The decompression
step will tell you if a file is corrupted. The column of CRC values is
shown in some of the unpacking software (and if you run a CRC check
separately on the file at a later date, you can compare).

[Picture] https://i.postimg.cc/DwQgPQP3/ZIP-CRC-field.gif
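To make the CRC idea concrete, here is a minimal sketch of CRC-32 in
C, using the reflected polynomial 0xEDB88320 that ZIP's per-file
checksum is built on. It is the bit-at-a-time form, short but slow;
the function name and the test string are invented for the demo, and
real implementations use a 256-entry lookup table instead:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Bit-at-a-time CRC-32, reflected polynomial 0xEDB88320
       (the same checksum ZIP stores for each file). */
    uint32_t crc32(const uint8_t *data, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int bit = 0; bit < 8; bit++) {
                if (crc & 1u)
                    crc = (crc >> 1) ^ 0xEDB88320u;
                else
                    crc >>= 1;
            }
        }
        return crc ^ 0xFFFFFFFFu;   /* final inversion */
    }

    int main(void)
    {
        const char *s = "Mary had a little lamb";
        printf("CRC-32 = %08lX\n",
               (unsigned long)crc32((const uint8_t *)s, strlen(s)));
        return 0;
    }

Compute that over a file before and after transfer and compare: a
mismatch proves corruption, but it says nothing about where the
damage is, and it cannot fix it.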
True repair capability requires a better code. The Reed-Solomon code
David Brown mentions is an example of such a code. A three-dimensional
version on CDs makes the CD very resistant to errors. By the time the
Reed-Solomon code cannot repair a CD, the surface is so bad that the
laser can no longer lock to the groove. Rather than Reed-Solomon
complaining that it cannot correct the data, it is the optical drive
reporting that it cannot find the groove with the laser.

Storage media also have repair capability. A typical SSD (NAND flash
storage device) has 10% overhead for corrections: a 512-byte sector
has an extra 51 bytes set aside for error correction. When your SSD
slows down from 530 MB/sec to 300 MB/sec, that means every sector
being read has errors and is being corrected by a processor inside
the drive. This is a "normal" state of affairs for TLC- or QLC-based
drives. Some 2.5" flash devices have a three-core ARM processor, and
at least one of the cores does error correction.

But with an archival format aimed at extreme compression, finding
that someone had "wasted" an extra 10% on error correction capability
would of course annoy a user expecting the extreme compression to
save them storage money.

When selecting a scheme, you have to decide what kind of error you
are protecting against. For example, on hard drives, someone
postulated they were protecting against a single error burst per
sector (independent, not correlated with errors in other sectors).
The Fire codes (polynomial burst-error codes) were the result. There
is some small probability of multiple bursts (perhaps an
error-multiplication effect in the DSP-based data recovery on read),
but at the time, no one considered that a heavier-weight method was
necessary.

When you expect to be losing whole sectors, whole files, or whole
pieces of media, there are PAR codes for that. But those were
determined not to be mathematically sound, so serious archival use
might not rely on them. The idea would be, if an archive spanned ten
CDs, you would burn one or two more CDs (generated by PAR), and if
any of the twelve CDs total went bad, PAR could regenerate the
missing information. Of the twelve CDs, any two could go missing and
then be regenerated.

A simpler scheme to understand is to burn duplicate CD copies of the
same information. If you lose a CD, or if the media surface degrades
completely, you have the second CD. That does not involve any complex
PAR method, and it's easier for the human to understand :-)
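Since PAR-style regeneration can look like magic, here is a toy
sketch of the simplest possible case, one recovery block, which
reduces to plain XOR. The block count, sizes, and contents are
invented for the demo; real PAR uses Reed-Solomon-style arithmetic
precisely so that more than one missing block can be rebuilt:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define NBLOCKS 4   /* data blocks (think: data CDs) */
    #define BLKSIZE 8   /* bytes per block, toy-sized    */

    /* dst ^= src, byte by byte */
    static void xor_into(uint8_t *dst, const uint8_t *src)
    {
        for (int i = 0; i < BLKSIZE; i++)
            dst[i] ^= src[i];
    }

    int main(void)
    {
        /* Each string is exactly BLKSIZE characters; the NUL
           terminator doesn't fit and is dropped, which C allows. */
        uint8_t data[NBLOCKS][BLKSIZE] = {
            "Mary had", " a littl", "e lamb, ", "fleece  "
        };
        uint8_t parity[BLKSIZE] = {0};
        uint8_t repair[BLKSIZE];

        for (int b = 0; b < NBLOCKS; b++)   /* parity = XOR of all */
            xor_into(parity, data[b]);

        memset(data[2], 0, BLKSIZE);        /* "lose" block 2 */

        /* missing block = parity XOR all surviving blocks */
        memcpy(repair, parity, BLKSIZE);
        for (int b = 0; b < NBLOCKS; b++)
            if (b != 2)
                xor_into(repair, data[b]);

        printf("recovered: \"%.8s\"\n", (char *)repair);
        return 0;                           /* prints "e lamb, " */
    }

The parity block is the XOR of all the data blocks, so any one
missing block is the XOR of the parity block and the survivors. Lose
two blocks and plain XOR is defeated; that is the gap the extra PAR
recovery volumes (and their heavier math) are meant to cover.

Paul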