From: Paul <nospam@needed.invalid>
Newsgroups: comp.lang.c
Subject: Re: ASCII to ASCII compression.
Date: Fri, 7 Jun 2024 11:22:12 -0400
Message-ID: <v3v8j5$249sg$1@dont-email.me>
References: <v3snu1$1io29$2@dont-email.me> <v3t2bn$1ksfn$1@dont-email.me>
 <v3t9hf$1m1oh$1@dont-email.me> <v3tate$1m83t$1@dont-email.me>

On 6/6/2024 5:49 PM, bart wrote:
> On 06/06/2024 22:26, Malcolm McLean wrote:
>> On 06/06/2024 20:23, Paul wrote:
>>> On 6/6/2024 12:25 PM, Malcolm McLean wrote:
>>>>
>>>> Not strictly a C programming question, but smart people will see
>>>> the relevance to the topic, which is portability.
>>>>
>>>> Is there a compression algorithm which converts human-language
>>>> ASCII text to compressed ASCII, preferably only "isgraph"
>>>> characters?
>>>>
>>>> So "Mary had a little lamb, its fleece was white as snow".
>>>>
>>>> Would become
>>>>
>>>> QWE£$543GtT£$"||x|VVBB?
>>>>
>>>
>>> The purpose of doing this is to satisfy transmission through a
>>> 7-bit channel. In the history of networking, not all channels were
>>> eight-bit transparent. (On the equipment in question, this was
>>> called "robbed-bit signaling".) For example, BASE64 is valued for
>>> its 7-bit channel properties, the ability to pass through a pipe
>>> which is not 8-bit transparent. Even to this day, your email
>>> attachments may traverse the network in BASE64 format.
>>>
>>> That is one reason that email and USENET clients, to this day,
>>> have both 7-bit and 8-bit content encoding methods. It's to handle
>>> the unlikely possibility that 7-bit transmission channels still
>>> exist. They likely do exist.
>>>
>> Yes. If you store data as 8-bit binaries then it's inherently
>> risky. There's usually no recovery from a single bit getting
>> corrupted.
>>
>> Whilst if you store as ASCII, the data can usually be recovered
>> very easily if something goes wrong with the physical storage.
>> "And God said" becomes "And G$d said", and even with this tiny
>> text, you can still read it perfectly well.
>
> But you are suggesting storing the compressed data as meaningless
> ASCII such as:
>
> QWE£$543GtT£$"||x|VVBB?
>
> If one bit gets flipped, then it will just be slightly different
> meaningless ASCII; there's no way to detect it except checksums,
> CRCs and the like.
>
> In any case, the error detection won't be done by a human, but by a
> machine.
>
> Possibly a human might detect, when back in plain text, that 'Mary
> hid a little lamb' should have been 'had'; but now this is getting
> silly, needing to rely on knowledge of nursery rhymes.
>
> Trillions of bytes of binary data must be transmitted every day
> (perhaps every minute; I've no idea); how often have you
> encountered a transmission error?
>
> Compression schemes tend to have error detection built in; I'm sure
> comms do as well, as do storage device controllers and drivers.
> People have this sort of thing in hand already!

ZIP (of WinZIP fame) has a CRC computed per file. The decompression
step will tell you if a file is corrupted. The column of CRC values is
shown in some of the unpacking software (and if you run a CRC check
separately on the file at a later date, you can compare).

[Picture] https://i.postimg.cc/DwQgPQP3/ZIP-CRC-field.gif
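To make the CRC idea concrete, here is a minimal sketch of CRC-32 in
C, using the reflected polynomial 0xEDB88320 that ZIP's per-file
checksum is built on. It is the bit-at-a-time form, short but slow;
the function name and the test string are invented for the demo, and
real implementations use a 256-entry lookup table instead:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Bit-at-a-time CRC-32, reflected polynomial 0xEDB88320
       (the same checksum ZIP stores for each file). */
    uint32_t crc32(const uint8_t *data, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;
        for (size_t i = 0; i < len; i++) {
            crc ^= data[i];
            for (int bit = 0; bit < 8; bit++) {
                if (crc & 1u)
                    crc = (crc >> 1) ^ 0xEDB88320u;
                else
                    crc >>= 1;
            }
        }
        return crc ^ 0xFFFFFFFFu;   /* final inversion */
    }

    int main(void)
    {
        const char *s = "Mary had a little lamb";
        printf("CRC-32 = %08lX\n",
               (unsigned long)crc32((const uint8_t *)s, strlen(s)));
        return 0;
    }

Compute that over a file before and after transfer and compare: a
mismatch proves corruption, but it says nothing about where the
damage is, and it cannot fix it.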
True repair capability requires a better code. The Reed-Solomon code
David Brown mentions is an example of such a code. A three-dimensional
version on CDs makes the CD very resistant to errors. By the time the
Reed-Solomon code cannot repair a CD, the surface is so bad that the
laser can no longer lock to the groove. Rather than Reed-Solomon
complaining that it cannot correct the data, it is the optical drive
reporting that it cannot find the groove with the laser.

Storage media also have repair capability. A typical SSD (NAND flash
storage device) has 10% overhead for corrections: a 512-byte sector
has an extra 51 bytes set aside for error correction. When your SSD
slows down from 530 MB/sec to 300 MB/sec, that means every sector
being read has errors and is being corrected by a processor inside
the drive. This is a "normal" state of affairs for TLC- or QLC-based
drives. Some 2.5" flash devices have a three-core ARM processor, and
at least one of the cores does error correction.

But with an archival format aimed at extreme compression, finding
that someone had "wasted" an extra 10% on error correction capability
would of course annoy a user expecting the extreme compression to
save them storage money.

When selecting a scheme, you have to decide what kind of error you
are protecting against. For example, on hard drives, someone
postulated they were protecting against a single error burst per
sector (independent, not correlated with errors in other sectors).
The Fire codes (polynomial burst-error codes) were the result. There
is some small probability of multiple bursts (perhaps an
error-multiplication effect in the DSP-based data recovery on read),
but at the time, no one considered that a heavier-weight method was
necessary.

When you expect to be losing whole sectors, whole files, or whole
pieces of media, there are PAR codes for that. But those were
determined not to be mathematically sound, so serious archival use
might not rely on them. The idea would be, if an archive spanned ten
CDs, you would burn one or two more CDs (generated by PAR), and if
any of the twelve CDs total went bad, PAR could regenerate the
missing information. Of the twelve CDs, any two could go missing and
then be regenerated.

A simpler scheme to understand is to burn duplicate CD copies of the
same information. If you lose a CD, or if the media surface degrades
completely, you have the second CD. That does not involve any complex
PAR method, and it's easier for the human to understand :-)
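Since PAR-style regeneration can look like magic, here is a toy
sketch of the simplest possible case, one recovery block, which
reduces to plain XOR. The block count, sizes, and contents are
invented for the demo; real PAR uses Reed-Solomon-style arithmetic
precisely so that more than one missing block can be rebuilt:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define NBLOCKS 4   /* data blocks (think: data CDs) */
    #define BLKSIZE 8   /* bytes per block, toy-sized    */

    /* dst ^= src, byte by byte */
    static void xor_into(uint8_t *dst, const uint8_t *src)
    {
        for (int i = 0; i < BLKSIZE; i++)
            dst[i] ^= src[i];
    }

    int main(void)
    {
        /* Each string is exactly BLKSIZE characters; the NUL
           terminator doesn't fit and is dropped, which C allows. */
        uint8_t data[NBLOCKS][BLKSIZE] = {
            "Mary had", " a littl", "e lamb, ", "fleece  "
        };
        uint8_t parity[BLKSIZE] = {0};
        uint8_t repair[BLKSIZE];

        for (int b = 0; b < NBLOCKS; b++)   /* parity = XOR of all */
            xor_into(parity, data[b]);

        memset(data[2], 0, BLKSIZE);        /* "lose" block 2 */

        /* missing block = parity XOR all surviving blocks */
        memcpy(repair, parity, BLKSIZE);
        for (int b = 0; b < NBLOCKS; b++)
            if (b != 2)
                xor_into(repair, data[b]);

        printf("recovered: \"%.8s\"\n", (char *)repair);
        return 0;                           /* prints "e lamb, " */
    }

The parity block is the XOR of all the data blocks, so any one
missing block is the XOR of the parity block and the survivors. Lose
two blocks and plain XOR is defeated; that is the gap the extra PAR
recovery volumes (and their heavier math) are meant to cover.

Paul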