Article <v47r29$kqit$1@dont-email.me>

Path: ...!feed.opticnetworks.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB-Alt <bohannonindustriesllc@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: ASCII to ASCII compression.
Date: Mon, 10 Jun 2024 16:26:32 -0500
Organization: A noiseless patient Spider
Lines: 153
Message-ID: <v47r29$kqit$1@dont-email.me>
References: <v3snu1$1io29$2@dont-email.me> <v3spmv$1jbjq$1@dont-email.me>
 <v3t150$1kia9$1@dont-email.me> <v3ukbb$20s0s$3@dont-email.me>
 <v3uva6$22nnp$1@dont-email.me> <v3vs75$27u7g$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 10 Jun 2024 23:26:34 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="ce69f12dfffc4a38b1540a2347f3d48b";
	logging-data="682589"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18XdcZUxXlsAfTc7bXNkIaPogrsS8SeJaY="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:S2u+dTCDKi2heLMcgu4vSOa7+NM=
In-Reply-To: <v3vs75$27u7g$1@dont-email.me>
Content-Language: en-US
Bytes: 8375

On 6/7/2024 3:57 PM, Paul wrote:
> On 6/7/2024 8:43 AM, Malcolm McLean wrote:
>> On 07/06/2024 10:36, David Brown wrote:
>>> On 06/06/2024 21:02, Malcolm McLean wrote:
>>>> On 06/06/2024 17:55, bart wrote:
>>>>> On 06/06/2024 17:25, Malcolm McLean wrote:
>>>>>>
>>>>>> Not strictly a C programming question, but smart people will see the relavance to the topicality, which is portability.
>>>>>>
>>>>>> Is there a compresiion algorthim which converts human language ASCII text to compressed ASCII, preferably only "isgraph" characters?
>>>>>>
>>>>>> So "Mary had a little lamb, its fleece was white as snow".
>>>>>>
>>>>>> Would become
>>>>>>
>>>>>> QWE£$543GtT£$"||x|VVBB?
>>>>>
>>>>> What's the problem with compressing to binary (using existing, efficient utilities), then turning that binary into ASCII (like Mime or Base64)?
>>>>>
>>>> Because if a single bit flips in a zip archive, it's likely the entire archive will be lost. This scheme is robust. We can emed compressed text in programs, and if it is corruped, only a single line will become unreadable.
>>>
>>> Ah, you want something that will work like your newsreader program that randomly changes letters or otherwise corrupts your spelling while leaving most of it readable?  :-)
>>>
>>> Pass the data through a compressor and then add forward error checking mechanisms such as Reed-Solomon codes.  Then convert to ASCII base64 or similar.
>>>
>> Yes, exactly.
>>
>> I want a system for compression which is robust to corruption, can be stored as text, and with a compressor / decompressor which can be written by a child hobby programmer with only a very little bit of experience of programming.
>>
>> That's what I need for Baby X. The FileSystem XML files can get very large, and of course Baby X programmers are going to ask about compression. And I don't think there is an existing system, and so I shall devise one.
>>
> 
> "XML Compression"
> 
> https://link.springer.com/referenceworkentry/10.1007/978-1-4899-7993-3_783-2
> 
>     "The size increase incurred by publishing data in XML format is
>      estimated to be as much as 400 % [14], making it a prime target for compression.
> 
>      While standard general-purpose compressors, such as
>      zip, gzip or bzip, typically compress XML data reasonably well...
>     "
> 
> Show us a "dir" or an "ls -al" so we can better understand
> the magnitude of what you're working on.
> 
> Lots of things have used ZIP, implicitly or explicitly, mainly
> because it is a kind of standard and does not form a barrier to access.
> 
> In addition, if a structure is voluminous (a thousand control files
> representing one project), users appreciate having them stored in
> a container, rather than filling the file system with fluff. A ZIP
> can do that too. And if the ZIP has a convenient library you can
> get from FOSS-land, that could save time on building a standards
> based container.
> 

One downside of ZIP is that it is a moderately heavyweight format to 
work with.

For some of my own uses, I had created "WAD2A" and "WAD4" formats which 
can address similar use cases, but without some of the implied overhead 
of processing the ZIP central directory.

WAD2A is a tweaked version of the WAD2 format (from Quake and Half-Life) 
which adds support for directory trees, and actually uses the data 
compression parts. Downside of WAD2A is that non-root lump names are 
effectively limited to 12 characters (vs the 16-char name limit for root 
lumps).

The WAD4 format was similar, but expanded the dirent size, and had 
32-character lump names, also organized into a directory tree.


Also generally, I had used LZ4 and my own RP2 compression, rather than 
Deflate, because Deflate is also fairly expensive (particularly on a 
50MHz CPU); mostly due to the relatively high cost of setting up Huffman 
tables, and also decoding data with them.

RP2 is also a byte-oriented LZ compressor (like LZ4), but generally gets 
slightly better compression (for general-purpose data) at a similar 
decode speed. However, I have noted that LZ4 does better for some other 
types of data, such as machine code, so I ended up mostly using LZ4 for 
compressing things like program binaries.

Curiously, LZ4 seems to do better with both my own ISA and with RISC-V, 
so there is something in the typical compiler output that favors LZ4.


I had also implemented a few simpler Huffman-based formats, but they 
can't really get up to similar speeds.


Had also come up with a sort of "pseudo entropic" encoding, which still 
managed to gain some compression in past tests, while being faster than 
an "actual" entropy coding scheme and remaining byte-oriented.

IIRC:
   Rank symbols based on probability, encode as indices into table.
   00..7F: Encode a single symbol (index 0..127)
   80..F8: Encode a symbol pair (11*11 = 121 codes, each index 0..10)
   FF: Escape code, followed by a raw symbol (byte)
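
As a rough sketch (not the original code), a decoder for the byte layout listed above might look like the following in C. Several details are assumptions made only for illustration: the function name pe_decode, the rank[] table mapping a rank index to a byte value, the reading of 80..F8 as 121 = 11*11 pair codes packed as first*11 + second, and the treatment of F9..FE as invalid.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder for the byte-oriented scheme sketched above.
 * rank[i] is the byte value of the i-th most probable symbol.
 * Returns the number of bytes written, or (size_t)-1 on malformed
 * input or insufficient output space. */
size_t pe_decode(const uint8_t *src, size_t src_len,
                 uint8_t *dst, size_t dst_cap,
                 const uint8_t rank[256])
{
    size_t si = 0, di = 0;

    while (si < src_len) {
        uint8_t c = src[si++];

        if (c < 0x80) {
            /* 00..7F: one symbol, ranks 0..127 */
            if (di >= dst_cap) return (size_t)-1;
            dst[di++] = rank[c];
        } else if (c <= 0xF8) {
            /* 80..F8: 121 codes = 11 * 11, two symbols from ranks 0..10.
             * Assumed packing: first * 11 + second. */
            unsigned p = c - 0x80u;
            if (dst_cap - di < 2) return (size_t)-1;
            dst[di++] = rank[p / 11];
            dst[di++] = rank[p % 11];
        } else if (c == 0xFF) {
            /* FF: escape, the next byte is copied through unchanged */
            if (si >= src_len || di >= dst_cap) return (size_t)-1;
            dst[di++] = src[si++];
        } else {
            /* F9..FE: not assigned in the layout above (assumption) */
            return (size_t)-1;
        }
    }
    return di;
}
```

The p / 11 and p % 11 in the pair branch are the division-by-constant step; everything else is a table lookup per input byte.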

Had considered another possibility:
   0000..3FFF: Symbol Pair (0..127)
   4000..7D08: Symbol Triple (0..24; 25^3 = 15625 codes)
   7F00..7FFF: Single Symbol (0..255)
   8000..FFFF: Symbol Quad (0..12; 13^4 = 28561 of 32768 codes)

But, didn't get around to experimenting with this.


Downside of these schemes is that division-by-constant eats a lot of the 
potential speed gains over Huffman (it can be turned into multiply by 
reciprocal, but this is still "not very fast"; if it were doing 
general-purpose division, it would be a dead loss).
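
To make the multiply-by-reciprocal step concrete, here is a small C sketch for the 8-bit pair case, assuming a pair code in 0..120 that packs two indices as first*11 + second. The function name pair_split and the choice of a 16-bit shift are illustrative assumptions, not taken from the original code.

```c
#include <assert.h>
#include <stdint.h>

/* Split a pair code p (0..120) into first = p / 11 and second = p % 11
 * without a divide instruction.  5958 == ceil(65536 / 11), so
 * (p * 5958) >> 16 equals p / 11 over this range. */
static inline void pair_split(uint32_t p, uint32_t *first, uint32_t *second)
{
    assert(p <= 120);
    uint32_t q = (p * 5958u) >> 16;   /* quotient via reciprocal multiply */
    *first  = q;
    *second = p - q * 11u;            /* remainder via multiply and subtract */
}
```

Even in this form each pair costs two multiplies, a shift, and a subtract. Optimizing compilers typically apply the same transformation on their own when dividing by a compile-time constant, so writing it out mainly makes the cost visible.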

Similarly, for the latter form, it would be too large to use a lookup 
table (trying to do so would likely also eat most of its performance), 
though since each table lookup potentially does multiple symbols, it 
would not necessarily be slower than Huffman in this case.

Some alternate bit twiddling could be possible if one were assuming the use of 
specialized CPU helper instructions to pack/unpack the indices (doing 
tricks similar to Decimal / DPD encoding, rather than using 
multiply-by-reciprocal trickery). But, probably not worthwhile (and 
would likely make it slower for a pure software decoder, except in the 
8-bit case which could use a lookup table).


> But what's more important than any techie adventure, is not
> annoying your users. What do the users want most ? The ability
> to edit the files in question, on a moments notice ? Or would
> the files, 99.999% of the time, comfortably remain hidden from view ?
> 
> If the "blob" involved was 100GB, then yes, I'd compress it :-)
> If it is 4KB, well, those little files are a nuisance no matter
> what you do. I would leave that uncompressed, unless I could
> containerize it perhaps.
> 
> As an example, Mozilla has used .jsonlz4 as a file format solution.
> I have no idea what problem they thought they were solving,
> but I can tell you I consider the solution obnoxious and inconsiderate
> of the user. LZ4 decompressors are not a stockroom item. I had
> to write a very short program, so I could deal with that. Mozilla
> has made a perfect example of what not to do, by doing that.
> 

LZ4 is a fairly simple format though, so a person can implement it in a 
few hundred lines of C if needed.
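
For a sense of scale, below is a minimal C sketch of a decoder for a raw LZ4 block: a token byte, a literal run, a 2-byte little-endian offset, and a match of at least 4 bytes. The function name lz4_block_decode and its return convention are assumptions for illustration. It does not handle the LZ4 frame format or any application-specific file header, and it is more lenient than the reference decoder about how a block may end.

```c
#include <stddef.h>
#include <stdint.h>

/* Minimal LZ4 *block* decoder (no frame header, no checksums).
 * Returns the number of bytes written, or -1 on malformed input
 * or if dst is too small. */
long lz4_block_decode(const uint8_t *src, size_t src_len,
                      uint8_t *dst, size_t dst_cap)
{
    size_t si = 0, di = 0;

    while (si < src_len) {
        unsigned token = src[si++];

        /* Literal run: high nibble, extended by 255-valued bytes. */
        size_t lit = token >> 4;
        if (lit == 15) {
            uint8_t b;
            do {
                if (si >= src_len) return -1;
                b = src[si++];
                lit += b;
            } while (b == 255);
        }
        if (lit > src_len - si || lit > dst_cap - di) return -1;
        for (size_t i = 0; i < lit; i++) dst[di++] = src[si++];

        /* The last sequence ends after its literals. */
        if (si == src_len) break;

        /* Match: 2-byte little-endian offset back into the output. */
        if (src_len - si < 2) return -1;
        size_t off = src[si] | ((size_t)src[si + 1] << 8);
        si += 2;
        if (off == 0 || off > di) return -1;

        /* Match length: low nibble + 4, extended the same way. */
        size_t mlen = token & 15;
        if (mlen == 15) {
            uint8_t b;
            do {
                if (si >= src_len) return -1;
                b = src[si++];
                mlen += b;
            } while (b == 255);
        }
        mlen += 4;
        if (mlen > dst_cap - di) return -1;

        /* Byte-by-byte copy so overlapping matches (off < mlen) repeat. */
        for (size_t i = 0; i < mlen; i++, di++) dst[di] = dst[di - off];
    }
    return (long)di;
}
```

A file that wraps the block in its own header would need that header skipped before calling this, and the caller has to know (or bound) the decompressed size to allocate dst.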

>     Paul