From: Malcolm McLean <malcolm.arthur.mclean@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: ASCII to ASCII compression.
Date: Mon, 10 Jun 2024 21:17:30 +0100
Organization: A noiseless patient Spider
Message-ID: <v47n0q$jtir$1@dont-email.me>
References: <v3snu1$1io29$2@dont-email.me> <v45iak$3t1l5$1@dont-email.me> <v465h9$76f0$1@dont-email.me> <87tti03co9.fsf@bsb.me.uk>
In-Reply-To: <87tti03co9.fsf@bsb.me.uk>

On 10/06/2024 18:55, Ben Bacarisse wrote:
> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>
>> We have a fixed Huffman tree which is part of the algorithm and optimised
>> for ASCII. And we take each line of text, and compress it to a binary string,
>> using the Huffman table. Then we code the binary string six bits at a time
>> using a 64 character subset of ASCII. And then we append a special character
>> which is chosen to be visually distinctive.
>>
>> So the input is
>>
>> Mary had a little lamb,
>> its fleece was white as snow,
>> and everywhere that Mary went,
>> the lamb was sure to go.
>>
>> And we get the output.
>>
>> CVbGNh£-H$£*MMH&-VVdsE3w2as3-vv$G^&ggf-
>
> It would be more like
>
> pOHcDdz8v3cz5Nl7WP2gno5krTqU6g/ZynQYlawju8rxyhMT6B30nDusHrWaE+TZf1KdKmJ9Fb6orB
>
> (That's an actual example using an optimal Huffman encoding for that
> input and the conventional base 64 encoding.
> I can post the code table, if you like.)
>
>> And whether it is shorter or not depends on whether the fixed Huffman
>> table is any good.
>
> If I use a bigger corpus of English text to derive the Huffman codes,
> the encoding becomes less efficient (of course) so those 110 characters
> need more like 83 base 64 encoded bytes to represent them. Is 75% of
> the size worth it?
>
> What is the use-case where there is so much English text that a little
> compression is worthwhile?
>
The FileSystem XML files. They are uncompressed, and as you can take in
entire folders, they can be very large. But the compression is rather
disappointing.

-- 
Check out my hobby project.
http://malcolmmclean.github.io/babyxrc
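
For anyone following the arithmetic, the "six bits at a time" step discussed above might be sketched in C roughly as follows. This is only an illustration under assumptions, not the code either poster used: the Huffman output is stood in for by a string of '0' and '1' characters, the alphabet is the conventional base 64 one, a short final group is padded with zero bits, and the function names pack_bits and packed_length are made up for this sketch. The special end-of-line character from the proposed scheme is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The conventional base 64 alphabet (as in RFC 4648). */
static const char ALPHABET[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Hypothetical helper: output characters needed for nbits bits,
 * at six bits per character, rounding up. */
size_t packed_length(size_t nbits)
{
    return (nbits + 5) / 6;
}

/*
 * Hypothetical helper: pack a string of '0'/'1' characters (standing in
 * for the Huffman-coded line) into out, six bits per output character,
 * most significant bit first. A short final group is padded with zero
 * bits (an assumption of this sketch). Returns the number of characters
 * written, not counting the terminating NUL, or (size_t)-1 if out is
 * too small.
 */
size_t pack_bits(const char *bits, char *out, size_t outsize)
{
    size_t nbits = strlen(bits);
    size_t nchars = packed_length(nbits);
    size_t i, j;

    if (outsize < nchars + 1)
        return (size_t)-1;

    for (i = 0; i < nchars; i++) {
        unsigned value = 0;
        for (j = 0; j < 6; j++) {
            size_t pos = i * 6 + j;
            char c = pos < nbits ? bits[pos] : '0';
            assert(c == '0' || c == '1');   /* input must be a bit string */
            value = (value << 1) | (unsigned)(c - '0');
        }
        out[i] = ALPHABET[value];           /* value is in 0..63 */
    }
    out[nchars] = '\0';
    return nchars;
}
```

With this sketch, pack_bits("0000001", buf, sizeof buf) writes "Ag" and returns 2. On the figures quoted above: 83 output characters carry at most 83 * 6 = 498 bits, so 110 input characters coming out as 83 means roughly 4.5 bits per input character, and 83/110 is the 75% mentioned.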