From: BGB <cr88192@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: ASCII to ASCII compression.
Date: Fri, 7 Jun 2024 11:09:06 -0500
Organization: A noiseless patient Spider
Message-ID: <v3vbdc$24ouc$1@dont-email.me>
References: <v3snu1$1io29$2@dont-email.me> <874ja657s9.fsf@bsb.me.uk> <v3t1gf$1kia9$2@dont-email.me> <v3u5a3$1ul3c$1@dont-email.me> <v3ui89$20jte$1@dont-email.me> <v3uvq9$22s77$1@dont-email.me>
In-Reply-To: <v3uvq9$22s77$1@dont-email.me>
User-Agent: Mozilla Thunderbird

On 6/7/2024 7:52 AM, Mikko wrote:
> On 2024-06-07 09:00:57 +0000, Malcolm McLean said:
>
>> On 07/06/2024 06:20, Mikko wrote:
>>> On 2024-06-06 19:09:03 +0000, Malcolm McLean said:
>>>
>>>> On 06/06/2024 17:56, Ben Bacarisse wrote:
>>>>> Malcolm McLean <malcolm.arthur.mclean@gmail.com> writes:
>>>>>
>>>>>> Not strictly a C programming question, but smart people will see
>>>>>> the relevance to the topicality, which is portability.
>>>>>
>>>>> I must not be smart as I can't see any connection to the topic of
>>>>> this group!
>>>>>
>>>>>> Is there a compression algorithm which converts human-language
>>>>>> ASCII text to compressed ASCII, preferably only "isgraph"
>>>>>> characters?
>>>>>>
>>>>>> So "Mary had a little lamb, its fleece was white as snow".
>>>>>>
>>>>>> Would become
>>>>>>
>>>>>> QWE£$543GtT£$"||x|VVBB?
>>>>>
>>>>> Obviously such algorithms exist. One that is used a lot is just
>>>>> base64 encoding of binary compressed text, but that won't beat
>>>>> something specifically crafted for the task, which is presumably
>>>>> what you are asking for. I don't know of anything aimed at that
>>>>> task specifically.
>>>>>
>>>>> One thing you should specify is whether you need it to work on
>>>>> small texts, or, even better, at what sort of size you want the
>>>>> pay-off to start to kick in. For example, the xz+base64 encoding
>>>>> of the complete works of Shakespeare is still less than 40% of
>>>>> the size of the original, but your single line will end up much
>>>>> larger using that off-the-shelf scheme.
>>>>>
>>>> What I was thinking of was using Huffman codes to convert ASCII to
>>>> a string of bits.
>>>
>>> Works if one knows, at the time one makes one's compression and
>>> decompression algorithms, how often each short sequence of
>>> characters will be used in the files that will be compressed. If
>>> you have an adaptive Huffman coding (or any other adaptive coding)
>>> a single error will corrupt the rest of your line. If you reset the
>>> adaptation at the end of each line it does not adapt well and the
>>> result is not much better than without adaptation. If you reset the
>>> adaptation at the end of each page you can have better compression,
>>> but an error corrupts the rest of the page.
>>>
>>> For ordinary texts (except short ones) and many other purposes
>>> Lempel-Ziv and its variants work better than Huffman.
>>>
>> Yes, but Huffman is easy to decode. It's the sort of project you
>> give to people who have just got past the beginner stage but aren't
>> very experienced programmers yet, whilst implementing Lempel-Ziv is
>> a job for someone who knows what he is doing.
>>
>> Because the lines will often be very short, adaptive Huffman coding
>> is no good. I need a fixed Huffman table with 128 entries for each
>> 7-bit value plus one for "stop". I wonder if any such standard table
>> exists.
>
> You don't need a standard table. You need statistics. Once you have
> the statistics the table is easy to construct with Huffman's
> algorithm.
>

Some compressors have used fixed Huffman tables, such as MPEG-1 and
MPEG-2, but yeah.

One other option would be to make a table with characters sorted by
probability (most common symbols first), and then use a Rice code
(such as k=3 or k=4) to encode the index. This is arguably worse than
Huffman, but faster to initialize, so it can be preferable for small
data. (Rough sketches of these ideas follow at the end of this post.)

An adaptive variant is possible, where encoding a symbol swaps it
toward the front of the table, and the Rice factor is adapted based on
the data. This is generally faster than adaptive Huffman, and for
small data it can also be more compact than traditional Huffman, since
no Huffman table needs to be transmitted.

Though, for small data, it may make sense to "pre-warm" the table
(start from a state that guesses a plausible ranking and Rice factor),
since there might not be enough data to adapt the table from a cold
start.

One other route for string compression might be a Rice-encoded LZW
variant, where the starting symbol set is extended with combined
symbols up to a certain length (say, up to 4 bytes), possibly with
pre-warmed tables as before. This would be a poor choice for bulk data
compression, but could make sense for compressing text strings.

One could then translate the result into a modified variant of Base85
or similar, adapted to better fit the rules of C strings.
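
For what it's worth, a rough sketch of the sorted-table + Rice-code
idea. The table contents, the bit-writer, and all the names here are
invented for illustration; a real table would be built from measured
statistics:

#include <stdio.h>
#include <string.h>

/* Hypothetical ranking: symbols sorted most-common-first.
   A real table would come from statistics over sample text. */
static const char rank_tab[] =
    " etaoinshrdlcumwfgypbvkjxqz"
    "ETAOINSHRDLCUMWFGYPBVKJXQZ"
    "0123456789.,'-!?;:()";

static unsigned bit_buf;    /* pending bits, MSB first */
static int bit_cnt;

static void put_bit(int b)
{
    bit_buf = (bit_buf << 1) | (b & 1);
    if (++bit_cnt == 8) {           /* emit a whole byte */
        putchar((int)bit_buf);
        bit_buf = 0;
        bit_cnt = 0;
    }
}

static void flush_bits(void)        /* zero-pad the final byte */
{
    while (bit_cnt != 0)
        put_bit(0);
}

/* Rice code, factor k: quotient in unary, then k low bits. */
static void rice_encode(unsigned n, int k)
{
    unsigned q = n >> k;
    while (q--)
        put_bit(1);                 /* unary part */
    put_bit(0);                     /* terminator */
    while (k--)
        put_bit((n >> k) & 1);      /* remainder, MSB first */
}

static void encode_char(int c)
{
    const char *p = strchr(rank_tab, c);
    if (p != NULL)                  /* escape for unlisted chars omitted */
        rice_encode((unsigned)(p - rank_tab), 3);   /* k=3 assumed */
}

With k=3, rank 0 costs 4 bits and anything in the top 32 ranks costs
at most 7 bits, so common characters beat the flat 7 bits of ASCII
while rare ones only pay extra in the unary prefix.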
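
The adaptive variant can be sketched on top of the same rice_encode()
(names again made up). After each symbol, encoder and decoder both
swap it one step toward the front, so the ranking tracks the data
without any side table:

static unsigned char sym_tab[256];  /* current, mutable ranking */

static void sym_init(void)
{
    int i;
    for (i = 0; i < 256; i++)       /* cold start; a pre-warmed */
        sym_tab[i] = (unsigned char)i;  /* ranking could go here */
}

static void encode_adaptive(int c, int k)
{
    unsigned i = 0;
    while (sym_tab[i] != (unsigned char)c)
        i++;                        /* linear search for the rank */
    rice_encode(i, k);
    if (i > 0) {                    /* swap one step toward front */
        unsigned char t = sym_tab[i - 1];
        sym_tab[i - 1] = sym_tab[i];
        sym_tab[i] = t;
    }
}

The Rice factor could be adapted similarly, nudging k up when recent
quotients run long and down when most indices are small; the decoder
mirrors both updates, which is what keeps this cheaper than adaptive
Huffman.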
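
And, continuing the sketches above, seeding the LZW-style dictionary
with multi-byte fragments (the fragment list here is invented; a real
one would again come from statistics). Indices would be coded with the
same Rice scheme:

#define MAX_DICT 4096

struct dict_ent { char s[4]; int len; };
static struct dict_ent dict[MAX_DICT];
static int dict_n;

/* Invented seed fragments, up to 4 bytes each. */
static const char *seed_frags[] = {
    "th", "he", "in", "er", "an",
    "the ", "ing", "and ", "ion", " of "
};

static void dict_init(void)
{
    int i;
    for (i = 0; i < 256; i++) {     /* all single bytes first */
        dict[i].s[0] = (char)i;
        dict[i].len = 1;
    }
    dict_n = 256;
    for (i = 0; i < (int)(sizeof(seed_frags)/sizeof(seed_frags[0])); i++) {
        int l = (int)strlen(seed_frags[i]);
        memcpy(dict[dict_n].s, seed_frags[i], (size_t)l);
        dict[dict_n].len = l;
        dict_n++;
    }
}

New entries get appended as usual during coding; the seeding just
means short strings hit multi-byte matches immediately instead of only
after the dictionary has warmed up.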
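
Finally, the Base85-ish output stage might look like this. The
alphabet below is one arbitrary choice of 85 isgraph characters that
avoids '"', '\' and '?', so the output drops into a C string literal
without escapes or accidental trigraphs:

/* 85 printable characters, avoiding ", \ and ? (illustrative). */
static const char b85_tab[86] =
    "0123456789"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "!#$%&()*+,-;<=>^_`{|}~.";

/* 4 bytes -> 5 digits; 85^5 > 2^32, so any group fits.
   Handling of a short final group is omitted. */
static void put_group(const unsigned char *p, FILE *out)
{
    unsigned long v = ((unsigned long)p[0] << 24) |
                      ((unsigned long)p[1] << 16) |
                      ((unsigned long)p[2] <<  8) |
                       (unsigned long)p[3];
    char d[5];
    int i;
    for (i = 4; i >= 0; i--) {      /* emit digits MSB first */
        d[i] = b85_tab[v % 85];
        v /= 85;
    }
    fwrite(d, 1, 5, out);
}

That costs 5 output characters per 4 compressed bytes (25% expansion)
versus 33% for base64, on top of whatever the entropy stage saved.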