Article <uuuugs$2vq34$1@dont-email.me>

Warning: mysqli::__construct(): (HY000/1203): User howardkn already has more than 'max_user_connections' active connections in D:\Inetpub\vhosts\howardknight.net\al.howardknight.net\includes\artfuncs.php on line 21
Failed to connect to MySQL: (1203) User howardkn already has more than 'max_user_connections' active connections
Warning: mysqli::query(): Couldn't fetch mysqli in D:\Inetpub\vhosts\howardknight.net\al.howardknight.net\index.php on line 66
Article <uuuugs$2vq34$1@dont-email.me>
Deutsch English Français Italiano
<uuuugs$2vq34$1@dont-email.me>

View for Bookmarking (what is this?)
Look up another Usenet article
Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: simple compression mathod to code itself?
Date: Sun, 7 Apr 2024 15:08:57 -0500
Organization: A noiseless patient Spider
Lines: 221
Message-ID: <uuuugs$2vq34$1@dont-email.me>
References: <uupflr$5t1n$1@i2pn2.org>
 <XnsB14BC6028E8BAhueydlltampabayrrcom@135.181.20.170>
 <uurd34$8ga0$1@i2pn2.org> <uurefe$8i04$1@i2pn2.org>
 <XnsB14C7BB22DDD2hueydlltampabayrrcom@135.181.20.170>
 <mXeQN.385487$q3F7.369853@fx45.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 07 Apr 2024 20:09:01 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="9796307e8b8c0234504f926a63004b70";
	logging-data="3139684"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+VqSyNdT2ppsxXf07+b1AZ1vpCfcMacG0="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:k3kB2UI5MJijuutE0isC7T4clnw=
In-Reply-To: <mXeQN.385487$q3F7.369853@fx45.iad>
Content-Language: en-US
Bytes: 7996

On 4/6/2024 11:41 AM, Scott Lurndal wrote:
> David LaRue <huey.dll@tampabay.rr.com> writes:
>> fir <fir@grunge.pl> wrote in news:uurefe$8i04$1@i2pn2.org:
>>
>>> fir wrote:
>> <snip>
>>>
>>> if someone want to talk on compression adn how to code it i could like
>>> to read it (as reading net articles on this may be seem to hard for my
>>> old sick head and posts are much are easier to get into it)
>>
>> There are several good books on search and compression methods that provide
>> examples of complexity and discussions about the complexity and performance.
> 
> https://en.wikipedia.org/wiki/Lossless_compression
> 
> One of the earliest:
> 
> https://en.wikipedia.org/wiki/Huffman_coding
> 
> Former colleague wrote this one:
> 
> https://en.wikipedia.org/wiki/Gzip
> 
> This was once under patent to a former employer:
> 
> https://en.wikipedia.org/wiki/LZ77_and_LZ78

Also LZ4 is a moderately simple format, and works reasonably OK.

I have another (similar category but different design) format that gets 
slightly better compression to LZ4 at a similar decode speed.



I guess, if I were to try to quickly design another LZ format that was 
simple to decode, say:
   Tag is a 16 bit word (may fetch more bits though);
   LSB=0: Short Run
     (15:8): Match Distance (if non 0)
     ( 7:5): Raw Length (0..7)
     ( 4:1): Match Length (4..19)
   LSB=01: Longer Run
     (31:16): Match Distance (up to 64K)
     (15:11): Raw Length (0..31)
     (10: 2): Match Length (4..515)

With a few special cases:
   tag=0x0000: EOB
   if md==0:
     if(ml!=4)
       rl+=(ml-3)*X;  //longer run of literal bytes

So, main decoder might look something like (pseudo C):
   cs=src;
   ct=dst;
   while(1)
   {
     tag=*(u32 *)cs;  //assume misaligned safe and little endian
     if(!(tag&1))
     {
       rl=(tag>>5)&7;
       ml=((tag>>1)&15)+4;
       md=(tag>>8)&255;
       cs+=2;
       if(!md)
       {
         if(!tag)
           break;  //EOB
         if(ml!=4)
           rl+=(ml-3)*8;
       }
     }else if(!(tag&2))
     {
       rl=(tag>>11)&31;
       ml=((tag>>2)&511)+4;
       md=(tag>>16)&65535;
       cs+=4;
       if(!md)
       {
         if(ml!=4)
           rl+=(ml-3)*32;
       }
     }else
     {
        //maybe more cases
     }

     if(rl)
     {
       memcpy(ct, cs, rl);
       cs+=rl; ct+=rl;
     }

     if(md)
     {
       matchcopy(ct, ct-md, ml);
       ct+=ml;
     }
   }

The matchcopy function would resemble memcpy, but have different 
semantics in the case of self-overlap.

Say, a simple version (not tuned for performance):
   void matchcopy(byte *dst, byte *src, int len)
   {
     if((src+len)<dst)
     {
        //no overlap
        memcpy(dst, src, len);
        return;
     }
     while(len--)
       *dst++=*src++;
   }

The match copy operation tends to bigger and more complicated if one 
tries to make it faster (where generally byte-for-byte copies are rather 
slow, and the case of long repeating RLE like runs isn't particularly rare).


Designing an LZ encoder is more of a challenge though.
Where typically, speed, compression, and code simplicity, are all 
mutually opposed (a simple LZ encoder will be either slow or give crappy 
compression, though speed is a matter of perspective).


Note that comparably, Deflate is a fairly complicated format.





Huffman coding is effective, but as noted, comes at a cost in terms of 
both speed and code complexity (but is still much faster than something 
like range coding).

Though, partial ways to make Huffman decoding a little faster (in the 
design of a compressor):
Limit maximum symbol length to around 12 or 13 bits, which allows lookup 
tables to fit in the L1 caches of typical CPUs;
Decode blocks of symbols in advance, then fetch symbols as-needed from 
these blocks (this allows overlapping the Huffman symbol decoding, 
making more effective use of the CPU pipeline).

These designs had sort of ended up resembling an entropy-coded version 
of LZ4 (usually with Huffman coded blocks for tags, literal bytes, and 
length/distance prefixes; using a scheme similar to Deflate's distance 
coding scheme for longer match or raw lengths, as well as for match 
distance). Note that after decoding the symbol blocks, the bitstream 
would only contain "extra bits" for the distance coding.

But that said, on a modern PC style CPU, it is rather difficult to get 
Huffman coded designs much over about 600 or 700 MB/sec for decode 
speed, even with these tricks (if one skips an entropy-coding stage, 
typically several GB/sec is possible for LZ decoding).



Though, one possibility could be a pseudo-entropy scheme, say:
   Sort all the symbols into descending frequency order;
     Stored directly as a table of 128 bytes (skip uncommon bytes).
   Encode the symbol blocks with bytes encoding table indices, say:
   00..78: Symbol Pair (0..10)
   7F: Escaped Symbol (encoded as a byte)
   80..FF: Direct Index (0..127)

With the block being encoded as a blob of raw bytes if the 
pseudo-entropy scheme didn't save enough space (say, if there isn't a 
significant probability skew). Idea being that this would be faster to 
decode than blobs of Huffman-coded symbols.


Then, say, block is encoded as a 16-bit prefix, say:
========== REMAINDER OF ARTICLE TRUNCATED ==========