Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail From: BGB Newsgroups: comp.lang.c Subject: Re: simple compression mathod to code itself? Date: Sun, 7 Apr 2024 15:08:57 -0500 Organization: A noiseless patient Spider Lines: 221 Message-ID: References: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Injection-Date: Sun, 07 Apr 2024 20:09:01 +0200 (CEST) Injection-Info: dont-email.me; posting-host="9796307e8b8c0234504f926a63004b70"; logging-data="3139684"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+VqSyNdT2ppsxXf07+b1AZ1vpCfcMacG0=" User-Agent: Mozilla Thunderbird Cancel-Lock: sha1:k3kB2UI5MJijuutE0isC7T4clnw= In-Reply-To: Content-Language: en-US Bytes: 7996 On 4/6/2024 11:41 AM, Scott Lurndal wrote: > David LaRue writes: >> fir wrote in news:uurefe$8i04$1@i2pn2.org: >> >>> fir wrote: >> >>> >>> if someone want to talk on compression adn how to code it i could like >>> to read it (as reading net articles on this may be seem to hard for my >>> old sick head and posts are much are easier to get into it) >> >> There are several good books on search and compression methods that provide >> examples of complexity and discussions about the complexity and performance. > > https://en.wikipedia.org/wiki/Lossless_compression > > One of the earliest: > > https://en.wikipedia.org/wiki/Huffman_coding > > Former colleague wrote this one: > > https://en.wikipedia.org/wiki/Gzip > > This was once under patent to a former employer: > > https://en.wikipedia.org/wiki/LZ77_and_LZ78 Also LZ4 is a moderately simple format, and works reasonably OK. I have another (similar category but different design) format that gets slightly better compression to LZ4 at a similar decode speed. I guess, if I were to try to quickly design another LZ format that was simple to decode, say: Tag is a 16 bit word (may fetch more bits though); LSB=0: Short Run (15:8): Match Distance (if non 0) ( 7:5): Raw Length (0..7) ( 4:1): Match Length (4..19) LSB=01: Longer Run (31:16): Match Distance (up to 64K) (15:11): Raw Length (0..31) (10: 2): Match Length (4..515) With a few special cases: tag=0x0000: EOB if md==0: if(ml!=4) rl+=(ml-3)*X; //longer run of literal bytes So, main decoder might look something like (pseudo C): cs=src; ct=dst; while(1) { tag=*(u32 *)cs; //assume misaligned safe and little endian if(!(tag&1)) { rl=(tag>>5)&7; ml=((tag>>1)&15)+4; md=(tag>>8)&255; cs+=2; if(!md) { if(!tag) break; //EOB if(ml!=4) rl+=(ml-3)*8; } }else if(!(tag&2)) { rl=(tag>>11)&31; ml=((tag>>2)&511)+4; md=(tag>>16)&65535; cs+=4; if(!md) { if(ml!=4) rl+=(ml-3)*32; } }else { //maybe more cases } if(rl) { memcpy(ct, cs, rl); cs+=rl; ct+=rl; } if(md) { matchcopy(ct, ct-md, ml); ct+=ml; } } The matchcopy function would resemble memcpy, but have different semantics in the case of self-overlap. Say, a simple version (not tuned for performance): void matchcopy(byte *dst, byte *src, int len) { if((src+len)