pirx Posted May 16, 2019 Share Posted May 16, 2019 Yes unfortunately for SAP-R, the window has to be 128-256 bytes for any decent compression, [...] I've just got this idea, what about using Huffman coding where alphabet is build from 72-bit letters (9 bytes). If the entropy is low (and it is in SAP Type-R), this would yeld well over 9 times compression. Just looked at the real data - 89KiB SAP contains 790 symbols with this frequency: 1,8,8,8,1,6,6,8,1,8,8,6,8,1,2,142,69,59,142,47,3,24,20,22,7,2,3,25,12,12,12,60,5,28,28,29,3,10,22,14,13,18,2,2,2,1,1,1,1,1,5,14,32,11,17,5,48,8,22,16,20,1,44,5,22,17,17,3,27,8,12,9,7,2,1,1,2,8,8,14,5,4,3,20,8,7,9,1,6,6,8,6,6,6,3,6,6,3,3,7,1,9,6,7,6,3,3,6,1,11,4,2,6,10,5,6,5,4,2,2,1,1,4,5,2,1,1,6,19,15,18,14,11,14,3,15,18,14,7,3,14,47,64,38,138,142,17,88,72,56,4,18,296,47,29,22,16,68,48,38,23,4,12,16,12,8,4,6,4,4,2,5,2,3,4,2,20,40,20,16,10,16,11,9,52,3,13,31,31,24,12,2,64,28,23,24,14,10,14,10,5,2,2,4,20,13,11,123,19,12,16,12,12,12,18,12,12,12,12,34,12,12,6,13,12,6,12,12,12,12,16,14,11,11,6,12,6,3,39,9,6,9,6,6,6,9,6,6,3,9,12,85,35,12,12,6,18,12,17,12,12,12,18,12,12,12,2,6,9,7,6,61,1,8,6,9,6,6,6,9,6,6,6,6,16,6,6,3,2,6,6,3,6,6,6,6,14,10,1,5,1,6,9,6,62,7,6,7,6,6,6,9,6,6,6,6,18,6,6,3,2,6,6,6,6,6,6,9,6,6,6,1,8,2,3,6,2,6,11,5,8,6,9,6,6,6,9,6,6,1,1,1,1,2,7,4,1,1,2,1,1,1,1,1,1,1,8,6,4,2,1,1,1,2,2,2,2,1,1,1,1,4,32,84,4,4,51,8,32,83,5,2,5,57,6,16,22,2,4,1,2,59,1,2,8,6,4,4,8,6,8,8,4,6,4,4,8,4,10,9,4,3,2,2,4,3,4,4,2,3,2,2,4,2,5,4,2,3,2,4,3,4,4,10,1,11,11,11,1,5,108,66,52,48,90,97,92,78,71,92,90,103,15,74,66,7,26,70,101,54,50,20,2,4,10,14,8,8,6,4,6,4,2,2,4,6,4,2,4,2,2,4,5,5,4,2,17,37,21,16,6,4,1,5,4,4,4,6,4,2,4,3,1,4,2,10,14,8,8,14,12,17,12,9,8,12,8,1,7,8,1,5,10,14,8,8,4,3,4,15,10,8,8,1,4,6,4,4,4,4,8,4,4,2,5,13,17,13,12,8,22,6,24,60,88,47,43,29,29,31,24,3,10,13,12,10,6,9,22,14,10,3,15,13,15,12,6,5,3,1,14,17,19,13,9,1,9,5,3,4,1,5,4,4,4,4,1,4,4,5,4,4,2,15,35,52,28,24,42,28,3,39,28,18,6,22,25,26,16,23,3,16,13,17,23,18,5,14,11,7,16,6,4,6,4,2,4,4,4,2,2,4,2,2,6,3,6,6,6,10,1,3,7,13,10,3,12,10,5,3,15,12,18,12,6,6,12,15,10,3,12,10,5,3,1,1,1,1,1,1,1,1,1,4,9,15,7,7,1,7,6,1,7,6,1,1,10,15,10,10,6,4,1,1,1,2,2,1,1,1,1,1,2,30,5,3,7,5,71,6,6,7,7,1,8,1,9,6,6,11,2,5,2,6,1,1,2,1,3,3,1,10,10,1,1,2,1,2,4,3,2,2,2,2,1,2,3,2,2,2,1,2,2,3,2,1,1,1,1,2,2,1,2,2,1,1,1,1,1,1 a lot of singletons, unfortunately, but still looks promising. The only issue will be the large symbol table (do not remember the correct name). Another idea: split SAP-R to 9 streams, delta them (good especially for AUDCn) and huffman, reverse when playing. Will result in small symbol tables and quite large decompression routine :] Quote Link to comment Share on other sites More sharing options...
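A minimal host-side sketch, in C, of counting those 72-bit symbols for your own tune — it assumes the SAP-R dump is just a raw sequence of 9-byte POKEY register frames with no header, which may need adjusting for real files, and it does no bounds checking on the symbol table:

/* Count how many distinct 9-byte frames a SAP-R dump contains, and how often
   each one occurs - i.e. the alphabet for the proposed Huffman coder.
   Sketch only: assumes a raw sequence of 9-byte frames, no overflow checks. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FRAME 9

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s dump.sapr\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    unsigned char (*sym)[FRAME] = malloc(65536 * FRAME);
    long *count = calloc(65536, sizeof(long));
    int nsym = 0;
    unsigned char frame[FRAME];

    while (fread(frame, 1, FRAME, f) == FRAME) {
        int i;
        for (i = 0; i < nsym; i++)              /* linear search is fine for a   */
            if (!memcmp(sym[i], frame, FRAME))  /* few hundred distinct frames   */
                break;
        if (i == nsym) memcpy(sym[nsym++], frame, FRAME);
        count[i]++;
    }
    fclose(f);

    printf("%d distinct %d-byte symbols\n", nsym, FRAME);
    for (int i = 0; i < nsym; i++)
        printf("%ld%c", count[i], i + 1 == nsym ? '\n' : ',');
    return 0;
}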
rensoup Posted May 16, 2019 Author Share Posted May 16, 2019 (edited) Pure LZSS might be what you're looking for. The compression ratio is worse than LZ4, but the coding for the decompressor is absolutely trivial. It is what a lot of games in the late 1980s and early 1990s used before programmers tweaked the details of the algorithm to get even better compression, but at some cost in speed. Thanks, I found a 65c816 decoder for it ( https://forums.nesdev.com/viewtopic.php?f=12&t=5771) but its author says it requires 4KB of mem... that unfortunately seems like a no-go for me! It's a really simple concept, but, as you've found, you need to keep the decompressed data around in order to be referenced. This usually isn't a problem, but you do need to know that you've got to do that. Well that's why I was hoping for a modified RLE, because the non-repeated bytes are already stored decompressed in the source. So if some sequences are repeated, you could just store an offset and length. I'm just surprised that nobody has done that. I actually gave it a shot yesterday... it was a lot more annoying than I thought, but it seems to work. The gains vary from nothing to about 25% compared to regular RLE, and I get the feeling there's more that can be done. LZ4 was optimized for speed of compression/decompression for use in realtime on 32-bit CPUs ... I've yet to find a case where it actually makes sense to use it on 8-bit or 16-bit computers/consoles. Something like Jorgen Ibsen's aPlib offers better compression, with only a little extra cost in speed ... and if you really need the speed, then pure LZSS is hard to beat. At 300-600 bytes/frame I would agree, but dmsc's LZ4 decoder benchmark seems promising. Edited May 16, 2019 by rensoup Quote Link to comment Share on other sites More sharing options...
rensoup Posted May 16, 2019 Author Share Posted May 16, 2019 Did you benchmark LZO? It's supposed to be designed around decompression speed. Also, if the idea is just to blast this to the screen as you decompress, you could even add a special flag to the decompressor which says "Move to this address before continuing to decompress". Then you can just let the decompression loose on an entire run of data to do a frame update without having to write "same" data to parts of the screen. found this post: https://forums.nesdev.com/viewtopic.php?f=2&t=13671 I started with the assumption LZO would be best of the no-RAM algos, and indeed it was. Its speed was a surprise however, on desktops it is extremely fast, here it's over twice as slow as zlib. Being GPL it is not usable for many closed projects, though commercial licenses are available. ouch... Quote Link to comment Share on other sites More sharing options...
rensoup Posted May 16, 2019 Author Share Posted May 16, 2019 no. compressed data: 15kb, decompressed: 35147 bytes, so: 35147 / 55 = 639 bytes Oops, you're right. dmsc's version seems a lot faster though. Quote Link to comment Share on other sites More sharing options...
dmsc Posted May 16, 2019 Share Posted May 16, 2019 Hi! Well that's why I was hoping for a modified RLE, because the non-repeated bytes are already stored decompressed in the source. So if some sequences are repeated, you could just store an offset and length. I'm just surprised that nobody has done that. I actually gave it a shot yesterday... it was a lot more annoying than I thought, but it seems to work. The gains vary from nothing to about 25% compared to regular RLE, and I get the feeling there's more that can be done. LZ4 is exactly that: it stores offsets to old data instead of repeating it. LZSS is not used anymore because decompression is slower than LZ4: you have to do a bit shift for each literal (i.e. copied) byte, whereas in LZ4 you have a count of literal bytes beforehand. Also, in 6502 assembly, LZ4 decompression code is just shorter, because you reuse the code for copy-from-past to do the copy-from-stream. Remember that you need to compare the memory needs of the decoder plus the buffers plus the data structures when comparing the compression performance. This is why using Huffman coding is not very appealing: you need about 512 bytes of tables to hold the trees and codes, and those bytes could be better used as a buffer for LZ* compression. Quote Link to comment Share on other sites More sharing options...
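For readers unfamiliar with the LZ4 block format being discussed, a simplified C sketch of a block decoder makes the "count of literal bytes beforehand" point concrete: each sequence starts with a token byte holding the literal count in the high nibble and the match length minus 4 in the low nibble, followed by optional length-extension bytes and a 2-byte little-endian offset. End-of-block corner cases and all bounds checking are glossed over here:

/* Simplified LZ4 block decoder: literals are counted, not flagged bit-by-bit.
   Sketch only - no bounds checking, end-of-block rules glossed over. */
static void lz4_block_decode(const unsigned char *src, const unsigned char *src_end,
                             unsigned char *dst)
{
    while (src < src_end) {
        unsigned token = *src++;
        unsigned len = token >> 4;                 /* literal count */
        if (len == 15)                             /* optional extension bytes */
            { unsigned b; do { b = *src++; len += b; } while (b == 255); }
        while (len--) *dst++ = *src++;             /* copy literals in one run */

        if (src >= src_end) break;                 /* last sequence has no match */

        unsigned offset = src[0] | (src[1] << 8);  /* 2-byte little-endian offset */
        src += 2;
        len = (token & 15) + 4;                    /* match length, minimum 4 */
        if ((token & 15) == 15)
            { unsigned b; do { b = *src++; len += b; } while (b == 255); }
        const unsigned char *ref = dst - offset;   /* copy from already-written output */
        while (len--) *dst++ = *ref++;
    }
}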
rensoup Posted May 16, 2019 Author Share Posted May 16, 2019 (edited) In my code above I made an optimization: I replace the "copy distance" field with a "copy address" field (= position - distance), so the copy can be done directly. This also allows (by modifying the compressor) to use *any* place in the target memory as a source, so you could stream frames using the read buffer and the target screen, so you can send only changed bytes. Ok that was confusing... you're saying that instead of using the regular LZ4 encoder, I could use the LZ4 API to set the LZ4 dictionary to come from the previously compressed source data, instead of the decompressed destination? Something like: for every new chunk of 9 bytes to compress: - Set the dictionary to be the compressed source data, so the dictionary grows a bit more for every 9-byte chunk, which in theory improves compression every time. - Compress 9 bytes using the new dictionary? LZ4 is exactly that: it stores offsets to old data instead of repeating it. Yes, that bit I know, but these are offsets into the decompressed data; from what you're saying above I could change that to use offsets into the compressed source? Edited May 16, 2019 by rensoup Quote Link to comment Share on other sites More sharing options...
rensoup Posted May 16, 2019 Author Share Posted May 16, 2019 I've just got this idea, what about using Huffman coding where alphabet is build from 72-bit letters (9 bytes). If the entropy is low (and it is in SAP Type-R), this would yeld well over 9 times compression. [...] Another idea: split SAP-R to 9 streams, delta them (good especially for AUDCn) and huffman, reverse when playing. Will result in small symbol tables and quite large decompression routine :] Damn and I was hoping for an existing solution Quote Link to comment Share on other sites More sharing options...
dmsc Posted May 16, 2019 Share Posted May 16, 2019 Hi! Ok that was confusing... you're saying that instead of using the regular LZ4 encoder, I could use the LZ4 API to set the LZ4 dictionary to come from the previously compressed source data, instead of the decompressed destination? Something like: for every new chunk of 9 bytes to compress: - Set the dictionary to be the compressed source data, so the dictionary grows a bit more for every 9-byte chunk, which in theory improves compression every time. - Compress 9 bytes using the new dictionary? Yes, that bit I know, but these are offsets into the decompressed data; from what you're saying above I could change that to use offsets into the compressed source? Yes. In my source code, to simplify the decoder, I parse all the LZ4 blocks, replacing the offsets (that are stored as "bytes before the current position") with the actual address from which to copy the data. This is possible as I know all the addresses that the 6502 code will use. Also, if you want to compress blocks of only 9 bytes, the compression will not work very well (as the minimal block size in LZ4 is 3 bytes). Try to compress at least 8 frames (72 bytes) in each block. As I said, using 150 bytes for the code and then trying to spare bytes in the block is not helpful, as you *always* have buffers in memory (i.e., the disk-reading buffer, the 6502 stack, etc.). Quote Link to comment Share on other sites More sharing options...
dmsc Posted May 16, 2019 Share Posted May 16, 2019 Hi! In my source code, to simplify the decoder, I parse all the LZ4 blocks, replacing the offsets (that are stored as "bytes before the current position") with the actual address from which to copy the data. This is possible as I know all the addresses that the 6502 code will use. Also, if you want to compress blocks of only 9 bytes, the compression will not work very well (as the minimal block size in LZ4 is 3 bytes). Try to compress at least 8 frames (72 bytes) in each block. As I said, using 150 bytes for the code and then trying to spare bytes in the block is not helpful, as you *always* have buffers in memory (i.e., the disk-reading buffer, the 6502 stack, etc.). Also, another point. The LZ4 format specifies that the last block *must* have 5 literal bytes (an 8-byte block); this is to simplify writing decoders for 32-bit CPUs. In the 6502, the last block could have only one literal. You can modify the line

#define LASTLITERALS 5   /* see ../doc/lz4_Block_format.md#parsing-restrictions */

I set it to 1 and reduced the block sizes. Quote Link to comment Share on other sites More sharing options...
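On the host side, one hedged sketch of the "growing dictionary" idea dmsc describes, assuming the standard liblz4 streaming API (LZ4_compress_fast_continue) and an arbitrary 72-byte chunk size. It keeps the whole input in one buffer so earlier chunks stay valid as the dictionary; a real tool would also have to record each chunk's compressed size so the 6502 loader can find block boundaries:

/* Compress a SAP-R dump in small chunks, letting each chunk use all previously
   seen data as its dictionary. Sketch only - assumes the standard liblz4
   streaming API and an arbitrary CHUNK size of 8 frames. */
#include <stdio.h>
#include <lz4.h>

#define CHUNK 72   /* 8 frames of 9 bytes */

long compress_stream(const char *in, long in_len, FILE *out)
{
    LZ4_stream_t *ls = LZ4_createStream();
    char dst[LZ4_COMPRESSBOUND(CHUNK)];
    long total = 0;

    for (long pos = 0; pos < in_len; pos += CHUNK) {
        int n = (in_len - pos < CHUNK) ? (int)(in_len - pos) : CHUNK;
        int c = LZ4_compress_fast_continue(ls, in + pos, dst, n, sizeof(dst), 1);
        if (c <= 0) { total = -1; break; }
        fwrite(dst, 1, c, out);   /* a real tool would also store each chunk's size */
        total += c;
    }
    LZ4_freeStream(ls);
    return total;
}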
rensoup Posted May 17, 2019 Author Share Posted May 17, 2019 Hi! Yes. Well that's some weird stuff. Not sure it's worth it, let me PM you to explain... Quote Link to comment Share on other sites More sharing options...
elmer Posted May 17, 2019 Share Posted May 17, 2019 (edited) Thanks, I found a 65c816 decoder for it ( https://forums.nesdev.com/viewtopic.php?f=12&t=5771) but its author says it requires 4KB of mem... that unfortunately seems like a no-go for me! You just found the source for one particular decoder that uses a fixed ring-buffer in memory. You wouldn't need that on the Atari. The ring-buffer technique is useful when decompressing to an area that you can't randomly read from, such as the VRAM in a NES. The LZSS algorithm has no more overhead than LZ4. BTW, traditional LZSS uses a 16-bit value for each match, 12-bits for the match position and 4-bits for the match length. Various games changed this to 8-bits for match position and 4-bits for match length. That hurts compression ratio, but it can make the decompressor a bit faster, and it doesn't hurt you much if you're decompressing smallish amounts of data at a time. found this post: https://forums.nesdev.com/viewtopic.php?f=2&t=13671 ouch... That's another terrible example of decompression, because the OP is comparing one hand-optimized assembly-language decompressor to a bunch of decompressors compiled from C using CC65 ... which is a terrible idea! Well that's why I was hoping for a modified RLE, because the non-repeated bytes are already stored decompressed in the source. So if some sequences are repeated, you could just store an offset and length. I'm just surprised that nobody has done that. It sounds like you're talking about using an offset/length into the *compressed* data stream, rather than the *decompressed* data stream. Most people come up with that idea at some point, but when they try to implement it, they find that it doesn't work well in practice, and your compression ratio goes all-to-heck (compared to any of the LZ schemes). Damn and I was hoping for an existing solution This is the most important thing that you've said so far. If you really want to just take existing code, then dmsc has offered you a perfectly-usable and pretty-optimal solution. The key for any of the LZ variants is to get that copy-loop as fast as you can. dmsc's inner loop for that is about as good as you're going to get on the original NMOS 6502. That inner loop will work on *any* of the LZ variants. Beyond that, speedups tend to come from removing subroutine calls and inlining stuff. Edited May 17, 2019 by elmer 1 Quote Link to comment Share on other sites More sharing options...
elmer Posted May 17, 2019 Share Posted May 17, 2019 LZSS is not used anymore because decompression is slower than LZ4, because you have to do a bit shift for each literal (i.e. copied) byte, as opposed to LZ4 where you have a count of literal bytes beforehand. We'll have to agree to disagree on that. The LZSS bitshift, traditionally of a zero-page location, is much quicker than all of the setup that LZ4 has to do for a copy-literals-from-compressed-data. LZ4 will only start to outpace a well-written 6502 LZSS decompressor *if* the count of literals to copy at one time is large-enough. Here's a post that I made on the subject a few years ago, regarding the game data for a PC Engine translation (which, if you're not familiar with the console, runs a customized 7.16MHz 65C02 variant with a bunch of extra, and highly useful, new instructions) ... One of the "big ideas" in the currently-popular LZ4 compression scheme is that you can COPY multiple bytes at a time, rather than using a bit for each one to say that it's a COPY and not an LZSS match. This also allows LZ4 to guarantee that a COPY is followed by an LZSS match, which saves another bit. Unfortunately, if an LZSS match directly follows another LZSS match, then 4 bits are wasted to indicate a COPY length of zero. So I decided to look at the Xanadu 2 data and get some stats on how often an LZSS match is followed by another LZSS or a COPY. After one LZSS, number of times COPY is next ... 867,693 After one LZSS, number of times LZSS is next ... 1,290,909 I also decided to take a look at how often each COPY length crops up. COPY 1 occurs 291,422 times. COPY 2 occurs 152,959 times. COPY 3 occurs 101,741 times. COPY 4 occurs 70,629 times. COPY 5 occurs 48,244 times. COPY 6 occurs 37,597 times. COPY 7 occurs 27,232 times. COPY 8 occurs 23,353 times. COPY 9 occurs 16,682 times. COPY 10 occurs 16,159 times. COPY 11 occurs 12,960 times. COPY 12 occurs 11,062 times. COPY 13 occurs 8,580 times. COPY 14 occurs 8,237 times. COPY 15 occurs 5,806 times. COPY >15 occurs 40,045 times. Number Of Copies 872,708, Total Bytes Copied 3,882,406. So, over half of the copies were for 3 bytes-or-less. I think that you'll find that the reason that LZSS isn't used anymore on modern computers is more to do with its traditional 4096-byte window and 4-bit length count not matching modern data well, especially with things like 32-bit RGBA pixels and CPU instructions. LZ4 definitely gets better compression than LZSS ... but then virtually everything else beats LZ4, and sometimes with only a little extra cost in cycles. LZ4 also has better worst-case compression behavior than LZSS, but that is usually more of a practical concern in the real-time compression situations that LZ4 was designed for, than things like games where you have other ways to deal with uncompressable data. I still stand by what I said before ... but since it's just my opinion, you're welcome to feel differently. IMHO, for speed and simplicity, it's hard to beat LZSS on an 8-bit CPU like the 6502. If you really want good compression, then there are better options than LZ4, and you probably won't notice the cost in CPU cycles. Also, in 6502 assembly, LZ4 decompression code is just shorter, because you reuse the code for copy-from-past to do the copy-from-stream. Even if I agreed with you, which I don't, why-on-earth would a few extra bytes in the decompression code really make any difference? 
You're still talking about a routine that will quite-happily fit into either zero-page, or unused stack-space, if you're trying to decompress a whole program on load. ... And if you're doing that on an Atari, then using a better compression scheme (such as aPlib) would give you less data to load through the SIO, and save more loading time than the little extra CPU cost in the decompression code would take. BTW ... apart from our difference of opinion, which is mostly of a theoretical nature, rather than something that should have any effect on rensoup's decision over what to use, I've got to say that your LZ4 decompressor is *very* nicely written. Your "docopy" inner-loop is a beautiful example of using the X register for holding both the low-byte of the copy-count and the offset at the same time! :-) 1 Quote Link to comment Share on other sites More sharing options...
ivop Posted May 17, 2019 Share Posted May 17, 2019 Another idea: split SAP-R to 9 streams, delta them (good especially for AUDCn) and huffman, reverse when playing. Will result in small symbol tables and quite large decompression routine :] My idea has always been to split into 9 streams, delta, and RLE. Never got around to implementing it yet. I think RLE will be enough to shrink this kind of data enormously. Quote Link to comment Share on other sites More sharing options...
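As a rough illustration of that split/delta/RLE pipeline, here is a host-side C sketch; the (count, value) RLE format is an arbitrary choice, and the on-target player would need a matching decoder that accumulates the deltas back into register values:

/* Split a SAP-R dump into 9 per-register streams, delta each stream,
   then RLE it as (count, delta) pairs. Format is arbitrary - the 6502
   player would need a matching decoder. */
#include <stdio.h>

void delta_rle_register(const unsigned char *dump, long frames, int reg, FILE *out)
{
    unsigned char prev = 0;                      /* register value in the previous frame */
    long i = 0;
    while (i < frames) {
        unsigned char d = (unsigned char)(dump[i * 9 + reg] - prev);   /* delta for frame i */
        long run = 1;
        while (i + run < frames && run < 255 &&
               (unsigned char)(dump[(i + run) * 9 + reg] - dump[(i + run - 1) * 9 + reg]) == d)
            run++;
        fputc((int)run, out);                    /* run length 1..255 */
        fputc(d, out);                           /* repeated delta (0 is very common for AUDCx) */
        i += run;
        prev = dump[(i - 1) * 9 + reg];          /* last value actually covered */
    }
}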
flashjazzcat Posted May 17, 2019 Share Posted May 17, 2019 Not to pretend I have any experience with LZ compression, since I don't, but I achieved good results with RLE compression on data with a reasonable amount of repetition by compressing/decompressing on a row/column basis rather than as a continuous linear stream. Obviously this lends itself well to data which has conspicuous repeated runs in columns, such as a bit-mapped logo or title. But it lends itself to less obvious situations as well, such as 8x8 font data. Compressing a typical 1KB font as an 8x128 array yields enormously long runs of $00, which typically shaves a couple of hundred bytes off the length of the compressed data. Anyway: this is something I arrived at while messing around, but it may already be taken into account here. The decompressor remains wonderfully lightweight. 3 Quote Link to comment Share on other sites More sharing options...
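For illustration, the reordering described above is just a transpose before the RLE pass; a tiny C sketch, assuming the usual 1KB font of 128 characters with 8 bytes each:

/* Reorder a 1KB (128 chars x 8 bytes) font so that byte row 0 of every
   character comes first, then row 1, etc. Mostly-blank rows turn into
   long $00 runs, which a plain RLE pass then compresses very well. */
void transpose_font(const unsigned char font[1024], unsigned char out[1024])
{
    for (int row = 0; row < 8; row++)
        for (int chr = 0; chr < 128; chr++)
            out[row * 128 + chr] = font[chr * 8 + row];
}

The decompressor only has to reverse the indexing while it writes, so it stays as lightweight as a plain RLE decoder.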
elmer Posted May 17, 2019 Share Posted May 17, 2019 I hope this won't bore those of you reading this, but this thread seems like a good place to give a real-world example of the compression results when using different LZ-based compressors, so that folks can get a better idea of how small the improvement is that we're talking about for the more-advanced compressors. BTW, SWD is the compressor that I wrote to try to see how good a job I could do. The only difference between the SWD4 and SWD5 compressors is the switch from using the traditional 1-bit LZSS-style flag for literals, to using the LZ4-style count value for literals. It's a good indication of whether using the LZ4-style count is really much of an improvement. These results are from testing one game's code/data files on the PC Engine again. The results should be pretty-similar with Atari data.

Total uncompressed data size : 23,278,006
Total YsIV compression : 10,347,187 (LZSS with 4-bit count and 8-bit offset)
Total Anearth compression (no preload) : 9,706,570 (LZSS with 4-bit count and 12-bit offset)
Total Legend of Xanadu 1 compression : 8,937,897 (roughly equivalent to LZ4)
Total Legend of Xanadu 2 compression : 8,324,785
Total Elmer's SWD4 compression : 8,114,319 (with traditional 1-bit for literal/match)
Total Elmer's SWD5 compression : 8,035,016 (with LZ4-style count of literals)
Total Jørgen Ibsen's aPLib compression : 7,782,457
Total Pasi Ojala's puCrunch compression : 7,780,261
Total Magnus Lind's Exomizer : 7,377,469

Here's a direct comparison between the LZ4, SWD4&5 and Legend of Xanadu compressors, when testing the graphics files from this site ... http://www.brutaldeluxe.fr/products/crossdevtools/lz4/

LoX1 6,490 LoX2 5,844 Swd4 5,546 Swd5 5,556 LZ4 6,508 ANGELFISH.BIN
LoX1 23,379 LoX2 20,771 Swd4 20,617 Swd5 20,665 LZ4 23,520 ASTRONUT.BIN
LoX1 15,087 LoX2 14,027 Swd4 13,844 Swd5 13,863 LZ4 14,802 BEHEMOTH.BIN
LoX1 3,406 LoX2 2,519 Swd4 2,446 Swd5 2,449 LZ4 2,803 BIG.BIN
LoX1 9,329 LoX2 8,003 Swd4 7,933 Swd5 7,962 LZ4 8,865 BUTTERFLY.BIN
LoX1 7,039 LoX2 6,206 Swd4 6,102 Swd5 6,113 LZ4 6,654 CD.BIN
LoX1 18,368 LoX2 16,874 Swd4 16,335 Swd5 16,369 LZ4 18,876 CLOWN.BIN
LoX1 7,982 LoX2 7,215 Swd4 7,235 Swd5 7,260 LZ4 7,662 COGITO.BIN
LoX1 14,871 LoX2 12,748 Swd4 12,560 Swd5 12,581 LZ4 15,300 COTTAGE.BIN
LoX1 13,389 LoX2 11,873 Swd4 11,711 Swd5 11,733 LZ4 13,102 FIGHTERS.BIN
LoX1 13,431 LoX2 12,260 Swd4 11,936 Swd5 11,954 LZ4 13,220 FLOWER.BIN
LoX1 10,198 LoX2 9,029 Swd4 8,804 Swd5 8,813 LZ4 9,972 JAZZ.BIN
LoX1 14,834 LoX2 13,403 Swd4 13,182 Swd5 13,206 LZ4 14,810 KNIFE.BIN
LoX1 19,485 LoX2 17,932 Swd4 17,727 Swd5 17,730 LZ4 20,261 LORI.BIN
LoX1 8,724 LoX2 8,143 Swd4 7,893 Swd5 7,905 LZ4 8,643 MAX.BIN
LoX1 17,921 LoX2 15,114 Swd4 14,798 Swd5 14,803 LZ4 18,474 OWL.BIN
LoX1 20,625 LoX2 18,564 Swd4 18,285 Swd5 18,307 LZ4 20,595 RED.DRAGON.BIN
LoX1 15,496 LoX2 13,993 Swd4 13,483 Swd5 13,500 LZ4 16,306 TAJ.BIN
LoX1 12,907 LoX2 11,205 Swd4 11,128 Swd5 11,164 LZ4 12,551 TUT.BIN

My conclusion from these tests is that the improvement between LZ4 and LZSS (even when using a small 8-bit window offset) just isn't large enough to make LZ4 compelling *if* you can write an LZSS decompressor that is faster than your LZ4 decompressor. Similarly, if you're really looking for compression, then LZ4 just doesn't make much sense against something like Jørgen Ibsen's aPLib, which is basically just yet-another LZ compressor, but with super-optimized encoding of the copy lengths and offsets. 
1 Quote Link to comment Share on other sites More sharing options...
elmer Posted May 17, 2019 Share Posted May 17, 2019 (edited) Also, in 6502 assembly, LZ4 decompression code is just shorter, because you reuse the code for copy-from-past to do the copy-from-stream. Here you go, 118-bytes for an LZSS decompressor (4-bit count & 8-bit window), with full inlining for speed (i.e. no jsr/rts overhead) ... and the copy-from-past code *is* reused to save a few bytes. <EDIT> Modified from original to include dmsc's code for reading the nibbles, and also to include an alternate HuC6280 version. <EDIT2> Improved the nibble extraction, cutting 14-cycles-per-byte off the cost with no more bytes used. This is DONE!

lz_decompress:  ldx #0                  ; Initialize empty nibble.
                stx <lz_nibble
                ldy #0                  ; Initialize source index.
                sec
.load_command:  lda [lz_srcptr],y       ; Reload an empty bit-buffer.
                iny
                bne .skip1
                inc <lz_srcptr + 1
.skip1:         ror a
                sta <lz_bitbuf
                bne .got_command        ; Always true.
.next_command:  lsr <lz_bitbuf          ; Get next command bit.
                beq .load_command       ; Is the bit-buffer empty?
.got_command:   lda [lz_srcptr],y       ; Get literal/offset byte.
                iny
                beq .next_page
                bcc .match              ; CS=literal, CC=match.
.literal:       dex                     ; X == 0 from last copy loop.
                inc <lz_dstptr + 0      ; Write the byte directly to
                beq .write_byte         ; the output.
                dec <lz_dstptr + 1
                bcs .write_byte         ; Always true.
.next_page:     inc <lz_srcptr + 1
                bcs .literal            ; CS=literal, CC=match.
.match:         adc <lz_dstptr + 0      ; Window range $FFFF..$FF00.
                sta <lz_winptr + 0
                lda <lz_dstptr + 1
                adc #$FF
                sta <lz_winptr + 1      ; Window never in ZP, so NZ.
                lda <lz_nibble          ; Is there a nibble waiting?
                bne .got_nibble         ; X == 0 from last copy loop.
                lda [lz_srcptr],y
                beq .finished           ; Value = 0 == Finished.
                iny
                bne .skip2
                inc <lz_srcptr + 1
.skip2:         tax
                lsr a
                lsr a
                lsr a
                lsr a
.got_nibble:    stx <lz_nibble          ; Next nibble.
                and #$0F                ; Current nibble.
                tax
                sec                     ; Value 1..15 = Count 2..16.
                adc <lz_winptr + 0
                sta <lz_winptr + 0
                bcs .skip3
                dec <lz_winptr + 1
.skip3:         txa
                sec                     ; Value 1..15 = Count 2..16.
                adc <lz_dstptr + 0
                sta <lz_dstptr + 0
                bcs .skip4
                dec <lz_dstptr + 1
.skip4:         txa                     ; Value 1..15 = Offset $FE..$F0.
                eor #$FF
                tax
lz_winptr       = * + 1
.copy_loop:     lda $FFFF,x
lz_dstptr       = * + 1
.write_byte:    sta $FFFF,x
                inx
                bne .copy_loop
                inc <lz_dstptr + 1
                jmp .next_command

.finished:      rts

lz_srcptr       dw 0
lz_bitbuf       db 0
lz_nibble       db 0

Edited May 20, 2019 by elmer 2 Quote Link to comment Share on other sites More sharing options...
elmer Posted May 17, 2019 Share Posted May 17, 2019 Anyway: this is something I arrived at while messing around, but it may already be taken into account here. The decompressor remains wonderfully lightweight. This is excellent advice ... rearranging data to improve compression is a time-honored technique! It is basically the same idea as dmsc very-sensibly suggested earlier on ... there is absolutely nothing that *requires* the LZ4/LZSS window-onto-past-history to point into the data that you're decompressing right now, it can just-as-easily point into a previous frame's data if you're streaming. That's not how compressors are usually set up, but it's an easy modification to make if you've got the source-code and a willingness to modify and compile your own tools. 1 Quote Link to comment Share on other sites More sharing options...
dmsc Posted May 18, 2019 Share Posted May 18, 2019 Hi! It's great to have this kind of informed discussion here. We'll have to agree to disagree on that. That's good! The LZSS bitshift, traditionally of a zero-page location, is much quicker than all of the setup that LZ4 has to do for a copy-literals-from-compressed-data. LZ4 will only start to outpace a well-written 6502 LZSS decompressor *if* the count of literals to copy at one time is large-enough. Yes, but I think that copying more than 3 bytes already is faster in my LZ4 implementation, making most literal copies faster. Here's a post that I made on the subject a few years ago, regarding the game data for a PC Engine translation (which, if you're not familiar with the console, runs a customized 7.16MHz 65C02 variant with a bunch of extra, and highly useful, new instructions) ... I think that you'll find that the reason that LZSS isn't used anymore on modern computers is more to do with its traditional 4096-byte window and 4-bit length count not matching modern data well, especially with things like 32-bit RGBA pixels and CPU instructions. LZ4 definitely gets better compression than LZSS ... but then virtually everything else beats LZ4, and sometimes with only a little extra cost in cycles. LZ4 also has better worst-case compression behavior than LZSS, but that is usually more of a practical concern in the real-time compression situations that LZ4 was designed for, than things like games where you have other ways to deal with uncompressable data. I still stand by what I said before ... but since it's just my opinion, you're welcome to feel differently. I think that a common ground is:
- If your window size is small, LZSS will provide better compression, as you can use fewer bits for the encoding of small matches.
- If the window size is big and the data is not too repetitive, LZ4 will provide better compression for a comparable speed.
IMHO, for speed and simplicity, it's hard to beat LZSS on an 8-bit CPU like the 6502. I think that simplicity is about the same for LZ4 and LZSS. Also, you could include other LZ77 formats like LZO (with an encoding a little more complicated than LZ4). If you really want good compression, then there are better options than LZ4, and you probably won't notice the cost in CPU cycles. I agree, but for a streaming application, each cycle counts. Also, I think that for any increase in compression ratio you must use Huffman or other variable length coding, and this kills your RAM usage and speed in the 6502. Even if I agreed with you, which I don't, why-on-earth would a few extra bytes in the decompression code really make any difference? You're still talking about a routine that will quite-happily fit into either zero-page, or unused stack-space, if you're trying to decompress a whole program on load. ... And if you're doing that on an Atari, then using a better compression scheme (such as aPlib) would give you less data to load through the SIO, and save more loading time than the little extra CPU cost in the decompression code would take. Because the *total* size of the data is what matters: you need to compare the size of the compressed data plus the decompressor to get the "real" compression ratio. BTW ... apart from our difference of opinion, which is mostly of a theoretical nature, rather than something that should have any effect on rensoup's decision over what to use, I've got to say that your LZ4 decompressor is *very* nicely written. 
Your "docopy" inner-loop is a beautiful example of using the X register for holding both the low-byte of the copy-count and the offset at the same time! :-) Thanks! If you look at my FastBasic sources you can see that I use the same technique to implement the MOVE statement: https://github.com/dmsc/fastbasic/blob/master/src/interp/move.asm Much more difficult is the complement "backward move", as you can only copy 255 bytes in the inner loop: https://github.com/dmsc/fastbasic/blob/master/src/interp/nmove.asm I hope this won't bore those of you reading this, but this thread seems like a good place to give a real-world example of the compression results when using different LZ-based compressors, so that folks can get a better idea of how small the improvement is that we're talking about for the more-advanced compressors. BTW, SWD is the compressor that I wrote to try to see how good a job I could do. The only difference between the SWD4 and SWD5 compressors is the switch from using the traditional 1-bit LZSS-style flag for literals, to using the LZ4-style count value for literals. It's a good indication of whether using the LZ4-style count is really much of an improvement. Did you implement optimal parsing? (Or at least lazy parsing, checking at least 4 bytes in advance?) Most papers say that you can get a lot of extra compression ratio with that (and it is what the LZ4HC compressor does). These results are from testing one game's code/data files on the PC Engine again. The results should be pretty-similar with Atari data. [...] Here's a direct comparison between the LZ4, SWD4&5 and Legend of Xanadu compressors, when testing the graphics files from this site ... 
http://www.brutaldeluxe.fr/products/crossdevtools/lz4/ [...] My conclusion from these tests is that the improvement between LZ4 and LZSS (even when using a small 8-bit window offset) just isn't large enough to make LZ4 compelling *if* you can write an LZSS decompressor that is faster than your LZ4 decompressor. Similarly, if you're really looking for compression, then LZ4 just doesn't make much sense against something like Jørgen Ibsen's aPLib, which is basically just yet-another LZ compressor, but with super-optimized encoding of the copy lengths and offsets. Well, I agree, but I *do* think that LZ4 can be faster, especially for streaming images like my example. This is because you can copy large parts of one frame to the next. Here you go, 124-bytes for an LZSS decompressor (4-bit count & 8-bit window), with full inlining for speed (i.e. no jsr/rts overhead) ... and the copy-from-past code *is* reused to save a few bytes. LZ_GET_SRC is a macro which could be changed to a JSR if someone *really* wanted to save a few more bytes. Very good code! I like the way you extract 4 bits for each match from the length byte, but I think this could work better, with the only restriction of adding one zero byte to signal end of stream if not already 0:

        lda <lz_nibble
        bne .ok
        LZ_GET_SRC
        beq .finish
.ok:    tax
        lsr a
        lsr a
        lsr a
        lsr a
        sta <lz_nibble
        txa
        and #$0F

This should be 9 bytes less (if I counted right). I think that a "flexible" LZ77 compression library, supporting different coding schemes and multiple dictionary support (for implementing last-frame and current-frame matching) could be a great project. Then, you could experiment with increasing offset lengths, implementing "address instead of offsets" (like my modified LZ4) and many other variations to search the best compression/speed for a given application. Have fun! 1 Quote Link to comment Share on other sites More sharing options...
Heaven/TQA Posted May 18, 2019 Share Posted May 18, 2019 Ok, noob question: for 65816 SNES ROM depacking, what would you recommend? 2 Quote Link to comment Share on other sites More sharing options...
popmilo Posted May 18, 2019 Share Posted May 18, 2019 Guys, you are doing a great job on this topic! Thanks for all this info! Noob question 2 - what would you recommend for compression of sound samples? 2 Quote Link to comment Share on other sites More sharing options...
Irgendwer Posted May 18, 2019 Share Posted May 18, 2019 (edited) - what would you recommend for compression of sound samples? For the typical low-fi samples on "our" machines: quantized delta encoding. That exactly halves the data size. Edit: That said, you have to go for at least 15kHz replay to cope with the aliasing effects. If the replay frequency is lower, going for 2-bit raw data isn't that bad. Edit 2: Of course it depends on whether you like to depack the data on-the-fly while replaying or are able to inflate the data before playing. Both principles above are for real-time preparation of the data, saving RAM and not only floppy space. Edited May 18, 2019 by Irgendwer 4 Quote Link to comment Share on other sites More sharing options...
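A sketch of one possible 4-bit quantized-delta scheme in C, just to make the "halving" concrete — the 16-entry delta table and the clamping are arbitrary choices here, not necessarily what Irgendwer uses:

/* Pack 8-bit unsigned samples as 4-bit quantized deltas, two per byte:
   the output is exactly half the size. The 16-entry delta table is an
   arbitrary example, not a tuned one. */
static const int qdelta[16] = { 0, 1, 2, 4, 8, 16, 32, 64,
                                -1, -2, -4, -8, -16, -32, -64, -128 };

static int best_code(int target, int prev)
{
    int best = 0, berr = 1 << 30;
    for (int c = 0; c < 16; c++) {
        int v = prev + qdelta[c];
        if (v < 0) v = 0; else if (v > 255) v = 255;
        int e = (v - target) * (v - target);
        if (e < berr) { berr = e; best = c; }
    }
    return best;
}

void encode_qdelta(const unsigned char *in, long n, unsigned char *out)
{
    int prev = 128;                       /* assume samples centred on $80 */
    for (long i = 0; i + 1 < n; i += 2) {
        int c0 = best_code(in[i], prev);
        prev += qdelta[c0]; if (prev < 0) prev = 0; else if (prev > 255) prev = 255;
        int c1 = best_code(in[i + 1], prev);
        prev += qdelta[c1]; if (prev < 0) prev = 0; else if (prev > 255) prev = 255;
        out[i / 2] = (unsigned char)((c0 << 4) | c1);
    }
}

The replay side mirrors this with the same table: unpack two nibbles per byte and add the looked-up delta to the previous sample.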
rensoup Posted May 18, 2019 Author Share Posted May 18, 2019 (edited) My conclusion from these tests is that the improvement between LZ4 and LZSS (even when using a small 8-bit window offset) just isn't large enough to make LZ4 compelling *if* you can write an LZSS decompressor that is faster than your LZ4 decompressor. Well I got curious, so I took my 9 deinterleaved raw SAP-R channels and compressed them with LZSS v0.01, and the total was about 4KB vs 2KB for LZ4's best setting. That's just one very tiny set of data, and perhaps that wasn't the best LZSS compressor... Here you go, 124-bytes for an LZSS decompressor (4-bit count & 8-bit window), with full inlining for speed (i.e. no jsr/rts overhead) ... and the copy-from-past code *is* reused to save a few bytes. LZ_GET_SRC is a macro which could be changed to a JSR if someone *really* wanted to save a few more bytes. Well thanks for that! That's not how compressors are usually set up, but it's an easy modification to make if you've got the source-code and a willingness to modify and compile your own tools. The obvious reason for that would be that the compression ratio may be somewhat worse, I guess? Well, call me lazy but I've not used C in years (I'm all C# these days) and I don't even think I installed it in my VS. Edited May 18, 2019 by rensoup Quote Link to comment Share on other sites More sharing options...
dmsc Posted May 18, 2019 Share Posted May 18, 2019 Hi! Replying to a private message from @elmer, with his permission; I simplified the code a little from the response I gave to his PM. I've got a 35KB SAP-R, it's one of those fancy RMTs that use 40% of the frame for no good reason, seems that particular song doesn't use all the effects though. RMT size is 3.5KB. Btw, I know pretty much nothing about RMT, I just used the standard replay routine on that tune and SAP-Rd it. Compressed with LZ4 best setting it takes 9.5KB; deinterleaved into 9 files and LZ4 on each, total about 2KB (less than the original!). Obviously this is the way to go. ..... Instead of running the decoder 9 times per frame to output 1 byte for each channel, I just output 9 bytes of 1 channel then next frame 9 bytes from the next channel, so it's less costly to set up the decoder. I use a double 9x9 buffer so it plays from one 9x9 buffer while I fill up the other one over 9 frames. (At startup I just output all 9 bytes from all 9 buffers of course) ..... For your method, you suggested that I compress at least 72-byte chunks, but is it relevant in this case? Couldn't I compress bigger chunks like 512 bytes, as long as I modify the decoder to only output 9 bytes at a time and use that double buffer scheme (it would hurt speed a little of course)... First, replying to your last question: it is better to decompress a whole block (for example, 256 bytes) in one run, and then use a few bytes on each frame. This will make the code simpler and faster. But...! What you really want to do is implement a SAP player for your RMT tune, so perhaps you could concentrate on that problem first. You already discovered that the de-interleaved data compresses a lot; this means that each music channel is correlated with itself but independent of the others. This gives you a strong clue that this should be the method to use. As @elmer said, you can use LZSS for its simplicity. For example, you can start by simply coding each channel using RLE (like you proposed before), using code like this:

; Decode one "frame" of data (one byte for each POKEY register)
sap_loop:
        ldx #8          ; Loop through all "streams", one for each POKEY register
chn_loop:
        lsr bit_data    ; Get next bit
        bne got_bit
        jsr get_byte    ; Not enough bits, refill!
        ror a           ; Extract a new bit and add a 1 at the high bit (from C set above)
        sta bit_data
got_bit:
        bcc skip        ; Bit = 0 (CC) is "repeat last byte again", bit = 1 (CS) is "get a new byte"
        jsr get_byte
        sta POKEY, x
skip:
        dex
        bpl chn_loop
        ; continue.....

The coding uses one byte (bit_data) to hold bits that are 1 if the POKEY register changes or 0 if it keeps its value, and simply reads bytes from the stream when new bits or bytes are needed. Note that the *coded* bytes are interleaved; this makes the routine much simpler, as you read bytes using only one pointer. So a stream (in hex) like:

$FF XX YY ZZ MM NN QQ RR SS $80 TT $00 $01 UU $00 ....

This reads as: 8 bits 1, so read 8 values into POKEY registers 0 to 7; then one bit 1 and 7 bits 0, read 1 value to POKEY register 8, skip registers 0 to 6; then 8 bits 0, skip registers 7, 8 and 0-5; then 7 bits 0 and one bit 1, skip registers 6, 7, 8, 0, 1, 2 and 3, read one byte UU to register 4, and so on. So, this is equivalent to the data:

XX YY ZZ MM NN QQ RR SS TT XX YY ZZ MM NN QQ RR SS TT XX YY ZZ MM NN QQ RR SS TT XX YY ZZ MM UU QQ RR SS TT XX ....

Hope the above example makes it clearer. 
From that routine, you can add simple LZ77 matching, for example with distances from 0 to 15 and lengths from 2 to 17 (the 4 higher bits are the length, the lower 4 the position). The buffer is 9 * 16 = 144 bytes, as there is a 16-byte buffer for each "channel" (POKEY register). The code below is completely untested; I think that @elmer could verify and perhaps optimize it a little:

sap_loop:
        ldx #8            ; Loop through all "channels", one for each POKEY register
chn_loop:
        txa               ; Get channel buffer address, used to store old channel data
        asl
        asl
        asl
        asl
        sta cur_chan
        lda chn_copy, x   ; Get status of this stream, if > 0 we are copying bytes
        beq next_bit      ; No, we can get a new frame
copy_byte:
        dec chn_copy, x   ; Decrease match length, increase match position
        lda chn_pos, x
        and #$0F
        sec
        adc cur_chan
        sta chn_pos, x
        tay               ; Now, read old data, jump to data store
        lda buffer, y
        jmp store
next_bit:                 ; We are decoding a new match/literal
        lsr bit_data      ; Get next bit
        bne got_bit
        jsr get_byte      ; Not enough bits, refill!
        ror a             ; Extract a new bit and add a 1 at the high bit (from C set above)
        sta bit_data
got_bit:
        jsr get_byte      ; Always read a byte, it could mean "match size/offset" or "literal byte"
        bcc store         ; Bit = 0 is "literal", bit = 1 is "match"
        sta chn_pos, x    ; Store "position" (no need to AND #$0F, it will be done above)
        adc #$1F          ; Adds 2 to match length (C = 1 from above)
        lsr
        lsr
        lsr
        lsr
        sta chn_copy, x   ; Store in "copy length"
        bne copy_byte     ; go to copy first byte
store:
        sta POKEY, x
        lda cur_pos       ; Store into buffer, get current position + channel into Y
        adc cur_chan
        tay
        sta buffer, y
        dex
        bpl chn_loop      ; Next channel
        ; continue.....

1 Quote Link to comment Share on other sites More sharing options...
elmer Posted May 18, 2019 Share Posted May 18, 2019 It's great to have this kind of informed discussion here Yep, it's fun to find an opportunity to talk about compression techniques, even if we draw different conclusions from our experiences. Yes, but I think that copying more than 3 bytes already is faster in my LZ4 implementation, making most literal copies faster. Absolutely! The question then becomes ... do you get enough copies of more-than-3-bytes to make up for loss in speed on 3-or-less? This question becomes even more interesting when you're programming for other 6502-variants, and have some extra CPU instructions to play with. Anyway ... my personal testing showed that copies of 3-or-less were more common than copies of more-than-3. It all depends upon the actual data that we're compressing. I think that a common ground is: - If your window size is small, LZSS will provide better compression, as you can use less bits for the encoding of small matches. - If the window size is big and the data is not too repetitive, LZ4 will provide better compression for a comparable speed. Yep, that's about right. In my experience on 8-bit and 16-bit platforms, from the POV of compression-over-speed, you really get a benefit from finding an algorithm that can encode 2-byte matches efficiently. Whether you then use the traditional 1-bit-for-copy-or-match vs the count-of-bytes-to-copy doesn't seem to matter much to the compression ratio in most data. I agree, but for a streaming application, each cycle counts. Which is where we get back to our slightly different opinions on which algorithm is faster in *typical* real-world cases (on 8-bit and 16-bit machines)! Really ... the only answer is for any potential user to try both and see what the difference is with *their* data. Also, I think that for any increase in compression ratio you must use Huffman or other variable length coding, and this kills your ram usage and speed in the 6502. Going with Huffman, as in gzip, will certainly get you better compression ... but the CPU and space cost definitely doesn't work on a 6502. Other variable-length encoding schemes (which LZ4 actually is with its variable-length-count) can definitely provide good results. Just look at where puCrunch, aPlib, Exomizer, and my own SWD compressors fall in comparison to LZ4. All of those compressors were written for use on 8-bit CPUs such as the 6502 ... just not for the relatively-rare circumstances of streaming applications. Because the *total* size of the data is what matters, so you need to compare the size of the compressed data plus the decompressor to get the "real" compression ratio. Isn't that an argument for using puCrunch, aPlib & Exomizer? Did you implement optimal parsing? (or at least, lazy parsing with at least checking 4 bytes in advance) Most papers say that you can get a lot of extra compression ration with that (and it is what LZ4HC compressor does). Nope, not in my SWD encoding. I used the official LZ4 compressor for testing ... I have no idea if that has lazy encoding. Lazy encoding does sound like a good idea for *any* of the LZ-style compressors, and I've been meaning to try it out. I like the way you extract 4 bits for each match from the length byte, but I think this could work better, with the only restriction of adding one zero byte to signal end of stream if not already 0: lda <lz_nibble bne .ok LZ_GET_SRC beq .finish ok: tax lsr a lsr a lsr a lsr a sta <lz_nibble txa and #$0F OK, I've changed my LZSS source above to use that code, thanks! 
I knew that I was missing something when I wrote the nibble-handling. It looks like it saves 8-bytes in the code ... but it costs an extra 7-cycles-per-byte with the bit-shifting, so it's not free. I've also rearranged the LZSS code to show how to take advantage of branch-likelihood to shave more cycles off of the decompression, and that cost another 2-bytes, so the current code is actually down to 118 bytes (on the Atari). I think that a "flexible" LZ77 compression library, supporting different coding schemes and multiple dictionary support (for implementing last-frame and current-frame matching) could be a great project. Then, you could experiment with increasing offset lengths, implementing "address instead of offsets" (like my modified LZ4) and many other variations to search the best compression/speed for a given application. Yep, that's what I'd call "an exercise for the reader". I did that myself many years ago, that's where my SWD variants came from ... only to find out that Jørgen Ibsen had already come up with a better scheme! 1 Quote Link to comment Share on other sites More sharing options...
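Since lazy parsing has come up a couple of times above, here is the usual one-byte-lookahead version as a self-contained C sketch with a brute-force match search over an 8-bit window; token emission is reduced to printfs, and none of the compressors discussed in this thread are implied to work exactly this way:

/* Greedy vs. lazy parsing sketch for an LZSS-style compressor with an
   8-bit window: before taking a match at position i, peek at i+1 and
   emit a literal instead if the match there would be longer. */
#include <stdio.h>

#define WINDOW    256
#define MIN_MATCH 2
#define MAX_MATCH 17

static int find_match(const unsigned char *buf, long len, long pos)
{
    int best = 0;
    long start = (pos > WINDOW) ? pos - WINDOW : 0;
    for (long j = start; j < pos; j++) {
        int l = 0;
        while (l < MAX_MATCH && pos + l < len && buf[j + l] == buf[pos + l])
            l++;
        if (l > best) best = l;
    }
    return best;
}

void parse_lazy(const unsigned char *buf, long len)
{
    long i = 0;
    while (i < len) {
        int here = find_match(buf, len, i);
        int next = (i + 1 < len) ? find_match(buf, len, i + 1) : 0;
        if (here >= MIN_MATCH && here >= next) {
            printf("match  len=%d at %ld\n", here, i);    /* emit match token */
            i += here;
        } else {
            printf("literal $%02X at %ld\n", buf[i], i);  /* defer to the longer match at i+1 */
            i++;
        }
    }
}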
elmer Posted May 18, 2019 Share Posted May 18, 2019 What you really want to do is implementing a SAP player for your RTM tune, so perhaps you could concentrate on that problem first. You already discovered that each de-interleaved data compresses a lot, this means that each music channel is correlated with itself but independent on the others. This gives you a strong clue that this should be the method to use. Doh! ... I've only-just realized that rensoup is actually talking about *music* compression, silly me! Yep, that's an absolute swine to compress if you've only got the data in a register-contents-per-frame format like SAP. As dmsc just said ... the best thing for a format like that is to separate the channels and registers and then do whatever you can to find some repetition to remove. Having said which ... if you only need to decompress 9-bytes of data per frame to feed the POKEY registers, then decompression speed really isn't your problem IMHO. You're going to be more concerned with managing ring-buffers. Quote Link to comment Share on other sites More sharing options...