pirx Posted May 16, 2019 Share Posted May 16, 2019 Yes unfortunately for SAP-R, the window has to be 128-256 bytes for any decent compression, [...] I've just got this idea, what about using Huffman coding where alphabet is build from 72-bit letters (9 bytes). If the entropy is low (and it is in SAP Type-R), this would yeld well over 9 times compression. Just looked at the real data - 89KiB SAP contains 790 symbols with this frequency: 1,8,8,8,1,6,6,8,1,8,8,6,8,1,2,142,69,59,142,47,3,24,20,22,7,2,3,25,12,12,12,60,5,28,28,29,3,10,22,14,13,18,2,2,2,1,1,1,1,1,5,14,32,11,17,5,48,8,22,16,20,1,44,5,22,17,17,3,27,8,12,9,7,2,1,1,2,8,8,14,5,4,3,20,8,7,9,1,6,6,8,6,6,6,3,6,6,3,3,7,1,9,6,7,6,3,3,6,1,11,4,2,6,10,5,6,5,4,2,2,1,1,4,5,2,1,1,6,19,15,18,14,11,14,3,15,18,14,7,3,14,47,64,38,138,142,17,88,72,56,4,18,296,47,29,22,16,68,48,38,23,4,12,16,12,8,4,6,4,4,2,5,2,3,4,2,20,40,20,16,10,16,11,9,52,3,13,31,31,24,12,2,64,28,23,24,14,10,14,10,5,2,2,4,20,13,11,123,19,12,16,12,12,12,18,12,12,12,12,34,12,12,6,13,12,6,12,12,12,12,16,14,11,11,6,12,6,3,39,9,6,9,6,6,6,9,6,6,3,9,12,85,35,12,12,6,18,12,17,12,12,12,18,12,12,12,2,6,9,7,6,61,1,8,6,9,6,6,6,9,6,6,6,6,16,6,6,3,2,6,6,3,6,6,6,6,14,10,1,5,1,6,9,6,62,7,6,7,6,6,6,9,6,6,6,6,18,6,6,3,2,6,6,6,6,6,6,9,6,6,6,1,8,2,3,6,2,6,11,5,8,6,9,6,6,6,9,6,6,1,1,1,1,2,7,4,1,1,2,1,1,1,1,1,1,1,8,6,4,2,1,1,1,2,2,2,2,1,1,1,1,4,32,84,4,4,51,8,32,83,5,2,5,57,6,16,22,2,4,1,2,59,1,2,8,6,4,4,8,6,8,8,4,6,4,4,8,4,10,9,4,3,2,2,4,3,4,4,2,3,2,2,4,2,5,4,2,3,2,4,3,4,4,10,1,11,11,11,1,5,108,66,52,48,90,97,92,78,71,92,90,103,15,74,66,7,26,70,101,54,50,20,2,4,10,14,8,8,6,4,6,4,2,2,4,6,4,2,4,2,2,4,5,5,4,2,17,37,21,16,6,4,1,5,4,4,4,6,4,2,4,3,1,4,2,10,14,8,8,14,12,17,12,9,8,12,8,1,7,8,1,5,10,14,8,8,4,3,4,15,10,8,8,1,4,6,4,4,4,4,8,4,4,2,5,13,17,13,12,8,22,6,24,60,88,47,43,29,29,31,24,3,10,13,12,10,6,9,22,14,10,3,15,13,15,12,6,5,3,1,14,17,19,13,9,1,9,5,3,4,1,5,4,4,4,4,1,4,4,5,4,4,2,15,35,52,28,24,42,28,3,39,28,18,6,22,25,26,16,23,3,16,13,17,23,18,5,14,11,7,16,6,4,6,4,2,4,4,4,2,2,4,2,2,6,3,6,6,6,10,1,3,7,13,10,3,12,10,5,3,15,12,18,12,6,6,12,15,10,3,12,10,5,3,1,1,1,1,1,1,1,1,1,4,9,15,7,7,1,7,6,1,7,6,1,1,10,15,10,10,6,4,1,1,1,2,2,1,1,1,1,1,2,30,5,3,7,5,71,6,6,7,7,1,8,1,9,6,6,11,2,5,2,6,1,1,2,1,3,3,1,10,10,1,1,2,1,2,4,3,2,2,2,2,1,2,3,2,2,2,1,2,2,3,2,1,1,1,1,2,2,1,2,2,1,1,1,1,1,1 a lot of singletons, unfortunately, but still looks promising. The only issue will be the large symbol table (do not remember the correct name). Another idea: split SAP-R to 9 streams, delta them (good especially for AUDCn) and huffman, reverse when playing. Will result in small symbol tables and quite large decompression routine :] Quote Link to comment Share on other sites More sharing options...
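A minimal host-side sketch, in C, of counting those 72-bit symbols for your own tune — it assumes the SAP-R dump is just a raw sequence of 9-byte POKEY register frames with no header, which may need adjusting for real files, and it does no bounds checking on the symbol table:

/* Count how many distinct 9-byte frames a SAP-R dump contains, and how often
   each one occurs - i.e. the alphabet for the proposed Huffman coder.
   Sketch only: assumes a raw sequence of 9-byte frames, no overflow checks. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define FRAME 9

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s dump.sapr\n", argv[0]); return 1; }
    FILE *f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    unsigned char (*sym)[FRAME] = malloc(65536 * FRAME);
    long *count = calloc(65536, sizeof(long));
    int nsym = 0;
    unsigned char frame[FRAME];

    while (fread(frame, 1, FRAME, f) == FRAME) {
        int i;
        for (i = 0; i < nsym; i++)              /* linear search is fine for a   */
            if (!memcmp(sym[i], frame, FRAME))  /* few hundred distinct frames   */
                break;
        if (i == nsym) memcpy(sym[nsym++], frame, FRAME);
        count[i]++;
    }
    fclose(f);

    printf("%d distinct %d-byte symbols\n", nsym, FRAME);
    for (int i = 0; i < nsym; i++)
        printf("%ld%c", count[i], i + 1 == nsym ? '\n' : ',');
    return 0;
}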
rensoup Posted May 16, 2019 Author Share Posted May 16, 2019 (edited) Pure LZSS might be what you're looking for. The compression ratio is worse than LZ4, but the coding for the decompressor is absolutely trivial. It is what a lot of games in the late 1980s and early 1990s used before programmers tweaked the details of the algorithm to get even better compression, but at some cost in speed. Thanks, I found a 65c816 decoder for it ( https://forums.nesdev.com/viewtopic.php?f=12&t=5771) but its author says it requires 4KB of mem... that unfortunately seems like a no-go for me! It's a really simple concept, but, as you've found, you need to keep the decompressed data around in order to be referenced. This usually isn't a problem, but you do need to know that you've got to do that. Well that's why I was hoping for a modified RLE, because the non-repeated bytes are already stored decompressed in the source. So if some sequences are repeated, you could just store an offset and length. I'm just surprised that nobody has done that. I actually gave it a shot yesterday... it was a lot more annoying than I thought, but it seems to work. The gains vary from nothing to about 25% compared to regular RLE, and I get the feeling there's more that can be done. LZ4 was optimized for speed of compression/decompression for use in realtime on 32-bit CPUs ... I've yet to find a case where it actually makes sense to use it on 8-bit or 16-bit computers/consoles. Something like Jorgen Ibsen's aPlib offers better compression, with only a little extra cost in speed ... and if you really need the speed, then pure LZSS is hard to beat. At 300-600 bytes/frame I would agree, but dmsc's LZ4 decoder benchmark seems promising. Edited May 16, 2019 by rensoup Quote Link to comment Share on other sites More sharing options...
rensoup Posted May 16, 2019 Author Share Posted May 16, 2019 Did you benchmark LZO? It's supposed to be designed around decompression speed. Also, if the idea is just to blast this to the screen as you decompress, you could even add a special flag to the decompressor which says "Move to this address before continuing to decompress". Then you can just let the decompression loose on an entire run of data to do a frame update without having to write "same" data to parts of the screen. found this post: https://forums.nesdev.com/viewtopic.php?f=2&t=13671 I started with the assumption LZO would be best of the no-RAM algos, and indeed it was. Its speed was a surprise however, on desktops it is extremely fast, here it's over twice as slow as zlib. Being GPL it is not usable for many closed projects, though commercial licenses are available. ouch... Quote Link to comment Share on other sites More sharing options...
rensoup Posted May 16, 2019 Author Share Posted May 16, 2019 no. compressed data: 15kb, decompressed: 35147 bytes, so: 35147 / 55 = 639 bytes Oops, you're right. dmsc's version seems a lot faster though. Quote Link to comment Share on other sites More sharing options...
dmsc Posted May 16, 2019 Share Posted May 16, 2019 Hi! Well that's why I was hoping for a modified RLE, because the non-repeated bytes are already stored decompressed in the source. So if some sequences are repeated, you could just store an offset and length. I'm just surprised that nobody has done that. I actually gave it a shot yesterday... it was a lot more annoying than I thought, but it seems to work. The gains vary from nothing to about 25% compared to regular RLE, and I get the feeling there's more that can be done. LZ4 is exactly that: it stores offsets to old data instead of repeating it. LZSS is not used anymore because decompression is slower than LZ4: you have to do a bit shift for each literal (i.e. copied) byte, whereas in LZ4 you have a count of literal bytes beforehand. Also, in 6502 assembly, LZ4 decompression code is just shorter, because you reuse the code for copy-from-past to do the copy-from-stream. Remember that you need to compare the memory needs of the decoder plus the buffers plus the data structures when comparing the compression performance. This is why using Huffman coding is not very appealing: you need about 512 bytes of tables to hold the trees and codes, and those bytes could be better used as a buffer for LZ* compression. Quote Link to comment Share on other sites More sharing options...
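For readers unfamiliar with the LZ4 block format being discussed, a simplified C sketch of a block decoder makes the "count of literal bytes beforehand" point concrete: each sequence starts with a token byte holding the literal count in the high nibble and the match length minus 4 in the low nibble, followed by optional length-extension bytes and a 2-byte little-endian offset. End-of-block corner cases and all bounds checking are glossed over here:

/* Simplified LZ4 block decoder: literals are counted, not flagged bit-by-bit.
   Sketch only - no bounds checking, end-of-block rules glossed over. */
static void lz4_block_decode(const unsigned char *src, const unsigned char *src_end,
                             unsigned char *dst)
{
    while (src < src_end) {
        unsigned token = *src++;
        unsigned len = token >> 4;                 /* literal count */
        if (len == 15)                             /* optional extension bytes */
            { unsigned b; do { b = *src++; len += b; } while (b == 255); }
        while (len--) *dst++ = *src++;             /* copy literals in one run */

        if (src >= src_end) break;                 /* last sequence has no match */

        unsigned offset = src[0] | (src[1] << 8);  /* 2-byte little-endian offset */
        src += 2;
        len = (token & 15) + 4;                    /* match length, minimum 4 */
        if ((token & 15) == 15)
            { unsigned b; do { b = *src++; len += b; } while (b == 255); }
        const unsigned char *ref = dst - offset;   /* copy from already-written output */
        while (len--) *dst++ = *ref++;
    }
}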
rensoup Posted May 16, 2019 Author Share Posted May 16, 2019 (edited) In my code above I made an optimization: I replace the "copy distance" field with a "copy address" field (= position - distance), so the copy can be done directly. This also allows (by modifying the compressor) to use *any* place in the target memory as a source, so you could stream frames using the read buffer and the target screen, so you can send only changed bytes. Ok that was confusing... you're saying that instead of using the regular LZ4 encoder, I could use the LZ4 API to set the LZ4 dictionary to come from the previously compressed source data, instead of the decompressed destination? Something like: for every new chunk of 9 bytes to compress: - Set the dictionary to be the compressed source data, so the dictionary grows a bit more for every 9-byte chunk, which in theory improves compression every time. - Compress 9 bytes using the new dictionary? LZ4 is exactly that: it stores offsets to old data instead of repeating it. Yes, that bit I know, but these are offsets into the decompressed data; from what you're saying above I could change that to use offsets into the compressed source? Edited May 16, 2019 by rensoup Quote Link to comment Share on other sites More sharing options...
rensoup Posted May 16, 2019 Author Share Posted May 16, 2019 I've just got this idea, what about using Huffman coding where alphabet is build from 72-bit letters (9 bytes). If the entropy is low (and it is in SAP Type-R), this would yeld well over 9 times compression. [...] Another idea: split SAP-R to 9 streams, delta them (good especially for AUDCn) and huffman, reverse when playing. Will result in small symbol tables and quite large decompression routine :] Damn and I was hoping for an existing solution Quote Link to comment Share on other sites More sharing options...
dmsc Posted May 16, 2019 Share Posted May 16, 2019 Hi! Ok that was confusing... you're saying that instead of using the regular LZ4 encoder, I could use the LZ4 API to set the LZ4 dictionary to come from the previously compressed source data, instead of the decompressed destination? Something like: for every new chunk of 9 bytes to compress: - Set the dictionary to be the compressed source data, so the dictionary grows a bit more for every 9-byte chunk, which in theory improves compression every time. - Compress 9 bytes using the new dictionary? Yes, that bit I know, but these are offsets into the decompressed data; from what you're saying above I could change that to use offsets into the compressed source? Yes. In my source code, to simplify the decoder, I parse all the LZ4 blocks, replacing the offsets (that are stored as "bytes before the current position") with the actual address from which to copy the data. This is possible as I know all the addresses that the 6502 code will use. Also, if you want to compress blocks of only 9 bytes, the compression will not work very well (as the minimal block size in LZ4 is 3 bytes). Try to compress at least 8 frames (72 bytes) in each block. As I said, using 150 bytes for the code and then trying to spare bytes in the block is not helpful, as you *always* have buffers in memory (i.e., the disk-reading buffer, the 6502 stack, etc.). Quote Link to comment Share on other sites More sharing options...
dmsc Posted May 16, 2019 Share Posted May 16, 2019 Hi! In my source code, to simplify the decoder, I parse all the LZ4 blocks, replacing the offsets (that are stored as "bytes before the current position") with the actual address from which to copy the data. This is possible as I know all the addresses that the 6502 code will use. Also, if you want to compress blocks of only 9 bytes, the compression will not work very well (as the minimal block size in LZ4 is 3 bytes). Try to compress at least 8 frames (72 bytes) in each block. As I said, using 150 bytes for the code and then trying to spare bytes in the block is not helpful, as you *always* have buffers in memory (i.e., the disk-reading buffer, the 6502 stack, etc.). Also, another point. The LZ4 format specifies that the last block *must* have 5 literal bytes (an 8-byte block); this is to simplify writing decoders for 32-bit CPUs. In the 6502, the last block could have only one literal. You can modify the line

#define LASTLITERALS 5   /* see ../doc/lz4_Block_format.md#parsing-restrictions */

I set it to 1 and reduced the block sizes. Quote Link to comment Share on other sites More sharing options...
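On the host side, one hedged sketch of the "growing dictionary" idea dmsc describes, assuming the standard liblz4 streaming API (LZ4_compress_fast_continue) and an arbitrary 72-byte chunk size. It keeps the whole input in one buffer so earlier chunks stay valid as the dictionary; a real tool would also have to record each chunk's compressed size so the 6502 loader can find block boundaries:

/* Compress a SAP-R dump in small chunks, letting each chunk use all previously
   seen data as its dictionary. Sketch only - assumes the standard liblz4
   streaming API and an arbitrary CHUNK size of 8 frames. */
#include <stdio.h>
#include <lz4.h>

#define CHUNK 72   /* 8 frames of 9 bytes */

long compress_stream(const char *in, long in_len, FILE *out)
{
    LZ4_stream_t *ls = LZ4_createStream();
    char dst[LZ4_COMPRESSBOUND(CHUNK)];
    long total = 0;

    for (long pos = 0; pos < in_len; pos += CHUNK) {
        int n = (in_len - pos < CHUNK) ? (int)(in_len - pos) : CHUNK;
        int c = LZ4_compress_fast_continue(ls, in + pos, dst, n, sizeof(dst), 1);
        if (c <= 0) { total = -1; break; }
        fwrite(dst, 1, c, out);   /* a real tool would also store each chunk's size */
        total += c;
    }
    LZ4_freeStream(ls);
    return total;
}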
rensoup Posted May 17, 2019 Author Share Posted May 17, 2019 Hi! Yes. Well that's some weird stuff. Not sure it's worth it, let me PM you to explain... Quote Link to comment Share on other sites More sharing options...
elmer Posted May 17, 2019 Share Posted May 17, 2019 (edited) Thanks, I found a 65c816 decoder for it ( https://forums.nesdev.com/viewtopic.php?f=12&t=5771) but its author says it requires 4KB of mem... that unfortunately seems like a no-go for me! You just found the source for one particular decoder that uses a fixed ring-buffer in memory. You wouldn't need that on the Atari. The ring-buffer technique is useful when decompressing to an area that you can't randomly read from, such as the VRAM in a NES. The LZSS algorithm has no more overhead than LZ4. BTW, traditional LZSS uses a 16-bit value for each match, 12-bits for the match position and 4-bits for the match length. Various games changed this to 8-bits for match position and 4-bits for match length. That hurts compression ratio, but it can make the decompressor a bit faster, and it doesn't hurt you much if you're decompressing smallish amounts of data at a time. found this post: https://forums.nesdev.com/viewtopic.php?f=2&t=13671 ouch... That's another terrible example of decompression, because the OP is comparing one hand-optimized assembly-language decompressor to a bunch of decompressors compiled from C using CC65 ... which is a terrible idea! Well that's why I was hoping for a modified RLE, because the non-repeated bytes are already stored decompressed in the source. So if some sequences are repeated, you could just store an offset and length. I'm just surprised that nobody has done that. It sounds like you're talking about using an offset/length into the *compressed* data stream, rather than the *decompressed* data stream. Most people come up with that idea at some point, but when they try to implement it, they find that it doesn't work well in practice, and your compression ratio goes all-to-heck (compared to any of the LZ schemes). Damn and I was hoping for an existing solution This is the most important thing that you've said so far. If you really want to just take existing code, then dmsc has offered you a perfectly-usable and pretty-optimal solution. The key for any of the LZ variants is to get that copy-loop as fast as you can. dmsc's inner loop for that is about as good as you're going to get on the original NMOS 6502. That inner loop will work on *any* of the LZ variants. Beyond that, speedups tend to come from removing subroutine calls and inlining stuff. Edited May 17, 2019 by elmer 1 Quote Link to comment Share on other sites More sharing options...
elmer Posted May 17, 2019 Share Posted May 17, 2019 LZSS is not used anymore because decompression is slower than LZ4, because you have to do a bit shift for each literal (i.e. copied) byte, as opposed to LZ4 where you have a count of literal bytes beforehand. We'll have to agree to disagree on that. The LZSS bitshift, traditionally of a zero-page location, is much quicker than all of the setup that LZ4 has to do for a copy-literals-from-compressed-data. LZ4 will only start to outpace a well-written 6502 LZSS decompressor *if* the count of literals to copy at one time is large-enough. Here's a post that I made on the subject a few years ago, regarding the game data for a PC Engine translation (which, if you're not familiar with the console, runs a customized 7.16MHz 65C02 variant with a bunch of extra, and highly useful, new instructions) ... One of the "big ideas" in the currently-popular LZ4 compression scheme is that you can COPY multiple bytes at a time, rather than using a bit for each one to say that it's a COPY and not an LZSS match. This also allows LZ4 to guarantee that a COPY is followed by an LZSS match, which saves another bit. Unfortunately, if an LZSS match directly follows another LZSS match, then 4 bits are wasted to indicate a COPY length of zero. So I decided to look at the Xanadu 2 data and get some stats on how often an LZSS match is followed by another LZSS or a COPY. After one LZSS, number of times COPY is next ... 867,693 After one LZSS, number of times LZSS is next ... 1,290,909 I also decided to take a look at how often each COPY length crops up. COPY 1 occurs 291,422 times. COPY 2 occurs 152,959 times. COPY 3 occurs 101,741 times. COPY 4 occurs 70,629 times. COPY 5 occurs 48,244 times. COPY 6 occurs 37,597 times. COPY 7 occurs 27,232 times. COPY 8 occurs 23,353 times. COPY 9 occurs 16,682 times. COPY 10 occurs 16,159 times. COPY 11 occurs 12,960 times. COPY 12 occurs 11,062 times. COPY 13 occurs 8,580 times. COPY 14 occurs 8,237 times. COPY 15 occurs 5,806 times. COPY >15 occurs 40,045 times. Number Of Copies 872,708, Total Bytes Copied 3,882,406. So, over half of the copies were for 3 bytes-or-less. I think that you'll find that the reason that LZSS isn't used anymore on modern computers is more to do with its traditional 4096-byte window and 4-bit length count not matching modern data well, especially with things like 32-bit RGBA pixels and CPU instructions. LZ4 definitely gets better compression than LZSS ... but then virtually everything else beats LZ4, and sometimes with only a little extra cost in cycles. LZ4 also has better worst-case compression behavior than LZSS, but that is usually more of a practical concern in the real-time compression situations that LZ4 was designed for, than things like games where you have other ways to deal with uncompressable data. I still stand by what I said before ... but since it's just my opinion, you're welcome to feel differently. IMHO, for speed and simplicity, it's hard to beat LZSS on an 8-bit CPU like the 6502. If you really want good compression, then there are better options than LZ4, and you probably won't notice the cost in CPU cycles. Also, in 6502 assembly, LZ4 decompression code is just shorter, because you reuse the code for copy-from-past to do the copy-from-stream. Even if I agreed with you, which I don't, why-on-earth would a few extra bytes in the decompression code really make any difference? 
You're still talking about a routine that will quite-happily fit into either zero-page, or unused stack-space, if you're trying to decompress a whole program on load. ... And if you're doing that on an Atari, then using a better compression scheme (such as aPlib) would give you less data to load through the SIO, and save more loading time than the little extra CPU cost in the decompression code would take. BTW ... apart from our difference of opinion, which is mostly of a theoretical nature, rather than something that should have any effect on rensoup's decision over what to use, I've got to say that your LZ4 decompressor is *very* nicely written. Your "docopy" inner-loop is a beautiful example of using the X register for holding both the low-byte of the copy-count and the offset at the same time! :-) 1 Quote Link to comment Share on other sites More sharing options...
ivop Posted May 17, 2019 Share Posted May 17, 2019 Another idea: split SAP-R to 9 streams, delta them (good especially for AUDCn) and huffman, reverse when playing. Will result in small symbol tables and quite large decompression routine :] My idea has always been to split into 9 streams, delta, and RLE. Never got around to implementing it yet. I think RLE will be enough to shrink this kind of data enormously. Quote Link to comment Share on other sites More sharing options...
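As a rough illustration of that split/delta/RLE pipeline, here is a host-side C sketch; the (count, value) RLE format is an arbitrary choice, and the on-target player would need a matching decoder that accumulates the deltas back into register values:

/* Split a SAP-R dump into 9 per-register streams, delta each stream,
   then RLE it as (count, delta) pairs. Format is arbitrary - the 6502
   player would need a matching decoder. */
#include <stdio.h>

void delta_rle_register(const unsigned char *dump, long frames, int reg, FILE *out)
{
    unsigned char prev = 0;                      /* register value in the previous frame */
    long i = 0;
    while (i < frames) {
        unsigned char d = (unsigned char)(dump[i * 9 + reg] - prev);   /* delta for frame i */
        long run = 1;
        while (i + run < frames && run < 255 &&
               (unsigned char)(dump[(i + run) * 9 + reg] - dump[(i + run - 1) * 9 + reg]) == d)
            run++;
        fputc((int)run, out);                    /* run length 1..255 */
        fputc(d, out);                           /* repeated delta (0 is very common for AUDCx) */
        i += run;
        prev = dump[(i - 1) * 9 + reg];          /* last value actually covered */
    }
}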
flashjazzcat Posted May 17, 2019 Share Posted May 17, 2019 Not to pretend I have any experience with LZ compression, since I don't, but I achieved good results with RLE compression on data with a reasonable amount of repetition by compressing/decompressing on a row/column basis rather than as a continuous linear stream. Obviously this lends itself well to data which has conspicuous repeated runs in columns, such as a bit-mapped logo or title. But it lends itself to less obvious situations as well, such as 8x8 font data. Compressing a typical 1KB font as an 8x128 array yields enormously long runs of $00, which typically shaves a couple of hundred bytes off the length of the compressed data. Anyway: this is something I arrived at while messing around, but it may already be taken into account here. The decompressor remains wonderfully lightweight. 3 Quote Link to comment Share on other sites More sharing options...
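For illustration, the reordering described above is just a transpose before the RLE pass; a tiny C sketch, assuming the usual 1KB font of 128 characters with 8 bytes each:

/* Reorder a 1KB (128 chars x 8 bytes) font so that byte row 0 of every
   character comes first, then row 1, etc. Mostly-blank rows turn into
   long $00 runs, which a plain RLE pass then compresses very well. */
void transpose_font(const unsigned char font[1024], unsigned char out[1024])
{
    for (int row = 0; row < 8; row++)
        for (int chr = 0; chr < 128; chr++)
            out[row * 128 + chr] = font[chr * 8 + row];
}

The decompressor only has to reverse the indexing while it writes, so it stays as lightweight as a plain RLE decoder.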
elmer Posted May 17, 2019 Share Posted May 17, 2019 I hope this won't bore those of you reading this, but this thread seems like a good place to give a real-world example of the compression results when using different LZ-based compressors, so that folks can get a better idea of how small the improvement is that we're talking about for the more-advanced compressors. BTW, SWD is the compressor that I wrote to try to see how good a job I could do. The only difference between the SWD4 and SWD5 compressors is the switch from using the traditional 1-bit LZSS-style flag for literals, to using the LZ4-style count value for literals. It's a good indication of whether using the LZ4-style count is really much of an improvement. These results are from testing one game's code/data files on the PC Engine again. The results should be pretty-similar with Atari data.

Total uncompressed data size : 23,278,006
Total YsIV compression : 10,347,187 (LZSS with 4-bit count and 8-bit offset)
Total Anearth compression (no preload) : 9,706,570 (LZSS with 4-bit count and 12-bit offset)
Total Legend of Xanadu 1 compression : 8,937,897 (roughly equivalent to LZ4)
Total Legend of Xanadu 2 compression : 8,324,785
Total Elmer's SWD4 compression : 8,114,319 (with traditional 1-bit for literal/match)
Total Elmer's SWD5 compression : 8,035,016 (with LZ4-style count of literals)
Total Jørgen Ibsen's aPLib compression : 7,782,457
Total Pasi Ojala's puCrunch compression : 7,780,261
Total Magnus Lind's Exomizer : 7,377,469

Here's a direct comparison between the LZ4, SWD4&5 and Legend of Xanadu compressors, when testing the graphics files from this site ... http://www.brutaldeluxe.fr/products/crossdevtools/lz4/

LoX1 6,490 LoX2 5,844 Swd4 5,546 Swd5 5,556 LZ4 6,508 ANGELFISH.BIN
LoX1 23,379 LoX2 20,771 Swd4 20,617 Swd5 20,665 LZ4 23,520 ASTRONUT.BIN
LoX1 15,087 LoX2 14,027 Swd4 13,844 Swd5 13,863 LZ4 14,802 BEHEMOTH.BIN
LoX1 3,406 LoX2 2,519 Swd4 2,446 Swd5 2,449 LZ4 2,803 BIG.BIN
LoX1 9,329 LoX2 8,003 Swd4 7,933 Swd5 7,962 LZ4 8,865 BUTTERFLY.BIN
LoX1 7,039 LoX2 6,206 Swd4 6,102 Swd5 6,113 LZ4 6,654 CD.BIN
LoX1 18,368 LoX2 16,874 Swd4 16,335 Swd5 16,369 LZ4 18,876 CLOWN.BIN
LoX1 7,982 LoX2 7,215 Swd4 7,235 Swd5 7,260 LZ4 7,662 COGITO.BIN
LoX1 14,871 LoX2 12,748 Swd4 12,560 Swd5 12,581 LZ4 15,300 COTTAGE.BIN
LoX1 13,389 LoX2 11,873 Swd4 11,711 Swd5 11,733 LZ4 13,102 FIGHTERS.BIN
LoX1 13,431 LoX2 12,260 Swd4 11,936 Swd5 11,954 LZ4 13,220 FLOWER.BIN
LoX1 10,198 LoX2 9,029 Swd4 8,804 Swd5 8,813 LZ4 9,972 JAZZ.BIN
LoX1 14,834 LoX2 13,403 Swd4 13,182 Swd5 13,206 LZ4 14,810 KNIFE.BIN
LoX1 19,485 LoX2 17,932 Swd4 17,727 Swd5 17,730 LZ4 20,261 LORI.BIN
LoX1 8,724 LoX2 8,143 Swd4 7,893 Swd5 7,905 LZ4 8,643 MAX.BIN
LoX1 17,921 LoX2 15,114 Swd4 14,798 Swd5 14,803 LZ4 18,474 OWL.BIN
LoX1 20,625 LoX2 18,564 Swd4 18,285 Swd5 18,307 LZ4 20,595 RED.DRAGON.BIN
LoX1 15,496 LoX2 13,993 Swd4 13,483 Swd5 13,500 LZ4 16,306 TAJ.BIN
LoX1 12,907 LoX2 11,205 Swd4 11,128 Swd5 11,164 LZ4 12,551 TUT.BIN

My conclusion from these tests is that the improvement between LZ4 and LZSS (even when using a small 8-bit window offset) just isn't large enough to make LZ4 compelling *if* you can write an LZSS decompressor that is faster than your LZ4 decompressor. Similarly, if you're really looking for compression, then LZ4 just doesn't make much sense against something like Jørgen Ibsen's aPLib, which is basically just yet-another LZ compressor, but with super-optimized encoding of the copy lengths and offsets. 
1 Quote Link to comment Share on other sites More sharing options...
elmer Posted May 17, 2019 Share Posted May 17, 2019 (edited) Also, in 6502 assembly, LZ4 decompression code is just shorter, because you reuse the code for copy-from-past to do the copy-from-stream. Here you go, 118-bytes for an LZSS decompressor (4-bit count & 8-bit window), with full inlining for speed (i.e. no jsr/rts overhead) ... and the copy-from-past code *is* reused to save a few bytes. <EDIT> Modified from original to include dmsc's code for reading the nibbles, and also to include an alternate HuC6280 version. <EDIT2> Improved the nibble extraction, cutting 14-cycles-per-byte off the cost with no more bytes used. This is DONE!

lz_decompress:  ldx #0                  ; Initialize empty nibble.
                stx <lz_nibble
                ldy #0                  ; Initialize source index.
                sec
.load_command:  lda [lz_srcptr],y       ; Reload an empty bit-buffer.
                iny
                bne .skip1
                inc <lz_srcptr + 1
.skip1:         ror a
                sta <lz_bitbuf
                bne .got_command        ; Always true.
.next_command:  lsr <lz_bitbuf          ; Get next command bit.
                beq .load_command       ; Is the bit-buffer empty?
.got_command:   lda [lz_srcptr],y       ; Get literal/offset byte.
                iny
                beq .next_page
                bcc .match              ; CS=literal, CC=match.
.literal:       dex                     ; X == 0 from last copy loop.
                inc <lz_dstptr + 0      ; Write the byte directly to
                beq .write_byte         ; the output.
                dec <lz_dstptr + 1
                bcs .write_byte         ; Always true.
.next_page:     inc <lz_srcptr + 1
                bcs .literal            ; CS=literal, CC=match.
.match:         adc <lz_dstptr + 0      ; Window range $FFFF..$FF00.
                sta <lz_winptr + 0
                lda <lz_dstptr + 1
                adc #$FF
                sta <lz_winptr + 1      ; Window never in ZP, so NZ.
                lda <lz_nibble          ; Is there a nibble waiting?
                bne .got_nibble         ; X == 0 from last copy loop.
                lda [lz_srcptr],y
                beq .finished           ; Value = 0 == Finished.
                iny
                bne .skip2
                inc <lz_srcptr + 1
.skip2:         tax
                lsr a
                lsr a
                lsr a
                lsr a
.got_nibble:    stx <lz_nibble          ; Next nibble.
                and #$0F                ; Current nibble.
                tax
                sec                     ; Value 1..15 = Count 2..16.
                adc <lz_winptr + 0
                sta <lz_winptr + 0
                bcs .skip3
                dec <lz_winptr + 1
.skip3:         txa
                sec                     ; Value 1..15 = Count 2..16.
                adc <lz_dstptr + 0
                sta <lz_dstptr + 0
                bcs .skip4
                dec <lz_dstptr + 1
.skip4:         txa                     ; Value 1..15 = Offset $FE..$F0.
                eor #$FF
                tax
lz_winptr       = * + 1
.copy_loop:     lda $FFFF,x
lz_dstptr       = * + 1
.write_byte:    sta $FFFF,x
                inx
                bne .copy_loop
                inc <lz_dstptr + 1
                jmp .next_command

.finished:      rts

lz_srcptr       dw 0
lz_bitbuf       db 0
lz_nibble       db 0

Edited May 20, 2019 by elmer 2 Quote Link to comment Share on other sites More sharing options...
elmer Posted May 17, 2019 Share Posted May 17, 2019 Anyway: this is something I arrived at while messing around, but it may already be taken into account here. The decompressor remains wonderfully lightweight. This is excellent advice ... rearranging data to improve compression is a time-honored technique! It is basically the same idea as dmsc very-sensibly suggested earlier on ... there is absolutely nothing that *requires* the LZ4/LZSS window-onto-past-history to point into the data that you're decompressing right now, it can just-as-easily point into a previous frame's data if you're streaming. That's not how compressors are usually set up, but it's an easy modification to make if you've got the source-code and a willingness to modify and compile your own tools. 1 Quote Link to comment Share on other sites More sharing options...
dmsc Posted May 18, 2019 Share Posted May 18, 2019 Hi! It's great to have this kind of informed discussion here. We'll have to agree to disagree on that. That's good! The LZSS bitshift, traditionally of a zero-page location, is much quicker than all of the setup that LZ4 has to do for a copy-literals-from-compressed-data. LZ4 will only start to outpace a well-written 6502 LZSS decompressor *if* the count of literals to copy at one time is large-enough. Yes, but I think that copying more than 3 bytes already is faster in my LZ4 implementation, making most literal copies faster. Here's a post that I made on the subject a few years ago, regarding the game data for a PC Engine translation (which, if you're not familiar with the console, runs a customized 7.16MHz 65C02 variant with a bunch of extra, and highly useful, new instructions) ... I think that you'll find that the reason that LZSS isn't used anymore on modern computers is more to do with its traditional 4096-byte window and 4-bit length count not matching modern data well, especially with things like 32-bit RGBA pixels and CPU instructions. LZ4 definitely gets better compression than LZSS ... but then virtually everything else beats LZ4, and sometimes with only a little extra cost in cycles. LZ4 also has better worst-case compression behavior than LZSS, but that is usually more of a practical concern in the real-time compression situations that LZ4 was designed for, than things like games where you have other ways to deal with uncompressable data. I still stand by what I said before ... but since it's just my opinion, you're welcome to feel differently. I think that a common ground is:
- If your window size is small, LZSS will provide better compression, as you can use fewer bits for the encoding of small matches.
- If the window size is big and the data is not too repetitive, LZ4 will provide better compression for a comparable speed.
IMHO, for speed and simplicity, it's hard to beat LZSS on an 8-bit CPU like the 6502. I think that simplicity is about the same for LZ4 and LZSS. Also, you could include other LZ77 formats like LZO (with an encoding a little more complicated than LZ4). If you really want good compression, then there are better options than LZ4, and you probably won't notice the cost in CPU cycles. I agree, but for a streaming application, each cycle counts. Also, I think that for any increase in compression ratio you must use Huffman or other variable length coding, and this kills your RAM usage and speed in the 6502. Even if I agreed with you, which I don't, why-on-earth would a few extra bytes in the decompression code really make any difference? You're still talking about a routine that will quite-happily fit into either zero-page, or unused stack-space, if you're trying to decompress a whole program on load. ... And if you're doing that on an Atari, then using a better compression scheme (such as aPlib) would give you less data to load through the SIO, and save more loading time than the little extra CPU cost in the decompression code would take. Because the *total* size of the data is what matters: you need to compare the size of the compressed data plus the decompressor to get the "real" compression ratio. BTW ... apart from our difference of opinion, which is mostly of a theoretical nature, rather than something that should have any effect on rensoup's decision over what to use, I've got to say that your LZ4 decompressor is *very* nicely written. 
Your "docopy" inner-loop is a beautiful example of using the X register for holding both the low-byte of the copy-count and the offset at the same time! :-) Thanks! If you look at my FastBasic sources you can see that I use the same technique to implement the MOVE statement: https://github.com/dmsc/fastbasic/blob/master/src/interp/move.asm Much more difficult is the complement "backward move", as you can only copy 255 bytes in the inner loop: https://github.com/dmsc/fastbasic/blob/master/src/interp/nmove.asm I hope this won't bore those of you reading this, but this thread seems like a good place to give a real-world example of the compression results when using different LZ-based compressors, so that folks can get a better idea of how small the improvement is that we're talking about for the more-advanced compressors. BTW, SWD is the compressor that I wrote to try to see how good a job I could do. The only difference between the SWD4 and SWD5 compressors is the switch from using the traditional 1-bit LZSS-style flag for literals, to using the LZ4-style count value for literals. It's a good indication of whether using the LZ4-style count is really much of an improvement. Did you implement optimal parsing? (Or at least lazy parsing, checking at least 4 bytes in advance?) Most papers say that you can get a lot of extra compression ratio with that (and it is what the LZ4HC compressor does). These results are from testing one game's code/data files on the PC Engine again. The results should be pretty-similar with Atari data. [...] Here's a direct comparison between the LZ4, SWD4&5 and Legend of Xanadu compressors, when testing the graphics files from this site ... 
http://www.brutaldeluxe.fr/products/crossdevtools/lz4/ [...] My conclusion from these tests is that the improvement between LZ4 and LZSS (even when using a small 8-bit window offset) just isn't large enough to make LZ4 compelling *if* you can write an LZSS decompressor that is faster than your LZ4 decompressor. Similarly, if you're really looking for compression, then LZ4 just doesn't make much sense against something like Jørgen Ibsen's aPLib, which is basically just yet-another LZ compressor, but with super-optimized encoding of the copy lengths and offsets. Well, I agree, but I *do* think that LZ4 can be faster, especially for streaming images like my example. This is because you can copy large parts of one frame to the next. Here you go, 124-bytes for an LZSS decompressor (4-bit count & 8-bit window), with full inlining for speed (i.e. no jsr/rts overhead) ... and the copy-from-past code *is* reused to save a few bytes. LZ_GET_SRC is a macro which could be changed to a JSR if someone *really* wanted to save a few more bytes. Very good code! I like the way you extract 4 bits for each match from the length byte, but I think this could work better, with the only restriction of adding one zero byte to signal end of stream if not already 0:

        lda <lz_nibble
        bne .ok
        LZ_GET_SRC
        beq .finish
.ok:    tax
        lsr a
        lsr a
        lsr a
        lsr a
        sta <lz_nibble
        txa
        and #$0F

This should be 9 bytes less (if I counted right). I think that a "flexible" LZ77 compression library, supporting different coding schemes and multiple dictionary support (for implementing last-frame and current-frame matching) could be a great project. Then, you could experiment with increasing offset lengths, implementing "address instead of offsets" (like my modified LZ4) and many other variations to search the best compression/speed for a given application. Have fun! 1 Quote Link to comment Share on other sites More sharing options...
Heaven/TQA Posted May 18, 2019 Share Posted May 18, 2019 Ok, noob question: for 65816 SNES ROM depacking, what would you recommend? 2 Quote Link to comment Share on other sites More sharing options...
popmilo Posted May 18, 2019 Share Posted May 18, 2019 Guys, you are doing a great job on this topic! Thanks for all this info! Noob question 2 - what would you recommend for compression of sound samples? 2 Quote Link to comment Share on other sites More sharing options...
Irgendwer Posted May 18, 2019 Share Posted May 18, 2019 (edited) - what would you recommend for compression of sound samples? For the typical low-fi samples on "our" machines: quantized delta encoding. That exactly halves the data size. Edit: That said, you have to go for at least 15kHz replay to cope with the aliasing effects. If the replay frequency is lower, going for 2-bit raw data isn't that bad. Edit 2: Of course it depends on whether you like to depack the data on-the-fly while replaying or are able to inflate the data before playing. Both principles above are for real-time preparation of the data, saving RAM and not only floppy space. Edited May 18, 2019 by Irgendwer 4 Quote Link to comment Share on other sites More sharing options...
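A sketch of one possible 4-bit quantized-delta scheme in C, just to make the "halving" concrete — the 16-entry delta table and the clamping are arbitrary choices here, not necessarily what Irgendwer uses:

/* Pack 8-bit unsigned samples as 4-bit quantized deltas, two per byte:
   the output is exactly half the size. The 16-entry delta table is an
   arbitrary example, not a tuned one. */
static const int qdelta[16] = { 0, 1, 2, 4, 8, 16, 32, 64,
                                -1, -2, -4, -8, -16, -32, -64, -128 };

static int best_code(int target, int prev)
{
    int best = 0, berr = 1 << 30;
    for (int c = 0; c < 16; c++) {
        int v = prev + qdelta[c];
        if (v < 0) v = 0; else if (v > 255) v = 255;
        int e = (v - target) * (v - target);
        if (e < berr) { berr = e; best = c; }
    }
    return best;
}

void encode_qdelta(const unsigned char *in, long n, unsigned char *out)
{
    int prev = 128;                       /* assume samples centred on $80 */
    for (long i = 0; i + 1 < n; i += 2) {
        int c0 = best_code(in[i], prev);
        prev += qdelta[c0]; if (prev < 0) prev = 0; else if (prev > 255) prev = 255;
        int c1 = best_code(in[i + 1], prev);
        prev += qdelta[c1]; if (prev < 0) prev = 0; else if (prev > 255) prev = 255;
        out[i / 2] = (unsigned char)((c0 << 4) | c1);
    }
}

The replay side mirrors this with the same table: unpack two nibbles per byte and add the looked-up delta to the previous sample.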
rensoup Posted May 18, 2019 Author Share Posted May 18, 2019 (edited) My conclusion from these tests is that the improvement between LZ4 and LZSS (even when using a small 8-bit window offset) just isn't large enough to make LZ4 compelling *if* you can write an LZSS decompressor that is faster than your LZ4 decompressor. Well I got curious, so I took my 9 deinterleaved raw SAP-R channels and compressed them with LZSS v0.01, and the total was about 4KB vs 2KB for LZ4's best setting. That's just one very tiny set of data, and perhaps that wasn't the best LZSS compressor... Here you go, 124-bytes for an LZSS decompressor (4-bit count & 8-bit window), with full inlining for speed (i.e. no jsr/rts overhead) ... and the copy-from-past code *is* reused to save a few bytes. LZ_GET_SRC is a macro which could be changed to a JSR if someone *really* wanted to save a few more bytes. Well thanks for that! That's not how compressors are usually set up, but it's an easy modification to make if you've got the source-code and a willingness to modify and compile your own tools. The obvious reason for that would be that the compression ratio may be somewhat worse, I guess? Well, call me lazy but I've not used C in years (I'm all C# these days) and I don't even think I installed it in my VS. Edited May 18, 2019 by rensoup Quote Link to comment Share on other sites More sharing options...
dmsc Posted May 18, 2019 Share Posted May 18, 2019 Hi! Replying to a private message from @elmer, with his permission; I simplified the code a little from the response I gave to his PM. I've got a 35KB SAP-R, it's one of those fancy RMTs that use 40% of the frame for no good reason, seems that particular song doesn't use all the effects though. RMT size is 3.5KB. Btw, I know pretty much nothing about RMT, I just used the standard replay routine on that tune and SAP-Rd it. Compressed with LZ4 best setting it takes 9.5KB; deinterleaved into 9 files and LZ4 on each, total about 2KB (less than the original!). Obviously this is the way to go. ..... Instead of running the decoder 9 times per frame to output 1 byte for each channel, I just output 9 bytes of 1 channel then next frame 9 bytes from the next channel, so it's less costly to set up the decoder. I use a double 9x9 buffer so it plays from one 9x9 buffer while I fill up the other one over 9 frames. (At startup I just output all 9 bytes from all 9 buffers of course) ..... For your method, you suggested that I compress at least 72-byte chunks, but is it relevant in this case? Couldn't I compress bigger chunks like 512 bytes, as long as I modify the decoder to only output 9 bytes at a time and use that double buffer scheme (it would hurt speed a little of course)... First, replying to your last question: it is better to decompress a whole block (for example, 256 bytes) in one run, and then use a few bytes on each frame. This will make the code simpler and faster. But...! What you really want to do is implement a SAP player for your RMT tune, so perhaps you could concentrate on that problem first. You already discovered that the de-interleaved data compresses a lot; this means that each music channel is correlated with itself but independent of the others. This gives you a strong clue that this should be the method to use. As @elmer said, you can use LZSS for its simplicity. For example, you can start by simply coding each channel using RLE (like you proposed before), using code like this:

; Decode one "frame" of data (one byte for each POKEY register)
sap_loop:
        ldx #8          ; Loop through all "streams", one for each POKEY register
chn_loop:
        lsr bit_data    ; Get next bit
        bne got_bit
        jsr get_byte    ; Not enough bits, refill!
        ror a           ; Extract a new bit and add a 1 at the high bit (from C set above)
        sta bit_data
got_bit:
        bcc skip        ; Bit = 0 (CC) is "repeat last byte again", bit = 1 (CS) is "get a new byte"
        jsr get_byte
        sta POKEY, x
skip:
        dex
        bpl chn_loop
        ; continue.....

The coding uses one byte (bit_data) to hold bits that are 1 if the POKEY register changes or 0 if it keeps its value, and simply reads bytes from the stream when new bits or bytes are needed. Note that the *coded* bytes are interleaved; this makes the routine much simpler, as you read bytes using only one pointer. So a stream (in hex) like:

$FF XX YY ZZ MM NN QQ RR SS $80 TT $00 $01 UU $00 ....

This reads as: 8 bits 1, so read 8 values into POKEY registers 0 to 7; then one bit 1 and 7 bits 0, read 1 value to POKEY register 8, skip registers 0 to 6; then 8 bits 0, skip registers 7, 8 and 0-5; then 7 bits 0 and one bit 1, skip registers 6, 7, 8, 0, 1, 2 and 3, read one byte UU to register 4, and so on. So, this is equivalent to the data:

XX YY ZZ MM NN QQ RR SS TT XX YY ZZ MM NN QQ RR SS TT XX YY ZZ MM NN QQ RR SS TT XX YY ZZ MM UU QQ RR SS TT XX ....

Hope the above example makes it clearer. 
From that routine, you can add simple LZ77 matching, for example with distances from 0 to 15 and lengths from 2 to 17 (the 4 higher bits are the length, the lower 4 the position). The buffer is 9 * 16 = 144 bytes, as there is a 16-byte buffer for each "channel" (POKEY register). The code below is completely untested; I think that @elmer could verify and perhaps optimize it a little:

sap_loop:
        ldx #8            ; Loop through all "channels", one for each POKEY register
chn_loop:
        txa               ; Get channel buffer address, used to store old channel data
        asl
        asl
        asl
        asl
        sta cur_chan
        lda chn_copy, x   ; Get status of this stream, if > 0 we are copying bytes
        beq next_bit      ; No, we can get a new frame
copy_byte:
        dec chn_copy, x   ; Decrease match length, increase match position
        lda chn_pos, x
        and #$0F
        sec
        adc cur_chan
        sta chn_pos, x
        tay               ; Now, read old data, jump to data store
        lda buffer, y
        jmp store
next_bit:                 ; We are decoding a new match/literal
        lsr bit_data      ; Get next bit
        bne got_bit
        jsr get_byte      ; Not enough bits, refill!
        ror a             ; Extract a new bit and add a 1 at the high bit (from C set above)
        sta bit_data
got_bit:
        jsr get_byte      ; Always read a byte, it could mean "match size/offset" or "literal byte"
        bcc store         ; Bit = 0 is "literal", bit = 1 is "match"
        sta chn_pos, x    ; Store "position" (no need to AND #$0F, it will be done above)
        adc #$1F          ; Adds 2 to match length (C = 1 from above)
        lsr
        lsr
        lsr
        lsr
        sta chn_copy, x   ; Store in "copy length"
        bne copy_byte     ; go to copy first byte
store:
        sta POKEY, x
        lda cur_pos       ; Store into buffer, get current position + channel into Y
        adc cur_chan
        tay
        sta buffer, y
        dex
        bpl chn_loop      ; Next channel
        ; continue.....

1 Quote Link to comment Share on other sites More sharing options...
elmer Posted May 18, 2019 Share Posted May 18, 2019 It's great to have this kind of informed discussion here Yep, it's fun to find an opportunity to talk about compression techniques, even if we draw different conclusions from our experiences. Yes, but I think that copying more than 3 bytes already is faster in my LZ4 implementation, making most literal copies faster. Absolutely! The question then becomes ... do you get enough copies of more-than-3-bytes to make up for loss in speed on 3-or-less? This question becomes even more interesting when you're programming for other 6502-variants, and have some extra CPU instructions to play with. Anyway ... my personal testing showed that copies of 3-or-less were more common than copies of more-than-3. It all depends upon the actual data that we're compressing. I think that a common ground is: - If your window size is small, LZSS will provide better compression, as you can use less bits for the encoding of small matches. - If the window size is big and the data is not too repetitive, LZ4 will provide better compression for a comparable speed. Yep, that's about right. In my experience on 8-bit and 16-bit platforms, from the POV of compression-over-speed, you really get a benefit from finding an algorithm that can encode 2-byte matches efficiently. Whether you then use the traditional 1-bit-for-copy-or-match vs the count-of-bytes-to-copy doesn't seem to matter much to the compression ratio in most data. I agree, but for a streaming application, each cycle counts. Which is where we get back to our slightly different opinions on which algorithm is faster in *typical* real-world cases (on 8-bit and 16-bit machines)! Really ... the only answer is for any potential user to try both and see what the difference is with *their* data. Also, I think that for any increase in compression ratio you must use Huffman or other variable length coding, and this kills your ram usage and speed in the 6502. Going with Huffman, as in gzip, will certainly get you better compression ... but the CPU and space cost definitely doesn't work on a 6502. Other variable-length encoding schemes (which LZ4 actually is with its variable-length-count) can definitely provide good results. Just look at where puCrunch, aPlib, Exomizer, and my own SWD compressors fall in comparison to LZ4. All of those compressors were written for use on 8-bit CPUs such as the 6502 ... just not for the relatively-rare circumstances of streaming applications. Because the *total* size of the data is what matters, so you need to compare the size of the compressed data plus the decompressor to get the "real" compression ratio. Isn't that an argument for using puCrunch, aPlib & Exomizer? Did you implement optimal parsing? (or at least, lazy parsing with at least checking 4 bytes in advance) Most papers say that you can get a lot of extra compression ration with that (and it is what LZ4HC compressor does). Nope, not in my SWD encoding. I used the official LZ4 compressor for testing ... I have no idea if that has lazy encoding. Lazy encoding does sound like a good idea for *any* of the LZ-style compressors, and I've been meaning to try it out. I like the way you extract 4 bits for each match from the length byte, but I think this could work better, with the only restriction of adding one zero byte to signal end of stream if not already 0: lda <lz_nibble bne .ok LZ_GET_SRC beq .finish ok: tax lsr a lsr a lsr a lsr a sta <lz_nibble txa and #$0F OK, I've changed my LZSS source above to use that code, thanks! 
I knew that I was missing something when I wrote the nibble-handling. It looks like it saves 8-bytes in the code ... but it costs an extra 7-cycles-per-byte with the bit-shifting, so it's not free. I've also rearranged the LZSS code to show how to take advantage of branch-likelihood to shave more cycles off of the decompression, and that cost another 2-bytes, so the current code is actually down to 118 bytes (on the Atari). I think that a "flexible" LZ77 compression library, supporting different coding schemes and multiple dictionary support (for implementing last-frame and current-frame matching) could be a great project. Then, you could experiment with increasing offset lengths, implementing "address instead of offsets" (like my modified LZ4) and many other variations to search the best compression/speed for a given application. Yep, that's what I'd call "an exercise for the reader". I did that myself many years ago, that's where my SWD variants came from ... only to find out that Jørgen Ibsen had already come up with a better scheme! 1 Quote Link to comment Share on other sites More sharing options...
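Since lazy parsing has come up a couple of times above, here is the usual one-byte-lookahead version as a self-contained C sketch with a brute-force match search over an 8-bit window; token emission is reduced to printfs, and none of the compressors discussed in this thread are implied to work exactly this way:

/* Greedy vs. lazy parsing sketch for an LZSS-style compressor with an
   8-bit window: before taking a match at position i, peek at i+1 and
   emit a literal instead if the match there would be longer. */
#include <stdio.h>

#define WINDOW    256
#define MIN_MATCH 2
#define MAX_MATCH 17

static int find_match(const unsigned char *buf, long len, long pos)
{
    int best = 0;
    long start = (pos > WINDOW) ? pos - WINDOW : 0;
    for (long j = start; j < pos; j++) {
        int l = 0;
        while (l < MAX_MATCH && pos + l < len && buf[j + l] == buf[pos + l])
            l++;
        if (l > best) best = l;
    }
    return best;
}

void parse_lazy(const unsigned char *buf, long len)
{
    long i = 0;
    while (i < len) {
        int here = find_match(buf, len, i);
        int next = (i + 1 < len) ? find_match(buf, len, i + 1) : 0;
        if (here >= MIN_MATCH && here >= next) {
            printf("match  len=%d at %ld\n", here, i);    /* emit match token */
            i += here;
        } else {
            printf("literal $%02X at %ld\n", buf[i], i);  /* defer to the longer match at i+1 */
            i++;
        }
    }
}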
elmer Posted May 18, 2019 Share Posted May 18, 2019 What you really want to do is implementing a SAP player for your RTM tune, so perhaps you could concentrate on that problem first. You already discovered that each de-interleaved data compresses a lot, this means that each music channel is correlated with itself but independent on the others. This gives you a strong clue that this should be the method to use. Doh! ... I've only-just realized that rensoup is actually talking about *music* compression, silly me! Yep, that's an absolute swine to compress if you've only got the data in a register-contents-per-frame format like SAP. As dmsc just said ... the best thing for a format like that is to separate the channels and registers and then do whatever you can to find some repetition to remove. Having said which ... if you only need to decompress 9-bytes of data per frame to feed the POKEY registers, then decompression speed really isn't your problem IMHO. You're going to be more concerned with managing ring-buffers. Quote Link to comment Share on other sites More sharing options...