Harry Potter (Posted October 27, 2022): Hi! I'm sorry for SPAMming, but I really want to share one of my files. I have a program for cc65 that prints strings with embedded tokens and expands the tokens to full strings. The program is very small, so the tokenization is likely to help. It decreased the size of one of my text adventures' string data by about 13%. It also uses my AtaSimpleIO library to minimize overhead. It is located at "c65 additions - Manage /ui at SourceForge.net" and is called printtok_atari.c. Try it out!
Wrathchild (Posted October 27, 2022): Take a look at: https://xania.org/201406/elites-crazy-string-format
Harry Potter (Posted October 27, 2022): I just looked at your post, and it seems that my code is more efficient than his but doesn't include BPE. Also, I only have room for 32 tokens. On an Atari, I could use 0xA0 to 0xFF for the tokens, but I'm also coding for the CBM computers, on which cc65 seems to convert literals from ASCII instead of leaving them in PETSCII. If I don't need upper-case, I can use A-Z as tokens or BPE codes, but a text adventure I'm coding uses upper-case. On an Atari, I can add BPE and extend the codes as described above. On a CBM, I can add 26 tokens to the mix if upper-case is not used. Someone on the Denial Vic20 forum gave me the idea to use static Huffman codes to shorten literals. What else can I do to improve printtok()?
Wrathchild (Posted October 27, 2022): Your approach seems to imply that you intend to hand-code the strings, which would seem overcomplicated. You should really have a separate application that can read a file of text, e.g. one string per line, and encode it for you. For speed in decoding, it can prepend an offset and/or length table so that a string can be accessed quickly. If you think back to other compression techniques, you can move away from single tokens and use byte pairs, i.e. a token and an argument. This way you can have an encoder that searches the existing string table for already encoded text and employs a token that takes the string number as its argument.
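
To make the offset table and the "token plus argument" byte pair concrete, here is a minimal decoder sketch in C. Everything in it is an assumption made for illustration, not code from the thread: the escape byte `REF_ESC`, the names `str_data`, `str_off`, `str_append`, and `str_get`, and the sample strings. It also assumes the build-time encoder only emits references to lower-numbered strings, so the recursion always terminates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REF_ESC   0x01u  /* hypothetical escape byte: the next byte is a string number */
#define STR_COUNT 3u

/* What a build-time encoder might emit. Strings are not NUL-terminated,
   because a string number of 0 is a legal argument byte. */
static const uint8_t str_data[] =
    "the brass lantern"                 /* string 0 */
    "You pick up " "\x01" "\x00" "."    /* string 1: refers to string 0 */
    "You drop "    "\x01" "\x00" ".";   /* string 2: refers to string 0 */

/* Offset table with one extra entry, so lengths are implied. */
static const uint16_t str_off[STR_COUNT + 1] = { 0, 17, 32, 44 };

/* Append string `id` to dst at position n and return the new length.
   Output that does not fit is dropped; room is kept for a terminator. */
static size_t str_append(uint8_t id, char *dst, size_t cap, size_t n)
{
    uint16_t i = str_off[id];
    uint16_t end = str_off[id + 1];

    while (i < end) {
        uint8_t c = str_data[i++];
        if (c == REF_ESC && i < end) {
            /* Byte pair: expand the referenced string in place. */
            n = str_append(str_data[i++], dst, cap, n);
        } else if (n + 1 < cap) {
            dst[n++] = (char)c;
        }
    }
    return n;
}

/* Expand string `id` into dst (capacity cap > 0); returns the length. */
size_t str_get(uint8_t id, char *dst, size_t cap)
{
    size_t n;

    assert(id < STR_COUNT && cap > 0);
    n = str_append(id, dst, cap, 0);
    dst[n] = '\0';
    return n;
}
```

With a 40-byte buffer, `str_get(1, buf, sizeof buf)` would fill `buf` with "You pick up the brass lantern." while storing the noun phrase only once. On a 6502 the recursion would probably be replaced by a small explicit stack, but the data layout stays the same.
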
Harry Potter (Posted October 28, 2022): I thought about a byte pair to indicate a token, but I think my method is more efficient. I was thinking about creating code to compress the text automatically. If I can do that, I can add Huffman codes and Placement Offset to the technique. I'm not prepared to do this just yet. Is there anything else I can do to improve this function? BTW, if you want to know what Placement Offset is, just ask.
Wrathchild (Posted October 28, 2022): I don't see it as really practical as it stands. Approaches like the Elite one mentioned above, or even the one employed by the Infocom adventures, are better choices IMO; even using an existing packer tool with a common key would suffice. In terms of the file itself, the user would have to edit the 'tokens' array for their own purpose. In a 'module', the user would be better off passing their own array during initialisation, or it could instead be declared as extern. The comments reference the range 0x80 to 0x9F, whereas the code uses 0xA0 to 0xBF, so that needs to be corrected. You are exploiting the fact that CC65 sets char to unsigned char by default (this can be reversed via a command option), so having 'char i' and 'i>=0xA0' is non-portable. Maybe adopt stdint in your work, e.g. uint8_t. Remove redundant code.
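
A rough sketch of what those suggestions could look like together, assuming the 32-token range 0xA0 to 0xBF that the code is said to use. This is not the printtok_atari.c source: the names `tok_init` and `tok_expand`, and the choice to expand into a buffer rather than print directly, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed token range: the 32 codes 0xA0..0xBF mentioned in the thread. */
#define TOK_FIRST 0xA0u
#define TOK_COUNT 32u

/* Hypothetical module state: the caller's token table, set once at start-up. */
static const char *const *tok_table;

/* The caller passes its own table of TOK_COUNT entries instead of editing the module. */
void tok_init(const char *const *table)
{
    assert(table != NULL);
    tok_table = table;
}

/* Expand src into dst (capacity cap, terminator included).
   Returns the expanded length, or 0 if dst is too small. */
size_t tok_expand(const char *src, char *dst, size_t cap)
{
    /* Read through uint8_t so the >= 0xA0 test does not depend on
       whether plain char is signed on the target compiler. */
    const uint8_t *p = (const uint8_t *)src;
    size_t n = 0;

    assert(tok_table != NULL);
    if (cap == 0) return 0;
    for (; *p != 0; ++p) {
        uint8_t c = *p;
        if (c >= TOK_FIRST && c - TOK_FIRST < TOK_COUNT) {
            const char *t = tok_table[c - TOK_FIRST];
            size_t len = strlen(t);
            if (n + len >= cap) return 0;   /* keep room for the terminator */
            memcpy(dst + n, t, len);
            n += len;
        } else {
            if (n + 1 >= cap) return 0;
            dst[n++] = (char)c;
        }
    }
    dst[n] = '\0';
    return n;
}
```

With a table whose first two entries are "the " and "You ", the input "\xA1see \xA0key." would expand to "You see the key.". Note that a return value of 0 is ambiguous for an empty input string; a real module might report overflow separately.
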
Harry Potter (Posted October 28, 2022): Wrathchild, I thank you for your input. Where can I find the software to do the encoding for cc65? I want to encode for multiple systems. How does Infocom do the compression? I have to look at the Elite method again.
Wrathchild (Posted October 28, 2022): Google is your friend.
Harry Potter (Posted October 28, 2022): Found it! It uses 5 bits to encode each character. I can't use the technique as stated, as my text adventure requires both upper-case and lower-case. I could encode in 7 bits per character, though. Thank you.
Wrathchild (Posted October 28, 2022): What makes you think Infocom games don't have upper and lowercase text?!
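
For anyone following along, a simplified sketch of how a 5-bit scheme can still carry upper-case. As far as the published Z-machine text format goes (versions 3 and later), three 5-bit codes are packed into each 16-bit big-endian word, the top bit marks the last word, code 0 is a space, codes 6 to 31 are the letters a to z, and codes 4 and 5 shift the next character into an upper-case or punctuation alphabet. The function name `zdecode` is hypothetical, and the sketch deliberately leaves out abbreviations (codes 1 to 3) and the punctuation alphabet, printing '?' for them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode packed 5-bit text into dst (capacity cap > 0); returns the length.
   Simplified: only space, a-z and one-shot shifts are handled.
   Assumes the target character set keeps a-z and A-Z contiguous. */
size_t zdecode(const uint8_t *src, char *dst, size_t cap)
{
    size_t n = 0;
    int shift = 0;      /* 0 = lower-case, 1 = upper-case, 2 = punctuation */
    uint16_t w;
    int k;

    assert(cap > 0);
    do {
        w = (uint16_t)(((uint16_t)src[0] << 8) | src[1]);
        src += 2;
        for (k = 10; k >= 0; k -= 5) {
            uint8_t z = (uint8_t)((w >> k) & 0x1F);
            char ch;

            if (z == 4) { shift = 1; continue; }   /* next character is upper-case */
            if (z == 5) { shift = 2; continue; }   /* punctuation shift, also used as padding */
            if (z == 0)
                ch = ' ';
            else if (z >= 6 && shift == 0)
                ch = (char)('a' + (z - 6));
            else if (z >= 6 && shift == 1)
                ch = (char)('A' + (z - 6));
            else
                ch = '?';                          /* abbreviations and punctuation omitted */
            shift = 0;                             /* a shift applies to one character only */
            if (n + 1 < cap) dst[n++] = ch;
        }
    } while ((w & 0x8000u) == 0);                  /* top bit set: last word */
    dst[n] = '\0';
    return n;
}
```

Under this scheme a capital letter costs 10 bits (shift plus letter) and a lower-case letter or space costs 5, which is how mixed-case text still comes in well under 8 bits per character on average.
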
Harry Potter (Posted October 28, 2022): I just looked back at the docs. You are right. But I still need space for the tokens. I can use a special character to indicate that the next character is a token. Again, thank you. In the meantime, is there a way to improve printtok() as it stands?
Harry Potter (Posted October 28, 2022): On some codes, I could add an extra bit to select one of two characters. Less-used characters could be encoded with 6 bits, thereby increasing the number of characters that could be supported. What do you think?
Wrathchild (Posted October 28, 2022): I'd recommend trying to write your own encoder first.
Harry Potter (Posted November 5, 2022): I am currently working on a better version of the function. Right now, I have the lits compressor and the output writer started. I plan to use a modification of Infocom's technique together with tokenization. There's no room for BPE; LZ77 can't be used, as it would need to reference previous strings, and they are compressed; my Placement Offset Basic technique would cost more than it would save; and Huffman would require extra complexity. Does anybody here have other ideas on how I can compress text?