cc65 users: printtok_atari.c

Harry Potter · October 27, 2022

Hi! I'm sorry for spamming, but I really want to share one of my files. I have a program for cc65 that prints strings with embedded tokens and expands the tokens to full strings. The program is very small, so the tokenization is likely to save more space than the code costs. It decreased the size of one of my text adventures' string data by about 13%. It also uses my AtaSimpleIO library to minimize overhead. It is located at c65 additions - Manage /ui at SourceForge.net and is called printtok_atari.c. Try it out!

Wrathchild · October 27, 2022

Take a look at: https://xania.org/201406/elites-crazy-string-format

Harry Potter · October 27, 2022

I just looked at your post, and it seems that my code is more efficient than his but doesn't include BPE. Also, I only have room for 32 tokens. On an Atari, I could use 0xA0 to 0xFF for the tokens, but I'm also coding for the CBM computers, on which cc65 seems to convert literals from ASCII instead of leaving them in PETSCII. If I don't need upper-case, I can use A-Z as tokens or BPE codes, but a text adventure I'm coding uses upper-case. On an Atari, I can add BPE and extend the token codes as described above. On a CBM, I can add 26 tokens to the mix if upper-case is not used. Someone on the Denial Vic20 forum gave me the idea to use static Huffman codes to shorten literals. What else can I do to improve printtok()?

Wrathchild · October 27, 2022

Your approach seems to imply that you intend to hand-code the strings, which would seem overcomplicated.

You should really have a separate application that can input a file of text, e.g. per line, and encode them for you.
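As a rough idea of what such an encoder could do, here is a minimal sketch that greedily replaces the longest matching token at each position with a single byte. The function name `encode_tokens`, the `TOK_FIRST` value of 0xA0, and the caller-supplied token list are assumptions for illustration only, not taken from printtok_atari.c.

```c
#include <stddef.h>
#include <string.h>

/* Assumed: token number k is stored as the single byte TOK_FIRST + k. */
#define TOK_FIRST 0xA0u

/* Encode `src` into `dst` (capacity `cap`, including the terminating 0).
   At each position the longest matching token wins; anything else is
   copied through unchanged. Returns the encoded length. */
size_t encode_tokens(const char *src, const char *const *tokens,
                     size_t ntok, unsigned char *dst, size_t cap)
{
    size_t n = 0, i, best, best_len, len;

    if (cap == 0) return 0;
    while (*src != '\0' && n + 1 < cap) {
        best = ntok;
        best_len = 0;
        for (i = 0; i < ntok; ++i) {
            len = strlen(tokens[i]);
            if (len > best_len && strncmp(src, tokens[i], len) == 0) {
                best = i;
                best_len = len;
            }
        }
        if (best < ntok) {
            dst[n++] = (unsigned char)(TOK_FIRST + best);
            src += best_len;
        } else {
            dst[n++] = (unsigned char)*src++;
        }
    }
    dst[n] = 0;
    return n;
}
```

A tool like this would run on the development machine, read the text file line by line, and write the encoded bytes out as a C or assembler source file for each target.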

For speed in decoding this can prepend an offset and/or length table to access a string quickly.
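A sketch of the offset-table idea, assuming the encoder emits one packed pool of NUL-terminated strings plus a table of 16-bit offsets. The names `string_pool`, `string_offset` and `get_string` and the sample strings are made up for the example.

```c
#include <stdint.h>

/* All strings packed back to back, each NUL-terminated (sample data). */
static const char string_pool[] = "North\0South\0You see a door.";

/* Offset of each string inside the pool, as the encoder would emit it. */
static const uint16_t string_offset[] = { 0, 6, 12 };

#define STRING_COUNT (sizeof string_offset / sizeof string_offset[0])

/* Return string number `id` without scanning the earlier strings.
   Out-of-range numbers yield an empty string. */
const char *get_string(uint8_t id)
{
    if (id >= STRING_COUNT) return "";
    return &string_pool[string_offset[id]];
}
```

The table costs two bytes per string, so it trades a little space for constant-time lookup.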

If you think back to other compression techniques, you can move away from single tokens and use byte pairs, i.e. a token and an argument.

This way you can have an encoder that searches the existing string table for already encoded text and employs a token that takes the string number as its argument.
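A minimal sketch of decoding such a token/argument pair. The escape value 0x01 and the convention of storing the string number plus one (so the argument is never a NUL byte) are assumptions chosen for the example, as are the function names.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed escape byte: the byte after it holds (string number + 1). */
#define ESC_STRING 0x01u

/* Append the expansion of `src` to `dst` starting at index `n`.
   The encoder must only reference earlier strings, otherwise the
   recursion would never end. Returns the new length. */
static size_t append_expanded(const char *src, const char *const *table,
                              uint8_t count, char *dst, size_t cap, size_t n)
{
    const uint8_t *p = (const uint8_t *)src;

    while (*p != 0) {
        uint8_t c = *p++;
        if (c == ESC_STRING && *p != 0) {
            uint8_t id = (uint8_t)(*p++ - 1);
            if (id < count)
                n = append_expanded(table[id], table, count, dst, cap, n);
        } else if (n + 1 < cap) {
            dst[n++] = (char)c;
        }
    }
    return n;
}

/* Expand string number `id` into `dst` (capacity `cap`, including NUL). */
size_t expand_string(uint8_t id, const char *const *table, uint8_t count,
                     char *dst, size_t cap)
{
    size_t n = 0;

    if (cap == 0) return 0;
    if (id < count)
        n = append_expanded(table[id], table, count, dst, cap, 0);
    dst[n] = '\0';
    return n;
}
```

With a table such as `{ "the brass lantern", "You take \x01\x01." }`, string 1 costs two bytes for the reference instead of seventeen for the repeated phrase.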

Harry Potter · October 28, 2022

I thought about a byte pair to indicate a token, but I think my method is more efficient. I was thinking about creating code to compress the text automatically. If I can do that, I can add Huffman codes and Placement Offset to the technique. I'm not prepared to do this just yet. Is there anything else I can do to improve this function?

BTW, if you want to know what Placement Offset is, just ask.

Wrathchild · October 28, 2022

I don't see it as really practical as it stands. Approaches like the mentioned Elite, or even that employed by the Infocom adventures are better choices IMO, even using any existing packer tool with a common key would suffice.

In terms of the file itself, the user would have to edit the 'tokens' array for their own purpose. In a 'module', the user would be better off passing their own array during initialisation, or it could instead be declared as extern.

The comments reference the range 0x80 to 0x9F, whereas the code uses 0xA0 to 0xBF, so that needs to be corrected.

You are exploiting the fact that cc65 treats plain char as unsigned by default (this can be reversed via a command-line option), so having 'char i' and 'i>=0xA0' is non-portable. Maybe adopt stdint in your work, e.g. uint8_t.

Remove redundant code.
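To illustrate the points about the token array and uint8_t, here is a minimal sketch of a token expander that takes the caller's array as a parameter and compares bytes as uint8_t. It is not the code from printtok_atari.c: the name `expand_tokens`, the buffer-based interface, and the 0xA0 to 0xBF range constants are assumptions for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed token range: bytes 0xA0..0xBF select one of 32 tokens. */
#define TOK_FIRST 0xA0u
#define TOK_LAST  0xBFu

/* Expand `src` into `dst` (capacity `cap`, including the terminator).
   `tokens` is supplied by the caller instead of being edited inside the
   module. Bytes are read as uint8_t, so the range test behaves the same
   whether plain char is signed or unsigned.
   Returns the number of characters written, excluding the terminator. */
size_t expand_tokens(const char *src, const char *const *tokens,
                     char *dst, size_t cap)
{
    size_t n = 0;
    const uint8_t *p = (const uint8_t *)src;

    if (cap == 0) return 0;
    for (; *p != 0; ++p) {
        uint8_t c = *p;
        if (c >= TOK_FIRST && c <= TOK_LAST) {
            const char *t = tokens[c - TOK_FIRST];
            while (*t != '\0' && n + 1 < cap) dst[n++] = *t++;
        } else if (n + 1 < cap) {
            dst[n++] = (char)c;
        }
    }
    dst[n] = '\0';
    return n;
}
```

A printing version would replace the writes to `dst` with calls to whatever output routine the target uses.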

Harry Potter · October 28, 2022

Wrathchild, I thank you for your input. Where can I find the software to do the encoding for cc65? I want to encode for multiple systems. How does Infocom do the compression? I have to look at the Elite method again.

Wrathchild · October 28, 2022

Google is your friend

Harry Potter · October 28, 2022

Found it! It uses 5 bits to encode each character. I can't use the technique as stated, as my text adventure requires both upper-case and lower-case. I could encode in 7 bits per character, though. Thank you.
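For reference, a simplified sketch of the 5-bit unpacking: each big-endian 16-bit word carries three 5-bit codes, and the top bit marks the last word of a string. The alphabet here is deliberately cut down (code 0 is a space, codes 6 to 31 are 'a' to 'z', codes 1 to 5 are skipped), so shift codes and abbreviations are not handled. The name `unpack5` is made up, and the `'a' + offset` arithmetic assumes an ASCII-like character set.

```c
#include <stddef.h>
#include <stdint.h>

/* Unpack 5-bit codes from `src` into `dst` (capacity `cap`, including
   the terminator). Reading stops after the word whose top bit is set.
   Returns the number of characters written, excluding the terminator. */
size_t unpack5(const uint8_t *src, char *dst, size_t cap)
{
    size_t n = 0;
    uint16_t w;
    uint8_t i, code;

    if (cap == 0) return 0;
    do {
        w = (uint16_t)(((uint16_t)src[0] << 8) | src[1]);
        src += 2;
        for (i = 0; i < 3; ++i) {
            code = (uint8_t)((w >> (10 - 5 * i)) & 0x1F);
            if (n + 1 >= cap) continue;      /* output full: drop the rest */
            if (code == 0)
                dst[n++] = ' ';
            else if (code >= 6)
                dst[n++] = (char)('a' + (code - 6));
            /* codes 1..5 (shifts, abbreviations, padding) are skipped here */
        }
    } while ((w & 0x8000u) == 0);
    dst[n] = '\0';
    return n;
}
```

Three characters fit in two bytes this way, so plain lower-case text shrinks to about two thirds of its size before any tokens are applied.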

Wrathchild · October 28, 2022

[Screenshot: Wishbringer, Atari 8-bit, starting location]

What makes you think Infocom games don't have upper and lowercase text?!

Harry Potter · October 28, 2022

I just looked back at the docs. You are right. But I still need space for the tokens. I can use a special character to indicate that the next character is a token. Again, thank you. In the meantime, is there a way to improve printtok() as it stands?

Harry Potter · October 28, 2022

On some codes, I could add an extra bit to select one of two characters. Less-used characters could then be encoded with 6 bits, thereby increasing the number of characters that could be supported. What do you think?

Wrathchild · October 28, 2022

I'd recommend trying to write your own encoder first.

Harry Potter · November 5, 2022

I am currently working on a better version of the function. Right now, I have the lits compressor and the output writer started. I plan to use a modification of Infocom's technique and tokenization. There's no room for BPE; LZ77 can't be used, as it would need to reference previous strings, and they are compressed; my Placement Offset Basic technique would cost more than help, and Huffman would require extra complexity. Does anybody here have other ideas on how I can compress text?
