cc65 header for international characters.. is this useful? Feedback welcome

scottinNH · October 10, 2022

This helps me use international characters (I don't have to look up hex anymore).

What would you change?

Anyway, here is what I have:

/* Atari 8-bit International characters
Documents the International Characters found in the standard US-Western Europe XL/XE ROM.
Reference: https://en.wikipedia.org/w/index.php?title=ATASCII&oldid=1035357981#International_Character_Set

Further international support exists in other Atari ROMs (for example, Arabic).
Reference: https://web.archive.org/web/20161025170529/http://joyfulcoder.net/atari/atascii/

"Hello World" language source: https://codegolf.stackexchange.com/questions/146544/hello-world-in-multiple-languages

TRY:
French:       Bonjour monde!
German:       Hallo Welt!
Portuguese:   Olá Mundo!
Spanish:      ¡Hola Mundo!
*/

/* NOTE: UMLAUT and DIAERESIS look identical. This is not Unicode, so the exact name does not matter.
   The names chosen for the #defines are intentionally aligned with Wikipedia.
   You can view the matching Wikipedia page by appending the literal character to the base URL
   https://en.wikipedia.org/wiki/  Example: https://en.wikipedia.org/wiki/ò  */

// define (NAME) (decimal value)  // (hex value) (literal character) (keystroke)
#define LOWER_A_ACUTE           0       //  x00     á   CTRL+,
#define LOWER_U_GRAVE           1       //  x01     ù   CTRL+A
#define UPPER_N_TILDE           2       //  x02     Ñ   CTRL+B
#define UPPER_E_ACUTE           3       //  x03     É   CTRL+C
#define LOWER_C_CEDILLA         4       //  x04     ç   CTRL+D
#define LOWER_O_CIRCUMFLEX      5       //  x05     ô   CTRL+E
#define LOWER_O_GRAVE           6       //  x06     ò   CTRL+F
#define LOWER_I_GRAVE           7       //  x07     ì   CTRL+G
#define POUND_SIGN              8       //  x08     £   CTRL+H
#define LOWER_I_DIAERESIS       9       //  x09     ï   CTRL+I
#define LOWER_U_DIAERESIS       10      //  x0A     ü   CTRL+J
#define LOWER_A_DIAERESIS       11      //  x0B     ä   CTRL+K
#define UPPER_O_DIAERESIS       12      //  x0C     Ö   CTRL+L
#define LOWER_U_ACUTE           13      //  x0D     ú   CTRL+M
#define LOWER_O_ACUTE           14      //  x0E     ó   CTRL+N
#define LOWER_O_DIAERESIS       15      //  x0F     ö   CTRL+O
#define UPPER_U_UMLAUT          16      //  x10     Ü   CTRL+P
#define LOWER_A_CIRCUMFLEX      17      //  x11     â   CTRL+Q
#define LOWER_U_CIRCUMFLEX      18      //  x12     û   CTRL+R
#define LOWER_I_CIRCUMFLEX      19      //  x13     î   CTRL+S
#define LOWER_E_ACUTE           20      //  x14     é   CTRL+T
#define LOWER_E_GRAVE           21      //  x15     è   CTRL+U
#define LOWER_N_TILDE           22      //  x16     ñ   CTRL+V
#define LOWER_E_CIRCUMFLEX      23      //  x17     ê   CTRL+W
#define LOWER_A_OVERRING        24      //  x18     å   CTRL+X
#define LOWER_A_GRAVE           25      //  x19     à   CTRL+Y
#define UPPER_A_OVERRING        26      //  x1A     Å   CTRL+Z
#define INVERTED_EXCLAMATION    96      //  x60     ¡   CTRL+.
#define UPPER_A_DIAERESIS       123     //  x7B     Ä   CTRL+:

example caller:

#include <stdio.h>
#include "int_char_set.h"

// TODO: This define logic isn't perfect; it won't fall back or default to any language if none is defined during the build.

int main(void)
{
    char pause;

    // POKE 756,204 enables the built-in international character set.
    *(unsigned char*)(756) = 204;

    #ifdef FRENCH 
        printf("Bonjour monde!\n");
    #endif

    #ifdef GERMAN
        printf("Hallo Welt!\n");
    #endif

    #ifdef PORTUGUESE
        printf("Ol%c Mundo!\n", LOWER_A_ACUTE); // Olá Mundo!
    #endif

    #ifdef SPANISH
        printf("%cHola Mundo!\n", INVERTED_EXCLAMATION);
    #endif

    #ifdef ENGLISH
        printf("Hello World!\n");
    #endif

    scanf("%c", &pause);
    return 0;
}

What I do not like about this is it still relies on `printf()` simply to join an international character with string text... see next post:

Edited October 10, 2022 by scottinNH
clarity, move question to next post

scottinNH · October 10, 2022

Here is some synthetic (non-working) code, as an example of what I want to do with this:

    #ifdef PORTUGUESE
        printf_special("Ol{LOWER_A_ACUTE} Mundo!\n", ); // Olá Mundo!
    #endif

I am seeking a way to be able to directly print international text -- without having to pass the international character as an argument to printf().

This would feel more natural, even if in the above (fake) example I have to escape the special character somehow.

At least it would mean you could prepare the text in advance using sed/awk replacement of "á" with a define.

If someone understands my goal, I could be overlooking a simpler approach.

Mainly I am trying to avoid use of `printf()` as so many have suggested. Cheers.

I was looking into how printf() works because I wanted to steal the escaping stuff, so that maybe `\LOWER_A_ACUTE` could be an escape sequence.

Maybe that is the right approach (?) but I got pretty deep into the weeds tracking all the sub-functions of printf().

Edited October 10, 2022 by scottinNH

ivop · October 10, 2022

11 hours ago, scottinNH said:

#define UPPER_U_UMLAUT          16      //  x10 	Ü	CTRL+P

This one is inconsistent. Umlaut is German. You used diaeresis for the others.

Edit: sorry, I missed your note about this in the comments :))

This still stands though:

Or perhaps support both spellings.

Edited October 10, 2022 by ivop

ivop · October 10, 2022

If you define the characters as strings, you can use simple C string concatenation.

Like:

printf("Ol" LOWER_A_ACUTE " Mundo!\n");

Note, however, that some of them clash with ASCII control characters, like \t and \n (tab and newline), or with the end-of-string marker (\0, i.e. 0x00).

Edited October 10, 2022 by ivop

scottinNH · October 10, 2022

5 hours ago, ivop said:
If you define the characters as strings, you can use simple C string concatenation.

Like:
printf("Ol" LOWER_A_ACUTE " Mundo!\n");
Note, however, that some of them clash with ASCII control characters, like \t and \n (tab and newline), or with the end-of-string marker (\0, i.e. 0x00).

Yup, another way, thanks.

I think "I" am happy to use printf ...but a lot of folks actively seek to avoid printf to conserve memory.

So I am trying to adhere to their goal, and come up with a small light framework others could use, help improve, or maybe clean up and submit to CC65 as a PR.

dmsc · October 10, 2022

Hi!

17 hours ago, scottinNH said:

This helps me use international characters (I don't have to look up hex anymore).

What I do not like about this is it still relies on `printf()` simply to join an international character with string text... see next post:

There is a much better way in CC65 to make character code translations, see this example:

[image: screenshot of the example source code, not preserved]

Compiling with:

cl65 -tatari -o example.xex example.c

Gives this:

The default charmap when compiling for the Atari is here: https://github.com/cc65/cc65/blob/master/include/atari_atascii_charmap.h , it only makes four changes:

07 -> FD (BELL)

09 -> 7F (TAB)

0A -> 9B (EOL)

0C -> 7D (FF, changed to clear-screen)

Note that for the accented-character translation to work, you must save your source code in an 8-bit text encoding. Most modern editors default to UTF-8, which uses more than one byte for the accented Latin-1 characters.

Have Fun!

example.zip

scottinNH · October 11, 2022

5 hours ago, dmsc said:

Hi!

There is a much better way in CC65 to make character code translations, see this example:

Thank you so much!

From another thread, had gotten the wrong impression you needed to use low-level putchar() (in order to get around some unwanted character translation issue).

Therefore I was working to create something easier to use, but we already have it. Awesome! :-D

Now I can scale my effort back: simply create a multi-language example to contribute as documentation. Cheers.

ivop · October 11, 2022

11 hours ago, scottinNH said:

From another thread, had gotten the wrong impression you needed to use low-level putchar() (in order to get around some unwanted character translation issue).

Which is what is more or less done here, too. It calls write(), which is raw stdio.

Edited October 11, 2022 by ivop

dmsc · October 11, 2022

Hi!

56 minutes ago, ivop said:

Which is what is more or less done here, too. It calls write(), which is raw stdio.

Yes, so this also generates smaller code.

Basically, the CC65 library does two translations when using stdio functions, to provide compatibility with standard C:

- Standard "FILE*" pointers are mapped to integer unix-like file-descriptors.

- File descriptors are mapped to CIO channels.

Also, CC65 reuses channels, so the standard input/output and error file-descriptors (0, 1 and 2) are all mapped to I/O channel #0.

Have Fun!

scottinNH · October 20, 2022

On 10/10/2022 at 6:35 PM, dmsc said:

Note that for the accented-character translation to work, you must save your source code in an 8-bit text encoding. Most modern editors default to UTF-8, which uses more than one byte for the accented Latin-1 characters.

Hmm OK YUP I just ran into that issue: VS Code works in UTF-8.. 🙂

.. for the thread, FYI you can tell VS Code to put a file into ISO 8859-1 (click on "UTF-8" at the bottom of the window)

..but my string (with tilde n) will "look correct" in VS Code, but print wrong on the Atari. So it's not working or I miss a step.

When I open your example.c and compile it with cl65, it displays correctly on atari800

...but when viewing source in VS Code your string is

A�o Nuevo\n

The problem makes sense to me... encoding conflict. The solutions, however, make less sense (I come from Python and Perl and have not dealt with i18n).

Do you (anyone here) know how to get 8-bit text working properly in VS Code? (on a per-file basis, or at least per-project)
What are you using for a text editor? I can use something else for this work.

Cheers

baktra · October 20, 2022

There are two factors:

1. Character encoding of your source file

2. Character encoding expected by cc65

If these two are a mismatch, you can expect weird results.

Character encoding of your source file is under your control.

The character encoding expected by cc65 is platform-specific and can differ between, say, macOS and Windows.

To answer your question about VS Code:

Open the file, click "UTF-8" at the bottom, select "Reopen with Encoding", and then select CP-1252.

Edited October 20, 2022 by baktra

dmsc · October 26, 2022

Hi!

On 10/20/2022 at 1:12 AM, scottinNH said:

Hmm OK YUP I just ran into that issue: VS Code works in UTF-8.. 🙂

.. for the thread, FYI you can tell VS Code to put a file into ISO 8859-1 (click on "UTF-8" at the bottom of the window)

..but my string (with tilde n) will "look correct" in VS Code, but print wrong on the Atari. So it's not working or I miss a step.

When I open your example.c and compile it with cl65, it displays correctly on atari800

...but when viewing source in VS Code your string is

A�o Nuevo\n

The problem makes sense to me... encoding conflict. The solutions, however, make less sense (I come from Python and Perl and have not dealt with i18n).

Do you (anyone here) know how to get 8-bit text working properly in VS Code? (on a per-file basis, or at least per-project)