How to localise your thing

ggn · April 25, 2020

Seems like yesterday's post prompted more questions from my friend, who is too shy to post on the forum and sends me private messages instead! Don't be shy folks, ask away publicly instead of by PM, nobody will judge.

Anyway, the question was how one would tackle localisation in a game. I'll cover the string printing part here, but if your project has its text drawn as graphics you could still pick up a couple of ideas.

Easy (dumb) mode

The first obvious thing to do is to have a variable that holds which language we're using.

' Global variable which will hold the language
dim language%

Now, we could be very dumb about this and just remember in our head which language is which number and move on. So, something like

' 0=English, 1=French, 2=Italian

And when it's time to print a string we can do something equally simple like

if language=0 then
    print "(Our message in English)"
elseif language=1 then
    print "(Our message in French)"
elseif language=2 then
    print "(Our message in Italian)"
endif

And this is really ok. It'll work nicely if you only have 3-4 messages in the game and you don't need to worry about it.

Let's get organised

But say you suddenly have 30 text strings in your game instead of 3. Now you have to go to each place you print messages and change or add the "if" statements. Or wait for your translator to translate all the strings, and hope they find all the places that need to change.

And then, you decide to add a fourth language - argh! Well ok, go back, change all ifs, bish bash bosh. And then you decide that the second language is to be removed - double argh!

Do this a few times, tinkering around the source, and you're going to be in a bad place mentally and won't feel like doing it at all.

So let's try to leverage some of the power high level languages provide us in order to organise things better!

Enumerations

As a first step, let's get rid of the hardcoded values for languages. We could do something like:

const English=0
const French=1
const Italian=2

Again, perfectly fine for most cases, but we can also use an enum for that:

enum
    English
    French
    Italian
end enum

This is pretty much equivalent to the above, but the numbering is done automatically, starting from zero. So if for some reason you choose to insert German between French and Italian, the enum will equate German to 2 and Italian to 3. Not a biggie, but it's one less thing on your mind!
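To see the renumbering in action without firing up the Jaguar toolchain, here is a minimal sketch in Python (not rb+). The names Language and LanguageWithGerman are made up for this illustration, and start=0 is only there to mimic the zero-based numbering of the BASIC enum.

```python
from enum import IntEnum

# Hypothetical names for illustration; start=0 mimics the BASIC enum,
# which numbers its members automatically from zero.
Language = IntEnum("Language", ["English", "French", "Italian"], start=0)

# Same list with German inserted between French and Italian.
LanguageWithGerman = IntEnum(
    "LanguageWithGerman", ["English", "French", "German", "Italian"], start=0
)

print(int(Language.Italian))            # 2
print(int(LanguageWithGerman.German))   # 2
print(int(LanguageWithGerman.Italian))  # 3
```

Nothing else in the program has to change, as long as the code only ever uses the names and never the raw numbers.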

Dimensioning strings

Now let's try to get rid of all that if/endif block. We do this by using our old friend, DIM:

dim message_pressfire$[3,40] as char
message_pressfire$[English]="Press fire to start              "
message_pressfire$[French] ="Appuyez sur le feu pour commencer"
message_pressfire$[Italian]="Premi il fuoco per iniziare      "

What we did here is to tell the compiler to create an array of strings. We have 3 languages, we know that each message is less than 40 characters, so [3,40] it is.

One small detail to notice is the "as char" at the end of that line. This tells the compiler not to use the default string size that DIM uses, which is 2048. So if we wrote "dim message_pressfire$[3]" it would still create an array of 3 strings, but each would be 2048 bytes. As you can imagine, this wastes a lot of RAM!
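The saving is easy to put a number on. This is just back-of-the-envelope arithmetic written as Python, assuming the array takes exactly "number of strings times bytes per string" and nothing else; the 2048 figure is the default quoted above.

```python
DEFAULT_STRING_SIZE = 2048  # default DIM string size quoted in the post
NUM_LANGUAGES = 3
MAX_LEN = 40                # second dimension of message_pressfire$[3,40]

default_bytes = NUM_LANGUAGES * DEFAULT_STRING_SIZE  # dim message_pressfire$[3]
sized_bytes = NUM_LANGUAGES * MAX_LEN                # dim message_pressfire$[3,40] as char

print(default_bytes, sized_bytes)  # 6144 120
```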

Finally, we can replace that if/endif block with

PRINT message_pressfire$[language]

language of course must be initialised first to one of the values from our enum, for example "language=English".

Entering the multiverse

The above isn't too bad and can again suit your needs. But even this can get messy, since you need multiple named arrays and have to remember all their names. Depending on your programming style this can be a problem.

So, let's throw more dimensions and MOAR enums at the problem!

First of all, let's enum all the messages we're going to use:

enum                ' define some constants for our numbers instead of using 0,1,2,3,4
    msg_hi=0
    msg_bye
    msg_win
    msg_lose
    msg_ready
    num_messages    ' This must be at the end!
end enum

num_messages will always auto update to hold the actual number of messages. So if you add 3 more messages it'll be equal to 8!

So let's now DIM an array that will hold all our messages for all languages:

dim messages$[3,5,48] as char

So compared to our DIM above this has added a third dimension that will hold all our messages.

So let's populate the array (localisation provided by Google translate so it's probably terrible!):

messages$[English][msg_hi]="Hi   "
messages$[French] [msg_hi]="Salut"
messages$[Italian][msg_hi]="Ciao "
messages$[English][msg_bye]="Bye      "
messages$[French] [msg_bye]="Au revoir"
messages$[Italian][msg_bye]="Addio    "
messages$[English][msg_win]="You win!    "
messages$[French] [msg_win]="Vous gagnez!"
messages$[Italian][msg_win]="Hai vinto!  "
messages$[English][msg_lose]="You lose   "
messages$[French] [msg_lose]="Tu as perdu"
messages$[Italian][msg_lose]="Hai perso  "
messages$[English][msg_ready]="Get ready"
messages$[French] [msg_ready]="Sois pret"
messages$[Italian][msg_ready]="Preparati"

A few small details here:

The "48" is hardcoded as the maximum string length for all the messages. Take care not to overshoot this or you might overwrite other things in memory!
Each message is padded with spaces to match the longest translation of that message. This is not necessary, but it's good practice: without the padding, if you switch languages and re-print the new language's messages on top of the old ones, leftovers of a longer old message can stay on screen.
I didn't use num_messages instead of "5" above because of some limitation with BCX. I couldn't find a way to overcome it for now, but if anyone has an issue with this in practice we can certainly revisit the issue.
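Padding by hand gets tedious once the list grows, so the strings could be prepared on the desktop first. The following is a rough sketch in Python (not rb+) of a hypothetical helper, pad_messages, that pads every message to its longest translation and complains if anything would not fit. It assumes one byte of each 48-byte slot is needed for the string terminator, which is my assumption and not something checked against BCX.

```python
MAX_LEN = 48  # same hardcoded limit as in the DIM above

# Unpadded messages, one list per language, in the same order as the message enum.
messages = {
    "English": ["Hi", "Bye", "You win!", "You lose", "Get ready"],
    "French": ["Salut", "Au revoir", "Vous gagnez!", "Tu as perdu", "Sois pret"],
    "Italian": ["Ciao", "Addio", "Hai vinto!", "Hai perso", "Preparati"],
}


def pad_messages(table, max_len=MAX_LEN):
    """Pad each message with spaces to the longest translation of that message."""
    num_messages = len(next(iter(table.values())))
    padded = {lang: [] for lang in table}
    for i in range(num_messages):
        width = max(len(table[lang][i]) for lang in table)
        # Assumption: one byte of the slot is reserved for the terminator.
        if width >= max_len:
            raise ValueError(f"message {i} needs {width} characters, limit is {max_len - 1}")
        for lang in table:
            padded[lang].append(table[lang][i].ljust(width))
    return padded


padded = pad_messages(messages)
for lang, strings in padded.items():
    print(lang, strings)
```

The padded strings can then be pasted into the messages$ assignments as they are.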

So if we want to print out our "lose" message in Italian, it's as easy as typing

print messages$[Italian][msg_lose]

Or, hey, let's print all the messages and have a language selection!

do
    local i as short

    vsync
    ZEROPAD()
    if zero_left_pad band Input_Pad_C then
        language=English
    elseif zero_left_pad band Input_Pad_B then
        language=French
    elseif zero_left_pad band Input_Pad_A then
        language=Italian
    endif

    for i=0 to num_messages-1
        rlocate 0,32+8*i
        rprint messages$[language][i]
    next i
loop

So this will change all the printed messages as you press the A, B and C buttons. Not too bad, right? Try doing that with if/endif blocks and see how fast the source clutters up! (Of course this isn't a really practical example per se, but with a little more code it could be modified to print various strings at various positions on screen.)

Closing time!

Hopefully this helps people out. As usual, the code is up on Github and you can build it yourself by downloading/cloning/pulling the latest repository and building project "localisation". Study the source, use it at will, come back with questions if you have them. But have fun regardless!

Fredifredo · April 26, 2020

Sadly I found a bad thing in RB+: it doesn't support the French language...

"ê" "é" "à" "ë" "ï" crash the rprint...

And this "`" took me 25 minutes to understand that it wasn't " ' "...

ggn · April 27, 2020

Well it's too late for me to type a lengthy reply about this, so I'll just leave this here for now:

For the impatient, just grab the latest source from github and study the example project and modified font

ggn · April 28, 2020

All right, I've got some rest and can probably make more sense than last night, so let's dive straight in.

Computers have no idea what languages are

What the title means is that when you ask a computer to print a message on screen, what really happens is that you call a routine, point it to a series of bytes and a series of images. Then the print routine reads each byte value, takes the corresponding images and draws them to the screen.

Here is the default font used in raptor:

The leftmost character, space, corresponds to ASCII value 32 and the rightmost character 'Δ' corresponds to 127. So there. That's all the characters we have at our disposal. Raptor doesn't support any ASCII values bigger than 127 or smaller than 32.

(Small grumble there: this is one of the times where using a closed source library bites us. It would have been fairly simple to extend the limit at least up to 250 or so, rebuild the library and ship it. But it is what it is)

However, we're not out of options! If we absolutely must keep the upper and lower case letters and numbers, that leaves us with quite a few glyphs we probably won't use. So we can replace them with others. Hooray!

So, for example, to print an "é" character with this modified font we enter "#" in our string: to print "gagnée" we would type "gagn#e".

This is, of course, really painful because now one has to remember where all the funny symbols are and use the substitute characters accordingly. Can we do something better?

The tragic tale of character encodings

Actually this is a very long story so I'll cut to the chase. Interested people can probably look this up for a more accurate version.

So when computers were younger, memory was at a premium, and computer standardisation was still in its infancy, people only used a byte per character as we said. But with so many languages and special glyphs, there's only so many you can cram into 128 slots. (Generally computers had up to 256 characters, but the lower 128 were usually well defined by the ASCII standard. So people were left with 128 free.)

Because there was no centralised standard for non-English characters, each company that imported its computer to a country pretty much rolled its own version for each language. Greek, for example, was a nightmare: one simply couldn't copy a Greek text from an Atari to an Amstrad, it would come out pretty much garbled at the other end.

So, fast forward a couple of decades and these problems were solved with standards like UTF-8. But at a small cost: the extended characters are not a byte each any more.

Let's check out a modern editor that a lot of people use on Windows: Notepad++. Suppose we try to enter the following string into our rb+ source: "êâîôûù". By default this is what we'll see:

[screenshot]

That.... didn't go as expected. Why is that?

[screenshot]

Oh, it's set to the old ANSI mode, i.e. 1 byte per character. But there's an option to convert to UTF-8. Let's try it out:

[screenshot]

Nice! Let's save our source and run it, right? Except nope, because once we enable UTF-8 this is what actually gets written to the file:

Hmm, for each accented character 2 bytes are written to our file, and each pair starts with hex $c3, which is 195 in decimal. That is well over the 127 limit. Feed that into raptor and it'll go boom! So, is that another dead end?
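If you don't have a hex editor at hand, the same bytes can be checked with a couple of lines of Python (3.8 or newer for the separator argument of hex). This only inspects the encoding on the desktop, it has nothing to do with rb+ itself.

```python
text = "êâîôûù"
encoded = text.encode("utf-8")

print(len(text), len(encoded))  # 6 12
print(encoded.hex(" "))         # c3 aa c3 a2 c3 ae c3 b4 c3 bb c3 b9

# Every single byte is above 127, so none of them is a value raptor accepts.
print(all(b > 127 for b in encoded))  # True
```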

I'm going through changes

Well, what if we were to convert the multi-byte characters to the single-byte glyph values we picked, before printing the string?

function convert_utf8$(string$)
    local c as UBYTE
    local d as UBYTE
    local i as short
    local j as short
    local converted$
    j=0
    
    for i=0 to len(string$)-1   'the terminating zero is written after the loop
        c=peek(((int)string$)+i)
        if c=0xc3 then
            d=peek(((int)string$)+i+1)
            select case d
                case 0xaa 'ê
                    c=41
                case 0xa2 'â
                    c=36
                case 0xae 'î
                    c=60
                case 0xb4 'ô
                    c=43
                case 0xbb 'û
                    c=125
                case 0xb9 'ù
                    c=126
                case else
                    'We don't know this character. Just turn it into A?
                    c=65
            end select
            i++     'skip the second byte of the two-byte character
        endif
        poke strptr(converted$)+j,c
        j++
    next i
    poke (strptr(converted$)+j),0
    function=converted$
end function

This is a very dumb routine that goes over each byte of a string. If the character is not a multi-byte one (i.e. doesn't start with $c3) then we're fine. But if a $c3 is encountered then it checks to see which character it is and converts it to the value we know is correct. So "ê" is 41 (which is the ")" character originally, but repurposed). Write that into a new string, pass it back to the caller, done.
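For testing the mapping away from the Jaguar, the same idea can be sketched in Python (not rb+). The table below is copied from the case statements in the routine, while the names GLYPH_FOR and FALLBACK are made up for this sketch. Like the original, it only understands two-byte sequences that start with $c3.

```python
# Second byte of each $c3 sequence -> glyph slot, copied from the routine above.
GLYPH_FOR = {
    0xAA: 41,   # ê
    0xA2: 36,   # â
    0xAE: 60,   # î
    0xB4: 43,   # ô
    0xBB: 125,  # û
    0xB9: 126,  # ù
}
FALLBACK = 65  # unknown character: turn it into "A", as the routine does


def convert_utf8(data: bytes) -> bytes:
    """Replace each two-byte $c3 sequence with its single-byte glyph value."""
    out = bytearray()
    i = 0
    while i < len(data):
        c = data[i]
        if c == 0xC3 and i + 1 < len(data):
            c = GLYPH_FOR.get(data[i + 1], FALLBACK)
            i += 1  # skip the second byte of the sequence
        out.append(c)
        i += 1
    return bytes(out)


print(list(convert_utf8("êâîôûù".encode("utf-8"))))  # [41, 36, 60, 43, 125, 126]
print(convert_utf8("gagnée".encode("utf-8")))        # b'gagnAe' (é is not in the table)
```

Keeping the mapping in one table makes it easier to spot characters that still fall through to the fallback.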

So before we print our string with accented characters we call convert_utf8 and get the proper string back. In the updated localisation example we can see the following line that demonstrates its use:

rprint convert_utf8("Special characters can work... êâîôûù")

And that's all there is to it.

Closing words

There's still some work to be done here as not all the characters are handled in that routine above. Firstly, one will have to decide where to place the characters that have to be converted (I just did it at random and pretty much sacrificed all punctuation). Then, all the UTF-8 characters have to be identified in a hex editor and then entered into the subroutine, as well as the corresponding single byte characters they will get mapped to. But it really is not very difficult, and the results justify it in my opinion: especially if there's a lot of text to be written, this will greatly simplify the writing.

One final trick here: if you absolutely need both the original font and the localised one, you can keep the first in the upper characters (white in the font bitmap) and place the accented font in the lower characters (red). The colours are of course user defined as well, so you can make them all white, red, or whatever you wish. Use basic_r_indx to switch fonts before printing.

That's all, have fun!

Edited April 28, 2020 by ggn

Fredifredo · April 28, 2020

Impressive

One problem > one solution

Congratulations to you ggn !

Time for me to work with that ... More of that later on

+CyranoJ · April 28, 2020

8 hours ago, ggn said:

(Small grumble there: this is one of the times where using a closed source library bites us. It would have been fairly simple to extend the limit at least up to 250 or so, rebuild the library and ship it. But it is what it is)

But then rB+ uses API v1 - I'll add that to the next patch for API v2.0, but I've seen no move to change rB+ to use the new API and get all the new features and bug fixes.
