Technical summary: a case study in taking a non-localized application and coordinating its localization using Locale::Maketext (see Locale::Maketext on CPAN, and our discussion in TPJ13)

Localizing Open Source Software

Sean M. Burke

As Perl Journal/Sys Admin readers, most of you are probably native speakers of English — but that puts you in the minority of computer users worldwide. If you think that you’ve got it rough just trying to remember all the switches to ls, think about the hundreds of millions of people who put up with the headache of dealing with applications whose interfaces are not localized (i.e., translated) into their native languages. It’s a bad situation, but open source software can make it better.

Many programs typically aren’t localized because they consist of proprietary software that was designed without easy localization in mind. That means later localization can be done only by the company that produced the original software, and only if that company decides there’s market enough to justify the expense of translating the software’s interface strings and getting a programmer to compile a new version of the software with those strings. So, if you’re Estonian (for example) and you want to use Excel with all the menus in Estonian, you have to wait for Microsoft to make an Estonian version of Excel — and if they don’t, then you’re just out of luck.

We can do better in open source by writing software with the goal that localization should be easy both for the programmers and maybe even for eager users. (After all, practically the definition of “open source” is that it lets anyone be a programmer, if they are interested enough and skilled enough.)

This article will not tell you the micro-details of using the particular localization systems like gettext or Locale::Maketext — both these systems come with tutorials that are good for such low-level considerations. Instead, this article is about the large-scale process of dealing with the volunteer translators that you’ll work with in your open source project. I’ll begin with a brief discussion of the most recent localization project that I managed, then describe some lessons I’ve learned from it and other projects.

Localizing Apache::MP3

The mod_perl module Apache::MP3 is for presenting a friendly interface to a collection of music files (generally MP3s, but now also OGGs). It’s been around a while, but it was just this April that I suddenly noticed that many of my friends were using it for their music libraries. One friend in particular showed me his music collection via an Apache::MP3 interface, and I noted that while he was a native speaker of Japanese, and most of his music files were named in Japanese, the interface he was using was stubbornly in English. I thought, “This needs fixing!”

So I decided to try my hand at translating Apache::MP3’s interface into as many languages as possible. The Apache::MP3 author, Lincoln Stein, said he thought it was a great idea, and that I should run with it. The easy part was finding every user-visible English string in the Apache::MP3 source, and replacing it with a call to a function that got that string’s translation, for whatever language the user prefers. So, for example, this:

print "<a href='$qhurl'>Quick Help Summary</a>\n";

is replaced with this:

print "<a href='$qhurl'>", x("Quick Help Summary"),
  "</a>\n";

Then the x() function is responsible for looking up the string “Quick Help Summary” in the lexicon of phrases for the user’s language. Going through Apache::MP3’s code and putting in calls to the translator function actually only took an hour or so.[1] Then I grepped the file for every occurrence of “x(” and built a lexicon of all such phrases, like so:

"Quick Help Summary" => "...",

...where the key on the left is the string being looked up, and the value on the right is where a translator should provide a string that should appear when the software is looking for how to say “Quick Help Summary”. This is the format used by the localization framework Locale::Maketext, and other systems will use slightly different formats, like with alternating lines of keys and values. The “phrases”, as I call lexicon entries, can be real multi-word phrases as seen here:

"Quick Help Summary" => "...",

Or they can be single words, like these column headings for music file listings:

"Artist"   => "...",
"Track"    => "...",
"Filename" => "...",
"Bitrate"  => "...",

Phrases can be abbreviations:

"Sec" => "...",

Or they can be long complete sentences:

"In this demo, streaming is limited to approximately _ seconds." => "...",
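The pieces described so far (an x() function plus one lexicon per language) can be sketched with Locale::Maketext roughly as follows. This is a minimal illustration, not the actual Apache::MP3 code: the MyApp::L10N class names, the placeholder URL, and the French values are all assumptions made up for the example, and the packages are kept in one file only so that the sketch runs on its own.

```perl
use strict;
use warnings;

# Hypothetical project base class; every MyApp:: name is invented
# for this sketch.
{
    package MyApp::L10N;
    use parent 'Locale::Maketext';
    # _AUTO => 1 lets a key with no entry serve as its own text
    our %Lexicon = ( _AUTO => 1 );
}

# Hypothetical French lexicon. The values are illustrative renderings,
# not the ones shipped with Apache::MP3.
{
    package MyApp::L10N::fr;
    use parent -norequire, 'MyApp::L10N';
    our %Lexicon = (
        "Quick Help Summary" => "Aide rapide",
        "Artist"             => "Artiste",
        "Track"              => "Piste",
    );
}

# Ask for a French language handle. A real program would take the
# language from the user's preferences instead of hard-coding it.
my $lh = MyApp::L10N->get_handle('fr')
    or die "No language handle available";

# The translator function: look the phrase up in the user's lexicon.
sub x { return $lh->maketext(@_) }

my $qhurl = 'help.html';    # placeholder URL for the example
print "<a href='$qhurl'>", x("Quick Help Summary"), "</a>\n";
```

In a real project each language class would normally live in its own .pm file, so that a translator only ever has to touch the one file for his language.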

Then it was a simple matter of sending the lexicon file off to friends and acquaintances who are Perl programmers fluent in other languages (a German here, a Finn there, etc.) and asking “Could you translate this into [your language]?”, after duly explaining what it was for. The whole lexicon file contained only about thirty entries, so it wasn’t a terrible imposition for people. But in then discussing things with the translators, I learned a lot of lessons that I will now summarize in the form of advice to follow when attempting your own localization project.

Start out localizing to just one other language

Once you’ve completed the lexicon file, it’s tempting to say, “well, now I’m done!” and to then type up an email message explaining the project, and send it off to a dozen people at once (one for each language). However, this is a bad idea, because although your lexicon file is totally clear to you, parts of it may not make any sense at all to anyone else. If you send it to a dozen people, you’ll have to answer a dozen exasperated messages asking, “what do you mean by ‘Track’: the track name, or the track number?”

It’s much easier to localize to just one other language for starters. I picked French for Apache::MP3, partly because I’m passably conversant in French (unlike, say, Russian), and partly because I know a lot of Perl programmers who are native speakers of French; many of them, in fact, were already users of Apache::MP3. This gave me an opportunity to put the lexicon file through its paces with this “practice language”. When those helpful French-speaking programmer friends asked what I meant by “Track”, I answered them but also summarized my explanation in a comment in the lexicon template-file, so that translators of future localization modules would see that explanation right from the start. The comments that I built up based on the “practice language” probably cut in half the amount of time that future translators had to spend thinking about how best to translate the interface strings.

Explain every phrase’s meaning

It’s tempting to believe that the meaning of a phrase in an interface comes just from the words in the phrase, and that it should be self-evident how to translate that. However, much of the meaning of phrases in interfaces comes from the context. Consider, for example, the column heading “Time” in Apache::MP3. When a user sees it, it will be as a heading for a column containing items like 3’19”, 5’21”, 14’08”, and 0’09”. He’ll note that these look like durations and interpret “Time” as meaning the amount of time that each track takes to play. But the word “Time” just by itself doesn’t carry that meaning, and if a translator sees this line in a lexicon file:

"Time" => "...",

... he’ll have no idea what context to use. After all, you could just as well mean the time (like 1:53pm) when a song will start playing. To translate the term “Time”, the translator must know whether it means “the time that it takes” or “the time when it starts”. One way to explain this to the translator is to paraphrase the meaning in a comment:

"Time" => "...",   # How long the song lasts

Another way (and I suggest doing it both ways, whenever possible) is to give the translators some screenshots of the English interface, so that they can see where the phrases appear. If they see “Time” as a heading for a column with items like 3’19”, etc., then that picture is worth a thousand words of explanation. It will even illustrate some points that it would not even occur to you to explain. An Apache::MP3 example of this is that “Artist” is used as a heading for a column that sometimes contains band names (“Massive Attack”, “The Shins”, etc.), sometimes an individual musician (“Brian Eno”), sometimes a composer (“Bach”), and sometimes a performer (“Pavarotti”) — but because this was clear in the screen-shot, none of the translators had to ask me to clarify whether “Artist” could be translated using a more specific term like “Singer” or “Composer”.

It is sometimes necessary to explain not just the meaning of a phrase, but also how the program will use it. For example, in Apache::MP3, the lexicon entry “Quick Help Summary” is used both as a link to the help page and as the title for that help page. However, if a translator thought that it was used only as a link to that help page, then he might be tempted to translate it as “Click here if you need help” — which would clearly not work as a title for that page.
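One way to start a commented lexicon template is to generate it from the source, so that each phrase arrives with a note of where the program uses it and a slot for your explanation. The following is only a sketch under stated assumptions: lexicon_template is a hypothetical helper, it recognizes only simple double-quoted literals passed to x(), and the sample source lines are invented for the demonstration.

```perl
use strict;
use warnings;

# Returns one lexicon-template line per distinct phrase found in
# x("...") calls, noting the source line numbers where it occurs.
sub lexicon_template {
    my ($source) = @_;
    my (%seen_on, @order);
    my $line_no = 0;
    for my $line (split /\n/, $source) {
        $line_no++;
        # Only plain double-quoted literals are recognized here
        while ($line =~ /\bx\(\s*"((?:[^"\\]|\\.)*)"/g) {
            my $phrase = $1;
            push @order, $phrase unless $seen_on{$phrase};
            push @{ $seen_on{$phrase} }, $line_no;
        }
    }
    my @template;
    for my $phrase (@order) {
        my $where = join ', ', @{ $seen_on{$phrase} };
        push @template,
            qq{"$phrase" => "...",   # TODO: explain; used on line(s) $where\n};
    }
    return @template;
}

# Invented sample source: one phrase is used both as a link and a title
my $source = <<'END_SOURCE';
print "<a href='$qhurl'>", x("Quick Help Summary"), "</a>\n";
print "<th>", x("Artist"), "</th><th>", x("Time"), "</th>\n";
print "<title>", x("Quick Help Summary"), "</title>\n";
END_SOURCE

print lexicon_template($source);
```

A phrase that turns up on more than one line, like the help link and help title here, is a hint that the comment should describe every one of its uses.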

Let each translator try things out

The best scenario for translators experimenting with different translations is a situation where they can install the software that they’re localizing, and edit a lexicon file. For various reasons, this wasn’t possible with Apache::MP3. But instead of simply accepting each translation with a mere “Thank you, now we’re done with this language!”, I took a screenshot of the application with that new lexicon plugged in, and I sent that screenshot back to the translator with a note like, “Here’s a screenshot; what do you think?”. The time that it took me to reply to translators’ messages with a screenshot was typically about a day, which was short enough that they didn’t think I’d forgotten about them, but long enough that they could look at the screenshot with fresh eyes and often reconsider some translations. Also, when they saw the screenshot, many translators spotted typos that they had missed the first time.

Explain physical constraints on phrases

If a phrase needs to be short in order to fit in a particular widget, you should make that clear in the comments for the translators:

"Time" => "...",
 # Try to use a short phrase here.  Not over 10 characters, please.

If a phrase needs to be only a single word with no spaces or punctuation in it, or needs to be in all capitals for some reason, you need to make that clear, too. But try not to make these restrictions seem too absolute; sometimes the only way to translate something is with a long word, or with several words. So be open to suggestions and/or objections from the translators.

For example, the Russian Apache::MP3 translator came up with the rather long word “prodolzhitel’nost” (“duration”) as the column heading for the run-times of files (3’10”, etc.). However, that word comes out much longer than any of the items in that column, so it ends up looking very strange on the screen. I failed to anticipate this, but it was obvious when we saw the screenshot. I asked the translator whether a shorter word could be used; he thought of the word “dlina”, but warned that it was more vague, meaning just “length” — whether length in time, or length in bytes. I said there would never be a column for length in bytes, and so the vagueness of the term should be all right. So we settled on “dlina” as being the right size.
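A length constraint like this can also be checked mechanically when a translation comes in, before anyone takes a screenshot. The sketch below is a hypothetical helper, not part of Apache::MP3 or Locale::Maketext: the overlong_phrases name and the 10-character limit for “Time” are assumptions (the limit is borrowed from the sample comment above), and the Russian words are written in the same ASCII transliteration used in this article.

```perl
use strict;
use warnings;

# Hypothetical limits, in characters, for phrases that must fit a widget
my %max_length = ( "Time" => 10 );

# Returns one message per translation that exceeds its limit.
# Note: length() counts characters only if the strings are decoded
# (for example under "use utf8"); otherwise it counts bytes.
sub overlong_phrases {
    my ($lexicon, $limits) = @_;
    my @messages;
    for my $key (sort keys %$limits) {
        my $translation = $lexicon->{$key};
        next unless defined $translation;
        my $length = length $translation;
        push @messages,
            "$key: '$translation' is $length characters (limit $limits->{$key})"
            if $length > $limits->{$key};
    }
    return @messages;
}

my %first_draft  = ( "Time" => "prodolzhitel'nost" );
my %second_draft = ( "Time" => "dlina" );

print "$_\n" for overlong_phrases(\%first_draft,  \%max_length);
print "$_\n" for overlong_phrases(\%second_draft, \%max_length);
```

A message from such a check is best treated as a prompt for a conversation with the translator rather than as a rejection, since sometimes the long word really is the only good one.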

Be prepared for trouble with jargon terms

Once we adopt a conventional word for something, we tend to think of it as ordinary and self-explanatory. So when we talk about the bit rate of an MP3 file, we have the feeling that “this file’s bit rate is 160 kbps” is as ordinary and simple an expression as “that car is going 65 mph”. However, not only does “bit rate” convey a more abstract and complex concept, it’s a very recent jargon term, and so you should not be surprised if, for example, the Dutch have not settled on a translation for it. In fact, don’t be surprised if your translators aren’t sure what a particular jargon term means, or how it’s distinct from another term in the lexicon.

For example, while most Apache::MP3 translators had no trouble producing a translation for the jargon terms “file” and “directory”, every one of them was very hesitant with “bit rate” and “sample rate” — even the translator whose specialization was in digital audio signal processing! Some translators decided that just leaving the terms in English would be best. Other translators invented a term as needed, like Icelandic “[term missing? --SMB]” literally “bit-flow”, invented as a translation for “bit rate”. And others used good paraphrases that literally meant things like “sound quality”, “acoustic fidelity”, “compression”, “compression rate”, “kilobits per second”, etc. These variations show that often ordinary language is better than precise language — or as many a project manager has phrased it, “perfect is the enemy of done”.

Also, consider whether you really need a jargon term’s translation to be a jargon term. For example, I advised the translators that there would be no harm in translating “Stream this file” as just “Play this file”, and many of them translated it with a simple verb meaning “to play [a song]” instead of trying to use an awkwardly invented verb for “to stream”. In fact, this might even lead you to consider whether you need the jargon term in the original English interface — can you get away with saying “letters” instead of “font”, or “format” instead of “codec”, or “start” instead of “launch”, or “program” instead of “application”? Admittedly, in some situations you really do need the more precise word; and in yet other situations, there is just no alternative for a jargon word — “interlacing”, for example.

Have exactly one main translator per language

If you ask several people for help localizing something into Spanish (for example), and they all agree and immediately write back with their own independently done translations, you will be in the perplexing situation of having many subtly different documents in a foreign language, with no particular way to decide which is the best one. You could multi-way diff them all and mail the result to everyone. Then you’ll hear many translators saying, “Oh, the way he translated that one phrase is much better than what I thought of”.

But this can quickly turn into an organizational nightmare for you. And, in the end, you’ll still be left with a good number of irresolvable translation differences, like if each of three translators has their own favorite word for “directory”, each considering the others’ to be acceptable but not quite as good as his own. So you should have exactly one main translator per language, who should be in charge of dealing with other translators, and who can make informed and authoritative decisions about others’ suggestions. Then when he decides on a particular translation and gives it to you, it’s final.

Encourage each translator to have a partner/proofreader

Above I said to have “exactly one main translator” — but that doesn’t mean he or she should do it alone. Just as pair programming goes so well in industry (as Extreme Programming fans have recently noted), translation goes particularly well in pairs.

Moreover, it’s hard to find even one translator who is proficient enough in software design that he understands localization and who is a good translator and who has perfect spelling. But it’s much easier to instead find someone who knows software design and can do okay as a translator, but also knows a fluent native speaker who can check his translation.

An alternative arrangement is where the “computer expert” (who might even be you!) is not fluent in the target language, but works with one or two people who are fluent educated speakers of the language. In that situation, the “computer expert” has the worthwhile challenge of explaining the program to the “language experts”, which will likely prove the truth of Einstein’s famous aphorism that “You do not really understand something unless you can explain it to your grandmother”. Such sessions may lead to questions or comments that lead you to redesign the interface — and I think we all know a few open source programs whose interfaces could have used a good redesign!

Design for change, with English fallbacks

Once you’ve localized your program to a few notably useful languages, you’ll eventually want to add a feature to the program in such a way that requires the interface to use a new phrase for which you don’t yet have translations. You should be able to add this feature for English (assuming that English is the program’s “native language”), but what happens when the program tries to look up this phrase in a different language, for the users of the German interface for example? If the program dies with an error “PHRASE 5153 NOT FOUND”, clearly this is no good. Simply rendering the missing phrase as “[PHRASE 5153 MISSING]” is not much better.

A friendlier approach (and one that is quite easy with Locale::Maketext, incidentally) is to fall back to the English phrase when no translation can be found in the user’s language. This way, the programmer can feel free to change parts of the interface without totally ruining things for all the foreign users. You can explain to users that the alpha versions will have bits of English here and there, and that you plan to fix that in the beta. You can email the translators requesting a new phrase or two once you’ve done all the hard work of getting the feature otherwise working.
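As a rough illustration of that fallback, here is one way it can be arranged with Locale::Maketext: an _AUTO entry in the base class’s lexicon makes any key that a language lexicon lacks come out as the key itself, which is the English text. The MyApp::L10N classes, the German value, the “Shuffle All” phrase, and the missing_phrases helper are all assumptions invented for this sketch.

```perl
use strict;
use warnings;

{
    package MyApp::L10N;              # hypothetical base class
    use parent 'Locale::Maketext';
    # Any key not found in a language lexicon is used as its own text
    our %Lexicon = ( _AUTO => 1 );
}
{
    package MyApp::L10N::de;          # hypothetical German lexicon
    use parent -norequire, 'MyApp::L10N';
    our %Lexicon = ( "Artist" => "Interpret" );   # illustrative value
}

my $lh = MyApp::L10N->get_handle('de')
    or die "No language handle available";

print $lh->maketext("Artist"), "\n";       # has a German entry
print $lh->maketext("Shuffle All"), "\n";  # new phrase: English fallback

# Lists the phrases that a given lexicon does not translate yet, so
# that the translator can be asked for exactly those.
sub missing_phrases {
    my ($lexicon, @phrases) = @_;
    return grep { !exists $lexicon->{$_} } @phrases;
}

my @todo = missing_phrases(\%MyApp::L10N::de::Lexicon,
                           "Artist", "Shuffle All");
print "Needs translation: $_\n" for @todo;
```

Keeping _AUTO in the base class rather than in each language class means that the language lexicons contain only real translations, which is what makes the missing-phrase report possible.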

Thank the translators

Not only should you note who the translators are for your own records (so that you can email them if you need another phrase or two translated), you should also publicly thank them in the documentation for at least that lexicon, so that they get credit where it’s due. The more prominent the credit, the happier they’ll feel about the project, and the more likely they are to take an active interest in its future — maybe becoming beta testers or finding other people to help localize the project into new languages.

Welcome new translators

You should make it easy for people not only to change existing localizations, but also to add new ones. For example, in Apache::MP3, each language’s localization is just a short .pm text file that can be changed as needed. Moreover, at time of writing, there is no Greek localization, but if an interested Greek user wanted to make one, he would only need to take the French .pm file, copy it to an appropriately named new file, and replace all the French lexicon values with Greek ones.

The fact that localizations can be corrected or added like that should be prominently explained in the documentation. You should explicitly tell the users that you gladly welcome contributions, and you should clearly explain the process of making a new file and submitting it to you. Also make clear that if the prospective translators have any questions, they should just email you. In my experience, there are immense numbers of clever and eager people all over the world who are learning about programming, and who have been thinking about contributing to open source software, but don’t know where to start. If they see that you welcome contributions of localization files, this will give them a friendly opportunity to learn about the process of producing open source software. And last but not least, it will make life easier and happier for all the users who actually speak the languages that your software will be localized for!

Sean Burke’s new book Perl & LWP, from O’Reilly & Associates, is in bookstores now. By the time you read this, the new version of Apache::MP3 should be available in CPAN along with localization files for dozens of languages. Sean wrote Locale::Maketext, used in localizing Apache::MP3. You can email Sean at: sburke@cpan.org.

[1] That brings up an open question: if it is so easy to prepare software for localization after the fact, should we bother to try to localize right from the beginning? After all, worrying about localization from the very start might be a distraction from getting the application initially working at all.