Localizing Your Perl Programs

Packages Described

Locale::Maketext, Locale::gettext: CPAN

GNU gettext: ftp://prep.ai.mit.edu/pub/gnu/gettext/

Once upon a time, when the Internet was merely an experiment in getting a few dozen machines to talk to each other without actually melting, and when computer science was about getting your accounting program to run in 5K of core, it didn't matter that program output was in English, only in English, and MAYBE EVEN ALWAYS IN CAPITALS. After all, computers were basically designed by and for a few American engineers, and as long as packets were swapped and numbers were crunched, everyone was happy.

But nowadays, computers are becoming part of daily life for much of the planet, and that means that the average user is less and less likely to be a native speaker of English. And software that doesn't work in your native language is very annoying, even if it does work in some other language you understand fluently.

The first step to making software "work" in your language of choice is called internationalization (often abbreviated "I18N"). Internationalizing a piece of software, or a file format, or a protocol, basically means making sure it can convey text in any language. Mercifully, this has been mostly taken care of; modern protocols and data formats, like MIME'd email, HTTP, HTML, and XML, do a fine job of noting what character set your text is in, so that whatever program receives your text will know how to display it. And unlike in the old days, we now have standard character sets capable of representing text in almost any language: notably, there's Latin-1, which does fine for English and most other Western European languages, there's Unicode, which works for all languages, and there's also a slew of other language-specific character sets like KOI8 (among others) for Russian, JIS (among others) for Japanese, VISCII for Vietnamese, and so on. (That's the great thing about standards--there are so many to choose from.)

You can use an email program to write email in whatever language you want, but chances are the interface is still only in English. That software doesn't really "work" in your language of choice.

Making the interface to a program work in the user's language of choice is called localization (often abbreviated "L10N"), and that's what this article is about. For the programmer, localization means an extra bit of "bookkeeping", so to speak, so that instead of having bits of text hard-coded in your program's interface, they get looked up in a little lexicon module--so that if the user is using the program's French interface (assuming one has been provided), your program won't say "File not found", but instead will look up the French phrase for that and say "Fichier non trouvé". And where a GUI button used to say "Search", it now says "Cherchez".

The most widely used localization system is GNU gettext, and while it's a definite advance over previous systems, it and similar systems suffer from some basic deficiencies. Simply put, they don't deal well with the different ways that different languages phrase things. Before I propose solutions to these problems, I have devised a tale of woe to illustrate how frustrating these problems can be.

A Localization Horror Story: It Could Happen to You

Imagine that your task for the day is to localize a piece of software someone else in your company wrote. Suppose it's a simple search tool of some sort, the exact details of which aren't important. Luckily for you, the only output the program emits is two messages, along the lines of "I scanned 12 directories." and "Your query matched 10 files in 4 directories."

How hard could that be? You look at the code that produces the first item, and it amounts to a single call: printf("I scanned %g directories.", $directory_count).

First you have to look up what %g does--it performs number interpolation with nice formatting. But then you think about the above code, and you realize that it doesn't even work right for English, as it can produce the output "I scanned 1 directories."

So you rewrite it to interpolate either "directory" or "directories", depending on the count, through a %s slot, which does the Right Thing. (While looking up %g in the Perl docs for sprintf, you learned that %s is for interpolating strings.)
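A minimal sketch of the two versions, using sprintf so that the results can be inspected. The helper names scanned_naive and scanned_fixed are illustrative and not taken from the program in the story:

```perl
use strict;
use warnings;

# Naive version: the noun is always plural, whatever the count.
sub scanned_naive {
    my ($directory_count) = @_;
    return sprintf("I scanned %g directories.", $directory_count);
}

# Fixed version: pick the noun with a conditional and interpolate it via %s.
sub scanned_fixed {
    my ($directory_count) = @_;
    return sprintf("I scanned %g %s.",
        $directory_count,
        $directory_count == 1 ? "directory" : "directories");
}

print scanned_naive(1),  "\n";    # the ungrammatical case
print scanned_fixed(1),  "\n";
print scanned_fixed(12), "\n";
```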

But you still have to localize it for all the languages spoken by your users, and after a little poking around in CPAN, you find the Locale::gettext module, which is an interface to gettext, a set of C routines that seem well suited to this task. After some poking around AltaVista, you find the gettext manual. You browse through the tutorial, and, following its examples, you start to write a printf call that passes the sentence frame and the noun ('directory' or 'directories') through gettext separately.

But you see later in the gettext manual that this is not a good idea, since how a single word like 'directories' is translated depends on context. In languages like German or Russian, the 'directories' of 'I scanned 12 directories' demands a different case than the 'directories' of 'Your query matched 10 files in 4 directories'. The first is the object of a verb, and the second is the object of a preposition.
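To make the difference concrete, here is a sketch of both approaches. So that it runs without an installed message catalog, gettext() below is a stand-in that returns its argument unchanged; the real gettext() exported by Locale::gettext would look the string up in the current locale's lexicon. The helper names are illustrative.

```perl
use strict;
use warnings;

# Stand-in for Locale::gettext's gettext(): returns the key untranslated.
sub gettext { return $_[0] }

# Word-level lookup: the noun is translated out of context (discouraged).
sub scanned_by_word {
    my ($directory_count) = @_;
    return sprintf(gettext("I scanned %g %s."),
        $directory_count,
        $directory_count == 1 ? gettext("directory") : gettext("directories"));
}

# Phrase-level lookup: each complete sentence is one lexicon entry.
sub scanned_by_phrase {
    my ($directory_count) = @_;
    my $format = $directory_count == 1
        ? gettext("I scanned %g directory.")
        : gettext("I scanned %g directories.");
    return sprintf($format, $directory_count);
}

print scanned_by_phrase(1),  "\n";
print scanned_by_phrase(12), "\n";
```

With the identity stand-in both helpers print the same English text; the difference only shows once a translator has to supply the noun without knowing which sentence it will land in.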

The boss decides that the languages du jour are Chinese, Arabic, Russian, and Italian, so you hire one translator for each and ask for translations of 'I scanned %g directory' and 'I scanned %g directories'. When they reply, you'll put that in the lexicons for gettext to use when it localizes your software, so that when the user is running under the zh (Chinese) locale, gettext("I scanned %g directory.") returns the appropriate Chinese text, with a %g in there where printf can then interpolate the number $directory_count. (Locale primarily means a choice of language, and things that usually accompany that, like character sets, preferences for expressing numbers (whether one and a half is 1.5 or 1,5), and preferences for sort order, since not all languages have the same alphabetical order. Since we don't talk about those other preferences in this article, just think 'language' whenever you see 'locale'.)

Your Chinese translator mails right back--he says both of these phrases translate to the same thing in Chinese, because, to use linguistic terminology, Chinese "doesn't have number as a grammatical category" like English does. That is, English has grammatical rules that depend on whether something is singular or plural; one of these rules is the one that forces nouns to take a suffix (usually 's') when there's more than one ("one dog, two dogs"). Chinese has no such rules, and so has just one phrase where English needs two. No problem; you can have this one Chinese phrase appear as the translation for the two English phrases in the zh gettext lexicon for your program.

Emboldened by this, you dive into the second phrase that your software needs to output: "Your query matched 10 files in 4 directories." You notice that if you want to treat phrases as indivisible, as the gettext manual wisely advises, you need four cases to cover the permutations of singular and plural on each of $dir_count and $file_count. So you write a nested conditional that hands printf one of four complete gettext phrases.

(The case of "1 file in 2 [or more] directories" could, I suppose, occur with symbolic links in the filesystem.)
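The four-way choice could be sketched as follows. As before, gettext() is an identity stand-in for the real Locale::gettext function, and match_report is a name made up for this example:

```perl
use strict;
use warnings;

# Stand-in for Locale::gettext's gettext(): returns the key untranslated.
sub gettext { return $_[0] }

# One complete phrase per combination of singular/plural files and directories.
sub match_report {
    my ($file_count, $dir_count) = @_;
    my $format =
        $file_count == 1
        ? ( $dir_count == 1
            ? gettext("Your query matched %g file in %g directory.")
            : gettext("Your query matched %g file in %g directories.") )
        : ( $dir_count == 1
            ? gettext("Your query matched %g files in %g directory.")
            : gettext("Your query matched %g files in %g directories.") );
    return sprintf($format, $file_count, $dir_count);
}

print match_report(10, 4), "\n";
print match_report(1, 1),  "\n";
```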

This isn't the prettiest code you've ever written, but this seems the way to go. You mail the translators asking for translations for these four cases. The Chinese guy replies with the one phrase that these all translate to in Chinese, and that phrase has two %gs in it, as it should--but there's a problem. He translates it word-for-word: "To your question, in %g directories you would find %g answers." The %g slots are reversed. You wonder how you'll get gettext to handle that.

But you put it aside for the moment, and optimistically hope that the other translators won't have this problem, and that their languages will be better behaved--that they'll be just like English.

The Arabic translator is the next to write back. First off, your code for "I scanned %g directory." or "I scanned %g directories." assumes there's only singular or plural. To use linguistic jargon again, Arabic has grammatical number, like English and unlike Chinese, but it's a three-term category: singular, dual, and plural. In other words, the way you say 'directory' depends on whether there's one directory, two of them, or more than two of them. Your test of ($directory_count == 1) no longer does the job. And it means that where English's grammatical category of number necessitates only two permutations of the first sentence, Arabic has three--and, worse, in the second sentence ("Your query matched %g file in %g directory."), Arabic has nine possibilities where English had only four. You sense an unwelcome, exponential trend taking shape.

Your Italian translator emails you back and says that "I scanned 0 directories" (a possible output of your program) is stilted, and if you think that's fine English, that's your problem, but that just will not do in the language of Dante. He insists that where $directory_count is 0, your program should produce the Italian equivalent of "I didn't scan any directories." And ditto for "I didn't match any files in any directories", although he adds that the last part about "in any directories" should probably be omitted altogether.

You wonder how you'll get gettext to handle this; to accommodate the ways Arabic, Chinese, and Italian deal with numbers in just these few very simple phrases, you need to write code that asks gettext for different queries depending on whether the numerical values in question are 1, 2, more than 2, or in some cases 0, and you still haven't figured out the problem with the different word order in Chinese.

Then your Russian translator calls to deliver the bad news in person. Russian, like German or Latin, is an inflectional language; that is, nouns and adjectives take endings that depend on their case (nominative, accusative, genitive, and so on; what role they play in the syntax of the sentence)--as well as on the gender (masculine, feminine, neuter) and number (singular or plural), as well as on the declension class of the noun. But unlike in most other inflected languages, putting a number-phrase (like "ten" or "forty-three") in front of a Russian noun can change the case and number of the noun, and therefore its ending as well.

He elaborates: In "I scanned %g directories", you'd expect "directories" to be in the accusative case (since it is the direct object) and a plural, except where $directory_count is 1--then you'd expect the singular, of course. Just like Latin or German. But! Where $directory_count % 10 is 1 (assuming $directory_count is an integer, and except where $directory_count % 100 is 11), "directories" is forced to become grammatically singular, which means it gets the ending for the accusative singular. You begin to visualize the code it'd take to test for the problem so far, and still work for Chinese and Arabic and Italian, and how many gettext items that'd take. But he keeps going. Where $directory_count % 10 is 2, 3, or 4 (except where $directory_count % 100 is 12, 13, or 14), the word for "directories" is forced to be genitive singular--which means another ending. The room begins to spin around you, slowly at first... And with all other integer values, since "directory" is an inanimate noun, when preceded by a number and in the nominative or accusative cases (as it is here, just your luck!), it does stay plural, but it is forced into the genitive case--yet another ending. And because the floor comes up to meet you as you fade into unconsciousness, you never get to hear him talk about the similar but subtly different problems with other Slavic languages like Polish.
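The rules he recites can be condensed into one small function. This is only a sketch of the number logic as described above, assuming non-negative integers; the function name and the category labels are made up for illustration, and no actual Russian inflection is attempted:

```perl
use strict;
use warnings;

# Given a non-negative integer count, return which form a Russian noun in
# direct-object position takes after that number, following the rules above.
sub russian_number_class {
    my ($n) = @_;
    my ($last, $last_two) = ($n % 10, $n % 100);
    return 'accusative singular'
        if $last == 1 && $last_two != 11;
    return 'genitive singular'
        if $last >= 2 && $last <= 4 && !($last_two >= 12 && $last_two <= 14);
    return 'genitive plural';    # 0, 5-20, 25-30, and so on
}

printf "%3d => %s\n", $_, russian_number_class($_)
    for 1, 3, 5, 11, 12, 21, 22, 101, 112;
```

Even in this compact form, the test has nothing in common with the ($directory_count == 1) check that English needed, which is the point of the story.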

The above cautionary tale relates how an attempt at localization can lead from programmer consternation, to program obfuscation, to a need for sedation. But careful evaluation shows that your choice of tools merely needed further consideration.

The Linguistic View

The field of Linguistics has expended a great deal of effort over the past century trying to find grammatical patterns that hold across languages; it's been a constant process of people making generalizations that should apply to all languages, only to find out that, all too often, these generalizations fail--sometimes failing for just a few languages, sometimes whole classes of languages, and sometimes nearly every language in the world except English. Linguists can make broad statements about the "average language", but the "average language" is as unreal a concept as the "average person"--no language (or person) is entirely average. The wisdom of past experience suggests that any given language can do just about whatever it wants, in any order, with any kind of grammatical categories--case, number, tense, real or metaphoric characteristics of the concepts that the words refer to, arbitrary classifications of words based on what endings or prefixes they accept, degree of certainty about the truth of statements expressed, and so on.

Mercifully, most localization tasks are a matter of finding ways to translate fixed phrases in their entirety, and where the only variation in content is in a number being expressed, as in the example sentences above. Translating specific, fully-formed sentences is, in practice, fairly foolproof--which is good, because that's what's in the phrasebooks that so many tourists rely on.

Breaking gettext

Most sentences in a tourist phrasebook are of two types: ones like "How much do these ___ cost?" where there's a blank to fill in, and "How do I get to the marketplace?" where there isn't. The ones with no blanks are no problem, but the fill-in-the-blank phrases may not be straightforward. If it's a Swahili phrasebook, for example, the authors probably didn't bother to tell you the complicated ways that the verb "cost" changes its inflectional prefix depending on the noun. The trader in the marketplace will still understand what you're saying if you say "How much do these potatoes cost?" with the wrong inflectional prefix. After all, you can't speak proper Swahili, you're just a tourist. Tourists are supposed to be stupid. Computers are supposed to be smart. The computer should be able to fill in the blank, and have the result be grammatical.

In other words, a phrasebook entry accepts a parameter (the word that goes in the blank), and returns a value based on that parameter. In the case of Chinese, this operation is simple; in the case of Russian, it's quite complex.

This talk of parameters and complexity is just another way to say that an entry in a phrasebook is what we programmers call a "function." Just so you don't miss it, this is the crux of the article: A phrase is a function; a phrasebook is a bunch of functions.

The reason that gettext runs into walls is that it tries to use strings to do something that requires a function, which is futile. Performing printf interpolation on the strings you get back from gettext allows you to do some common things passably well, sometimes, sort of. But to paraphrase what some people say about csh script programming, "it fools you into thinking you can use it for real things, but you can't, and you don't discover this until you've already spent too much time trying, and by then it's too late."

Replacing gettext

So, what we need to replace gettext is a system that supports lexicons of functions instead of lexicons of strings. An entry in a lexicon from such a system should not look like this:
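As a sketch of the contrast (the French wording, the hash names, and the lexicon key are illustrative assumptions): the first lexicon below holds a bare format string of the kind gettext deals in, while the second holds a function that decides for itself how to phrase each count.

```perl
use strict;
use warnings;

# gettext-style entry: a fixed format string with the plural baked in.
my %string_lexicon = (
    'files_in_dirs' => "Il y a %g fichiers dans %g dossiers.",
);

# Function-style entry: the phrase computes its own wording.
my %function_lexicon = (
    'files_in_dirs' => sub {
        my ($files, $dirs) = @_;
        my $f = sprintf("%g %s", $files, $files == 1 ? 'fichier' : 'fichiers');
        my $d = sprintf("%g %s", $dirs,  $dirs  == 1 ? 'dossier' : 'dossiers');
        return "Il y a $f dans $d.";
    },
);

my $from_string   = sprintf($string_lexicon{'files_in_dirs'}, 1, 1);
my $from_function = $function_lexicon{'files_in_dirs'}->(1, 1);

print "$from_string\n";      # plural nouns regardless of the counts
print "$from_function\n";    # nouns agree with the counts
```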

Now, there's no particularly obvious way to store anything but strings in a gettext lexicon, so it looks like we just have to start over and make something better, from scratch. I call my shot at a gettext-replacement system "Maketext", or, in CPAN terms, Locale::Maketext.

When designing Maketext, I planned its main features in terms of "buzzword compliance."

Buzzwords: Abstraction and Encapsulation

The complexity of a language is abstracted inside (and encapsulated within) the Maketext module for that interface. When you call a language handle's maketext method, as in $lang->maketext("You have [quant,_1,piece] of new mail.", scalar(@messages)), you don't know (and in fact can't easily find out) whether this will involve lots of figuring, as in Russian, or relatively little, as in Chinese. That kind of abstraction and encapsulation may encourage other pleasant buzzwords like modularization and stratification, depending on what design decisions you make.

Buzzword: Isomorphism

"Isomorphism" means "having the same structure or form"; in discussions of program design, the word takes on the special, specific meaning that your implementation of a solution to a problem has the same structure as, say, an informal verbal description of the solution, or maybe of the problem itself.

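The gettext-based code in the horror story, with its nested tests for singular and plural, falls short on two counts. First, it's not well abstracted. These ways of testing for grammatical number should be abstracted to each language module, since how you get grammatical number is language-specific.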

Second, it's not isomorphic. The verbal "solution" to our problem is "The way to say what you want in Chinese is with the one phrase 'For your question, in y directories you would find x files'"--and so the implementation should be a straightforward way to spit out that one phrase with the numerals properly interpolated. The complexity of one language shouldn't impede the simplicity of others.

Buzzword: Inheritance

There's a great deal of reuse possible for sharing phrases between modules for related dialects, or for sharing auxiliary functions between related languages. (By auxiliary functions, I mean functions that don't produce phrase-text, but answer questions like "does this number require a plural noun after it?" Such auxiliary functions would be used internally by functions that actually do produce phrase-text.)

Let's assume that you have an interface already localized for American English. Localizing it for UK English should be just a matter of running it past a British person with the instructions to indicate which phrases need rewordings or minor spelling tweaks. The UK English localization module should have only those phrases that are UK-specific; all the rest should inherit from the American English module. The same situation should apply with Brazilian and Continental Portuguese, possibly with some very closely related languages like Czech and Slovak, and possibly with the slightly different versions of written Mandarin Chinese, as I hear exist in Taiwan and mainland China.

For auxiliary functions, consider the problems with Russian numbers. Obviously, you'd want to write only once the hairy code that, given a numeric value, returns which case and number a noun should use. But suppose you discover, while localizing an interface for, say, Ukrainian (a Slavic language related to Russian, spoken by tens of millions of people), that the rules are the same as in Russian for quantification, and many other grammatical functions. While there may well be no phrases in common between Russian and Ukrainian, you could still choose to have the Ukrainian module inherit from the Russian module, just for the sake of inheriting all the various grammatical methods. Or, better, you could move those functions to a module called East_Slavic, from which Russian and Ukrainian could inherit, but which itself has no lexicon.
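A bare-bones sketch of that arrangement in plain Perl, without Maketext. East_Slavic is the name suggested above; the Russian and Ukrainian package names, the number_class method, and its category labels are hypothetical, and the sketch simply takes the supposition that both languages share the rule at face value:

```perl
use strict;
use warnings;

# Shared grammar with no lexicon of its own.
package East_Slavic;
sub new { my $class = shift; return bless {}, $class }

# Auxiliary function: which kind of noun form does this count call for?
sub number_class {
    my ($self, $n) = @_;
    return 'singular' if $n % 10 == 1 && $n % 100 != 11;
    return 'paucal'   if $n % 10 >= 2 && $n % 10 <= 4
                      && !($n % 100 >= 12 && $n % 100 <= 14);
    return 'plural';
}

package Russian;
our @ISA = ('East_Slavic');    # inherits number_class()

package Ukrainian;
our @ISA = ('East_Slavic');    # inherits number_class()

package main;

for my $class (qw(Russian Ukrainian)) {
    my $lang = $class->new;
    print "$class: 22 => ", $lang->number_class(22), "\n";
}
```

Each daughter module would add only its own phrases; the hairy arithmetic is written once.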

Buzzword: Concision

Okay, "concision" isn't a real buzzword. But it should be, so I decree that as a new buzzword, concision means that simple common things should be expressible in very few lines (or maybe even just a few characters) of code--call it a special case of "making simple things easy and hard things possible." It played a role in the MIDI::Simple language, discussed later in this issue. Or just think of it this way: usefulness plus brevity equals concision.

You may sense that a lexicon consisting of hand-written functions, one per phrase, would quickly get repetitive. And you may also sense that you don't want to bother your translators with having to write Perl code--you'd much rather that they spend their very costly time on actual translation.

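In a first-hack implementation of Maketext, each language-module's lexicon was a hash that mapped every phrase to a full anonymous subroutine, but I immediately went looking for a more concise way to denote the same phrase-function--a way that would also serve to denote most phrase-functions in the lexicon for most languages. After much time and thought, I decided on this system: a lexicon entry may be either a real subroutine or a string in a bracketed shorthand that stands for one, and the program asks for a phrase by calling the maketext method on a language handle, $lang, passing the phrase and its parameters (here, scalar(@messages)).

That call means: look in the lexicon for $lang and find the function associated with the phrase. If we find such a function, we call it with $lang as its first parameter, and a copy of scalar(@messages) as its second. If that function was found in string shorthand instead of as a real subroutine, parse it and make it into a function before calling it.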
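For an illustrative phrase such as "You have [quant,_1,piece] of new mail.", the function that the shorthand stands for might look like the sketch below. The Toy::Lang class and its naive quant() are stand-ins written for this example, not part of Maketext:

```perl
use strict;
use warnings;

package Toy::Lang;
sub new { my $class = shift; return bless {}, $class }

# Naive English quantifier: "1 piece", "3 pieces".
sub quant {
    my ($lang, $num, $noun) = @_;
    return $num == 1 ? "$num $noun" : "$num ${noun}s";
}

package main;

# What "You have [quant,_1,piece] of new mail." amounts to as real code.
my $phrase = sub {
    my $lang   = $_[0];
    my @params = @_[1 .. $#_];
    return join '', "You have ", $lang->quant($params[0], 'piece'), " of new mail.";
};

my $lang     = Toy::Lang->new;
my @messages = ('first', 'second', 'third');
my $text     = $phrase->($lang, scalar(@messages));
print "$text\n";
```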

In the function behind a phrase like "You have [quant,_1,piece] of new mail.", quant() is a method you've written to quantify the noun ('piece') given a number ($params[0]).

However, not everything you can write in Perl can be expressed in this shorthand--not by a long shot. For example, consider our Italian translator, who wanted the Italian for "I didn't find any files" as a special case, instead of "I found 0 files." That couldn't be specified (at least not easily or simply) in our shorthand system, and it would have to be written out in full, like this:
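A sketch of such an entry, with English words standing in for the Italian and with a toy quant() method (both are assumptions made for this example):

```perl
use strict;
use warnings;

package Toy::Lang;
sub new { my $class = shift; return bless {}, $class }

# Toy quantifier: number plus whichever of the two supplied forms fits.
sub quant {
    my ($lang, $num, $singular, $plural) = @_;
    return $num . ' ' . ($num == 1 ? $singular : $plural);
}

package main;

my %Lexicon = (
    # Written out in full Perl, because zero gets its own wording.
    'I found [quant,_1,file,files].' => sub {
        my ($lang, $count) = @_;
        return "I didn't find any files." if $count == 0;
        return 'I found ' . $lang->quant($count, 'file', 'files') . '.';
    },
);

my $lang  = Toy::Lang->new;
my $entry = $Lexicon{'I found [quant,_1,file,files].'};
for my $count (0, 1, 7) {
    my $text = $entry->($lang, $count);
    print "$text\n";
}
```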

Next to a lexicon full of shorthand code, this sticks out like a sore thumb--but it is a special case, after all; and at least it's possible, if not concise.

As to how you'd implement the Russian example from the beginning of the article, well, There's More Than One Way To Do It. It could be something like this (using English words for Russian, just so you know what's going on):
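One such possibility, sketched with Locale::Maketext as it is distributed today. The Demo::L10N package names, the stand-in "Russian" forms, and the small %FORMS table are all assumptions made for illustration; a real module would hold actual Russian text and a far larger inflection scheme:

```perl
use strict;
use warnings;

package Demo::L10N;
use Locale::Maketext;
our @ISA = ('Locale::Maketext');

package Demo::L10N::ru;
our @ISA = ('Demo::L10N');

our %Lexicon = (
    'I scanned [quant,_1,directory,directories].'
        => 'I scanned [quant,_1,directory,accusative].',
);

# Stand-in inflection table: word => case => number class => form.
my %FORMS = (
    directory => {
        accusative => {
            one  => 'directory(acc.sg)',
            few  => 'directory(gen.sg)',
            many => 'directory(gen.pl)',
        },
    },
);

# Receives the number, the word to quantify, and the case the syntax wants.
sub quant {
    my ($lh, $num, $word, $case) = @_;
    my ($d, $dd) = ($num % 10, $num % 100);
    my $class = ($d == 1 && $dd != 11)                            ? 'one'
              : ($d >= 2 && $d <= 4 && !($dd >= 12 && $dd <= 14)) ? 'few'
              :                                                     'many';
    return "$num $FORMS{$word}{$case}{$class}";
}

package main;

my $lh = Demo::L10N->get_handle('ru') or die "no language handle";
for my $count (1, 3, 12, 21) {
    my $text = $lh->maketext('I scanned [quant,_1,directory,directories].', $count);
    print "$text\n";
}
```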

This shifts the burden of complexity to the quant() method. That method's parameters are: the number; the Russian word it's going to quantify; and the parameter accusative, which means that this sentence's syntax wants a noun in the accusative case.

Now, the Russian quant() method here is responsible not only for implementing the strange logic necessary for figuring out Russian number-phrases, but also for inflecting the Russian word for "directory." How that inflection is to be carried out is no small issue, and among the solutions I've seen, some are straightforward but not very scalable, and others involve more complexity than is justifiable for all but the largest lexicons.

Mercifully, this design decision becomes crucial only in the hairiest of inflected languages, of which Russian is by no means the worst. Most languages have simpler inflection systems; for example, in English or Swahili, there are generally no more than two possible inflected forms for a given noun ("error/errors"; "kosa/makosa"), and the rules for producing these forms are fairly simple. A simpler inflection system means that design decisions are less crucial to maintaining sanity, whereas the same decisions might incur overhead-versus-scalability problems in languages like Russian. It may also be likely that code has already been written for the language in question, as with Lingua::EN::Inflect for English nouns.

Moreover, there is a third possibility simpler than anything discussed above: Just require that all possible forms be provided in the call to the given language's quant() method, as in "I found [quant,_1,file,files]." That way, quant() just has to choose which form it needs, without having to look up or generate anything. While possibly suboptimal for Russian, this should work well for most other languages, where quantification is not as complicated.

The Devil in the Details

There's plenty more to Maketext than described above--for example, the details of how language tags interact with module naming. Language tags are the things you see in an HTTP Accept-Language header (en-US, x-cree, fi, and so on) or the locale IDs you'd see in $ENV{'LANG'} (which have underscores instead of hyphens: en_US for US English, pt_BR for Brazilian Portuguese). There are the details of how to stipulate what character encodings Maketext will return text in (UTF-8? Latin-1? KOI8?). There's the interesting fact that Maketext is for localization, but nowhere actually has a "use locale" in it. For the curious, there are the somewhat frightening details of how I implement something like data inheritance so that searches across modules' %Lexicon hashes can parallel how Perl implements method inheritance.

And, most importantly, there are all the practical details of how to go about using Maketext for your interfaces, and the various tools and conventions for starting out and maintaining individual language modules.

That is all covered in the documentation for Locale::Maketext and the modules that come with it, available in CPAN. After having read this article, which covers the "why" of Maketext, the documentation, which covers the "how" of it, should be quite straightforward.

But to give just a taste of it, here is the outline of code for English and French in a mythical application called BogoQuery. The BogoQuery/L10N.pm file defines the project's base class. If you wanted any new methods accessible to all of your lexicons, they'd go there. Otherwise, it just inherits from Locale::Maketext, which provides some sane defaults.

The English module, BogoQuery/L10N/en.pm, inherits from that base class, and methods specific to English go there. For example, you could use Lingua::EN::Inflect, and call it in a new quant() method that could automatically figure out that the plural of 'directory' is 'directories'. But in lieu of that, each phrase in the lexicon can simply spell out both forms. The French module, BogoQuery/L10N/fr.pm, is built the same way.
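A compact, single-file sketch of that outline follows. In a real project each package would live in its own file (BogoQuery/L10N.pm, BogoQuery/L10N/en.pm, and so on); the phrase and the French wording are illustrative assumptions, and the sketch relies on the default quant() method that Locale::Maketext provides:

```perl
use strict;
use warnings;

# --- BogoQuery/L10N.pm: the project's base class ---
package BogoQuery::L10N;
use Locale::Maketext;
our @ISA = ('Locale::Maketext');
# Methods shared by every language would go here.

# --- BogoQuery/L10N/en.pm: English ---
package BogoQuery::L10N::en;
our @ISA = ('BogoQuery::L10N');
our %Lexicon = (
    _AUTO => 1,    # keys missing from the lexicon serve as their own text
    'I scanned [quant,_1,directory,directories].'
        => 'I scanned [quant,_1,directory,directories].',
);

# --- BogoQuery/L10N/fr.pm: French ---
package BogoQuery::L10N::fr;
our @ISA = ('BogoQuery::L10N');
our %Lexicon = (
    'I scanned [quant,_1,directory,directories].'
        => "J'ai parcouru [quant,_1,dossier,dossiers].",
);

# --- the application ---
package main;

for my $tag (qw(en fr)) {
    my $lh = BogoQuery::L10N->get_handle($tag)
        or die "No language handle for $tag";
    for my $count (1, 12) {
        my $text = $lh->maketext('I scanned [quant,_1,directory,directories].', $count);
        print "$text\n";
    }
}
```

The application code never mentions a language by name beyond the tag it passes to get_handle(); everything else is decided inside the language modules.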

Adding support for new languages is now just a matter of having a translator provide the text for a new BogoQuery/L10N/zh.pm (zh for Chinese), it.pm (it for Italian), and so on.

Because of Russian's complicated handling of numbers, BogoQuery/L10N/ru.pm would have to provide a quant() method of its own, but that wouldn't require any change to the other modules. The same is true for Arabic, since its quant() method would deal with the singular/dual/plural distinction in the language.

Chinese, which was so problematic for gettext, is easy with Maketext: its %Lexicon entry for the second message is a single phrase in which the slot for the directory count simply comes before the slot for the file count. Incidentally, the quant() method in Chinese wouldn't need to do anything more than put a number in front of the noun, since there's no grammatical pluralization in Chinese.
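Such an entry might be sketched as follows, with English words standing in for the actual Chinese text. The package layout mirrors the BogoQuery naming used in this article; the exact key and the quant() override are assumptions made for illustration:

```perl
use strict;
use warnings;

package BogoQuery::L10N;
use Locale::Maketext;
our @ISA = ('Locale::Maketext');

package BogoQuery::L10N::zh;
our @ISA = ('BogoQuery::L10N');

our %Lexicon = (
    # One phrase covers every count; note that _2 comes before _1.
    'Your query matched [quant,_1,file,files] in [quant,_2,directory,directories].'
        => 'To your question, in [quant,_2,directory] you would find [quant,_1,answer].',
);

# No grammatical number: just put the numeral in front of the noun.
sub quant {
    my ($lh, $num, $noun) = @_;
    return "$num $noun";
}

package main;

my $lh = BogoQuery::L10N->get_handle('zh') or die "no language handle";
my $text = $lh->maketext(
    'Your query matched [quant,_1,file,files] in [quant,_2,directory,directories].',
    10, 4);
print "$text\n";
```

The calling code still passes the file count first and the directory count second; the reordering lives entirely in the lexicon entry.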

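The case of Italian requiring "I didn't scan any directories" instead of "I scanned 0 directories"--well, that's the one case so far that can't be treated via our shorthand notation. It requires actual Perl code: a subroutine, stored in the lexicon in place of a shorthand string, that tests for zero before it falls back on quant().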

However, such cases are relatively rare. Most phrases can be translated either as fixed strings, or fixed strings with a few bracket shorthand bits, meaning that the translators can focus on the translating.

Proof in the Pudding: Localizing Web Sites

Maketext and gettext have a notable difference aside from their approach to languages: gettext is in C, accessible through C library calls, whereas Maketext is in Perl, and can't work without a Perl interpreter. Unlucky accidents of history have made C++ the most common language for the implementation of applications like word processors, web browsers, and even many in-house applications like custom query systems. Current conditions make it somewhat unlikely that the next one of any of these kinds of applications will be written in Perl, albeit more for reasons of inertia than what tool is best for the job.

However, other accidents of history have made Perl a well-accepted language for design of server-side programs (often CGI programs) for web site interfaces. Localization of static pages in web sites is trivial, either with simple language-negotiation features in servers like Apache, or with some kind of server-side inclusions of language-appropriate text into layout templates. However, the localization of Perl-based search systems (or other kinds of dynamic content) in web sites, be they public or access-restricted, is where Maketext will see the greatest use.

The ever-increasing internationalization of the web makes it increasingly likely that the interface to the average dynamic content service will be localized for two or maybe three languages. It is my hope that Maketext will make that task as simple as possible, and will remove previous barriers to localization for languages dissimilar to English.

Sean M. Burke (sburke@cpan.org) has a Master's in linguistics from Northwestern University; he specializes in language technology. Jordan Lachler is a PhD student in the Department of Linguistics at the University of New Mexico; he specializes in morphology and the pedagogy of North American native languages.

Forbes, Nevill. Russian Grammar. Third Edition, revised by J. C. Dumbreck. Oxford University Press, 1964.