International Sorting with Perl's sort

Sean M. Burke

Issue 14, Summer 1999
Packages Used

Sort::ArbBiLex, Memoize

In my first semester of Spanish class in high school, I went to look up an unfamiliar word, "chaleco", in a Spanish-English dictionary. I looked under "C", and found that the dictionary went right from "cetro" to "ca". Someone had expurgated all the "ch" words! I had a brief nightmarish vision of a world without chorizo, chimichangas, chicharrones, or churros. After some frantic page-turning, I discovered that the "ch" words were in a separate section, "Ch", between "C" and "D". I asked the teacher about this, and he explained that it was normal practice for Spanish alphabetical order to consider "Ch" a letter after "C". But it seemed ludicrous to me -- two letters that counted as one. "How pointlessly complicated!" I thought. "Why not just keep it simple, A to Z, like normal? Like English."

I later learned that every language has its own particular idea of what "alphabetical order" means; the fact that English's conception of it seems so "normal" is partly because English doesn't use any accents and partly because of accidents of history.

But many other languages use accented characters that have to be sorted with the 26 letters of the "normal" A-through-Z alphabet. And with other languages, some combinations of characters like the "ch" in Spanish, count as letters on their own. But in almost every case, if you want to sort according to the conventions of a particular language, the default behavior of Perl's sort won't sort that way. This article is about how to get Perl to sort according to the conventions of whatever language you have in mind -- even if it's English!

Default sort Versus "Normal" English Sorting

Let's say you want to sort a list of words (or phrases) in what you think of as normal English alphabetical order. So you try using normal sort:

@stuff = ("canine", "cantaloupe",
          "cant",                  # as in an underworld jargon
          "Canberra", "can't", 
          "Cantonese", "cannery",
          "Cannery Row", "canonicity",
          "Caon de Chelly"  # In north-eastern Arizona,
          # also spelled "Canyon de Chelly" and
          # "Caon de Chelle"
          );
@sorted = sort @stuff;              # the sorting happens here
print map "[$_] ", @sorted;

That prints:

[Canberra] [Cannery Row] [Cantonese] [Caon de Chelly] 
[can't] [canine] [cannery] [canonicity] [cant] [cantaloupe]

Whoa. All the capitals are sorting first. That's because sort's default behavior (what you get without a "sort criterion" or without use locale, both of which we'll discuss later) is ASCIIbetical sorting -- where the sorting is based on ASCII order. Since "C" comes before "c" in ASCII, all the "C" items in @stuff (like "Cantonese") get sorted before all the "c" items (like "cantaloupe").

So you happen to remember an idiom for case-insensitive sorting, and you change the line that sets @sorted to:

@sorted = sort { lc($a) cmp lc($b) } @stuff;

and then you rerun the code. It prints:

[can't] [Canberra] [canine] [cannery] [Cannery Row]
[canonicity] [cant] [cantaloupe] [Cantonese] [Caon de Chelly]

Closer. What you actually consider proper sorting looks like this:

[Canberra] [canine] [cannery] [Cannery Row] [Caon de Chelly] 
[canonicity] [cant] [can't] [cantaloupe] [Cantonese]

The phrases "can't" and "Caon de Chelly" are out of place. "can't" is out of place because { lc($a) cmp lc($b) } treats "can't" as a five character string that sorts before anything else in @stuff. Consider this code:

print ( "can't" cmp "canal" );

That prints -1, meaning that "can't" comes before "canal". This is because cmp is doing simple ASCIIbetical comparison, and when it compares "can't" to "canal" it gets as far as comparing "can'" to "cana". At that point it sees that the apostrophe character comes before a, because the apostrophe is ASCII 39, and "a" is ASCII 97.

Now, this is also why "Caon de Chelly" is coming last: because "" is a character after "n". For sake of argument, I'll assume that you are, like me, using Latin-1 as opposed to UTF8, so I can say that "" is a single byte: byte 241, in particular. (If you're using MacPerl and therefore probably using MacASCII, it's a different code, but it's still one byte with a value over 127, so my point stands. If you don't know what encoding you're using, you're probably using Latin-1.)

So what you want is to sort this list according to your idea of English alphabetical order -- ignoring apostrophes, treating "" and "n" as the same letter, and of course ignoring case.

What you need is a subroutine we can call like this:

@sorted = sort { normalize($a) cmp normalize($b) } @stuff;

where your normalize() subroutine lcs things, turns ""s to "n"s, and removes apostrophes (in no particular order). That function could consist of:

sub normalize {
    my $in = $_[0];
    $in =~ tr//Nn/;
    $in =~ tr/'//d; # d for delete
    return lc($in);
}

Paste that into our original code, and run it, and it'll display this (lined up vertically just for better perusal):

    [Canberra]
    [canine] 
    [cannery] 
    [Cannery Row] 
    [Caon de Chelly] 
    [canonicity] 
    [can't] 
    [cant] 
    [cantaloupe] 
    [Cantonese] 

And that's basically right. Now, the only peculiarity there is "cant" versus "can't". It so happens that when you feed both into your normalize() subroutine, you get "cant". So when your sort criterion compares them using { normalize($a) cmp normalize($b) }, it's performing "cant" cmp "cant", which returns 0, meaning that these two sort identically. But since your use of sort produces a list were "can't" either comes before or after "cant", having your sort criterion return a 0 means that you don't care which of the two items comes first in the output, which effectively means that you can't predict which will end up first. Personally, I don't want my sort criterion to ever be unpredictable, so I add something that kicks in to avoid returning 0 when comparing different strings:

@sorted = sort { normalize($a) cmp normalize($b)
                 or $a cmp $b
               } @stuff;

In other words, when normalize($a) cmp normalize($b) evaluates to 0, the routine falls through to returning the value of $a cmp $b. That makes this a completely predictable sort criterion, since $a cmp $b never returns 0 for different strings.

However, if you wanted something smarter than just $a cmp $b, you could use some second subroutine, normalize2(), that could be a bit more fine-grained than normalize(). Maybe it would implement the idea that, in case of a tie in normalize(), words with apostrophes (like "can't") should always come after words without them (like "cant"), or that "Chile" (the country) should always be always after "chile" (the hot sauce), and so on. You'd call that second part of your sort criterion as:

@sorted = sort { normalize($a) cmp normalize($b)
                 or normalize2($a) cmp normalize2($b)
               } @stuff;

To be really thorough, you could add in a cmp at the end:

@sorted = sort { normalize($a) cmp normalize($b)
                 or normalize2($a) cmp normalize2($b)
                 or $a cmp $b
                } @stuff;

Incidentally, falling back on a second comparison as a sort of "tie-breaker" in a sort criterion is basically what people mean when they refer to "bi-level sorting". We'll return to this idea later.

Locale-based Sorting

The idea of being able to sort things according to the conventions of other languages is not a new one. The perllocale documentation bundled with Perl describes how to take advantage of locales built into many OSes. Ideally, you'd set the locale to the language that sorts the way you want to sort, and then your calls to sort or cmp do the right thing. So if I set my locale to fr_CA.ISO8859-1 (meaning "French Canadian, using Latin-1"), "tude" will sort (correctly) with the "e"'s, instead of after the "z"'s, which is how it'd be sorted ASCIIbetically.

But locales might not be available on all computers. As perllocale points out: "The available locales, the location in which they are kept, and the manner in which they are installed, vary from system to system. Some systems provide only a few, hard-wired, locales, and do not allow more to be added; others allow you to add 'canned' locales provided by the system supplier; still others allow you or the system administrator to define and add arbitrary locales."

In other words, if you want to sort a list of French words according to French sorting conventions, even if you can get a French locale to work on one system, and even if that locale's idea of French sort order is the same as your idea of French sort order, there's still no guarantee that your locale-based sorting will be able to use the same locale on someone else's system. Because of these basic problems with locales, I consider locale-based sorting (even where available) to be fine for one-shot programs, but these portability problems make it unacceptable for use in code that I'd actually want to distribute.

So, in short, much of the code in this article basically duplicates the functionality of some of the sorting you might be able to get from locales, but in a more portable and flexible way.

Spanish: Cana y Caa

Earlier, we saw how to treat "" as just an alternate form for "n", which is appropriate for English. But suppose you actually wanted to sort a list of Spanish words according to Spanish sorting conventions. In that case, you want to treat "" not as an alternate form for "n", but instead as a letter between "n" and "o". In that case, you'd develop a sort criterion, as above, based on a normalize() subroutine, wherein you'd have to move the letters of the alphabet around like so:

tr<abcdefghijklmnopqrstuvwxyz>  # map from this ...
  <abcdefghijklmnopqrstuvwxyz[>; #    ...to this

In other words, you want to keep "a" thru "n" as they are, and then have "" -- and this means bumping "o" thru "z" down by one character code to make way for the "". That "[" is there just because it's the character after "z" in ASCII.

That's one way to do it. However, I find it a bit confusing, since that way makes the Spanish alphabet look like a strange decoder-ring substitution cypher. What I prefer is this:

tr<abcdefghijklmnopqrstuvwxyz>
  <\x01-\x1B>; 
  # 1B is hex for decimal 27, for the 27 letters

If I add or remove characters from the alphabet on that first line, all I have to remember to do is change the 1B there to reflect however many characters are in the alphabet of characters that I'm starting with. (Since I'm just going to end up feeding the output of this normalize subroutine to cmp, it doesn't matter whether I'm mapping the alphabet to the range a-[ or to the range \x01-\x1B.)

Here's how you'd work this into your normalize subroutine:

sub normalize {
   my $in = $_[0];
   $in = lc($in);
   $in =~ tr///; # lc probably didn't catch this
   $in =~ tr<abcdefghijklmnopqrstuvwxyz>
   <\x01-\x1B>; # 1B = 27
   return $in;
}

Then you can test it with this:

[...the code for normalize(), above...]
@stuff = ("cana", "Cantata", "caa", "cantina", 
                 "canoso", "caonero", "capa");
@sorted = sort { normalize($a) cmp normalize($b)
                 or $a cmp $b
} @stuff;
print map "[$_] ", @sorted;

When run, this program returns:

[cana] [canoso] [Cantata] [cantina] [caa] 
[caonero] [capa]

...which is right! But change @stuff to these:

@stuff = ("cana", "caa", "cnula", "cantina", 
          "cant", "canto", "cantor");

and you (re)discover a problem:

[cana] [cantina] [canto] [cantor] [cant] [caa] [cnula]

And that's quite wrong. Spanish, you see, uses acute accents (like over the "o" in "cant") -- but "" isn't considered a separate letter from "o". This is the same problem you faced in the English data set from the start of the article, except that here it's not "n" and "" we want to treat as alternates, but "o" and "" -- and, while we're at it, "/a", "/e", "/i", "/u", and the somewhat rare "". So you change the normalize subroutine:

sub normalize {
  my $in = $_[0]; 
  
  $in = lc($in);
  # lc probably didn't catch this
  $in =~ tr///; 
  # lc probably failed to turn  to , etc 
  $in =~ tr<>  
              <aeiouuaeiouu>;
  $in =~ tr<abcdefghijklmnopqrstuvwxyz>
              <\x01-\x1B>; # 1B = 27
  return $in;
}

Run this code, and you get:

[cana] [cantina] [canto] [cant] [cantor] [cnula] [caa]

Which (ta-daa!) is The Right Thing.

Spanish: Chorizo, Chimichangas, Chicharrones y Churros

So you've now got a sort criterion and an associated subroutine (normalize) that together implement Spanish sorting conventions as far as treatment of , , and the accented vowels. But recall from the start of this article that Spanish has a letter "ch" between "c" and "d".

So far we've been massaging all the data using character-to-character substitution (using the tr operator), so that cmp's ASCIIbetical character-by-character comparison would do what we want. However, that all assumes that sorting is about single characters. But since "ch" consists of two ASCII characters, it won't fit well into our plan of using normal cmp. And "ch" is not alone: Spanish has one other two-character letter, the double-ell "ll", as in "llama", "quesadilla", and so on. Now, you could break down and write a subroutine that basically does the same work as Perl's builtin cmp, but considers character-clusters like "ch" that you want to treat as single elements. However, that would be very inefficient compared to the speed of Perl's builtin cmp. A more efficient way of doing it consists of simply turning the clusters into single characters, so that cmp can be made to work right on them. So if you simply turn all occurrences of "ch" to, say, "" (which is presumably not to be found in any of the items we're sorting), you can pretend that "chimichanga" is really "imianga" -- then you can treat "" as just another strange letter, like "" is. Similarly you could turn "ll" to "", say.

This would look like:

sub normalize {
  my $in = $_[0];
  $in = lc($in);
  $in =~ s/ch//g;   # chimichanga => imianga
  $in =~ s/ll//g;   # llama => ama
  $in =~ tr///;
  $in =~ tr<>
               <aeiouuaeiouu>;
  $in =~ tr<abcdefghijklmnopqrstuvwxyz>
               <\x01-\x1D>;
  # 1D = 29, for the 29 letters we now have
  
  return $in;
}

And then you can test it with:

[...the code for normalize() above...]
@stuff = ("enchufe", "Enciclopedia de Mxico", "endibia",
          "enchilada",  "encogido", "encanto");
@sorted = sort { normalize($a) cmp normalize($b)
          or $a cmp $b
          } @stuff;
print map "[$_] ", @sorted;

The output, which is correct:

[encanto] [Enciclopedia de Mxico] [encogido] 
[enchilada] [enchufe] [endibia]

Your normalize() subroutine now correctly implements Spanish-style sorting.

Bi-Level Sorting to the Rescue

There is a problem with our approach so far -- and it might not even be a real problem for you, depending on why you're sorting your data. Earlier, I talked about what happens when a sorting subroutine returns the same value for a pair of items, like "cant" and "can't" for an English normalize(), or "canto" and "cant" for a Spanish normalize(), or "Chile" and "chile" with either. So far we've been sort of cheating, with criteria like these:

@sorted = sort { normalize($a) cmp normalize($b)
          or $a cmp $b
          } @stuff;

This worked because the last expression, $a cmp $b, just happens to have correctly resolved ties that arise with normalize($a) cmp normalize($b). That was just dumb luck. And if, like many dictionaries, you want "Chile" to come after "chile", then plain old cmp as a tiebreaker does the wrong thing. So we need bi-level sorting with a normalize2() function as a better tiebreaker:

@sorted = sort { normalize($a) cmp normalize($b)
          or normalize2($a) cmp normalize2($b)
          or $a cmp $b 
          } @stuff;

So let's implement a normalize2() subroutine that correctly breaks normalize() ties. Let's continue with Spanish, and let's suppose that given a tie between variants of the letter "e", the order they should come out in is:

e E  

Now, you could use the same sort of code as in normalize(), this time implementing an alphabet consisting of

a A   b B c C ch Ch CH d D e E   ...

However, consider that normalize2() is just a tie-breaker -- it doesn't need to distinguish "a" from "b" -- it'd never be called in a case where an "a" in one position would need to be compared to a "b" in another, since that would not have resulted in a tie between normalized strings. In other words, all normalize2() needs to do is distinguish letters that normalize() obliterated the difference between -- letters in the same "family". In other words (grouping these letters into families) you need only map from these:

a A      b B    c C    ch Ch CH    d D    e E    ...

Onto these:

1 2 3 4    1 2    1 2    1  2  3     1 2    1 2 3 4  ...

And you can implement that this way:

sub normalize2 {
  my $in = $_[0];
  # digraph things...
  $in =~ s/ch//g;   # chimichanga => imianga
  $in =~ s/Ch/*/g;   # Chimichanga => *imianga
  $in =~ s/CH/*/g;   # CHIMICHANGA => *IMI*ANGA
  $in =~ s/ll//g;   # llama => ama
  $in =~ s/Ll//g;   # Llama => ama
  $in =~ s/LL//g;   # LLAMA => AMA
  # now the big whammy...
  $in =~tr<aAbBcC**dDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ>
          <12121212312123412121212341212121231212123412121212121234561212121212>;
  return $in;
}

To get a better feeling for the output of this function, consider:

normalize2("chile")              is "1111" 
normalize2("Chile")              is "2111"
normalize2("CHILES RELLENOS!")   is "32222 2232222!"
normalize2("cant")              is "11113"
normalize2("Canto")              is "21111"
normalize2("CANT")              is "22224"

So consider what happens when sorting "chile" and "Chile"; the sort criterion considers the expression

normalize("chile") cmp normalize("Chile")
or normalize2("chile") cmp normalize2("Chile")
or "chile" cmp "Chile"

This simplifies to:

"chile" cmp "chile"     # first subexpression
or "1111" cmp "2111"    # second subexpression
or "chile" cmp "Chile"  # last subexpression

The first cmp subexpression evaluates to 0, falling through to the expression consisting of the two values from normalize2. Between them, "1111" (from "chile") comes first ASCIIbetically, so "1111" cmp "2111" returns 1, to signal that "chile" should come before "Chile". (Perl never gets as far as evaluating the last subexpression, "chile" cmp "Chile".)

English: Rsum and Resume

Now, this whole business of bi-level sorting may all seem very abstract and, well, foreign, if the only thing you've ever sorted is English. But consider if you're sorting this list of English words:

rot    rsum    resume    rabble    return

and you want it to sort correctly:

rabble    resume    rsum    return    rot

In other words, you want "resume" to always sort before "rsum". If you use a one-level sort like this:

@sorted = sort { normalize($a) cmp normalize($b)
          } @stuff;

You have a choice. Either treat "e" and "" as the same letter (as with "" and "n" in our Canberra/canine/cannery example), or treat "" as a letter after "e". If you treat "e" and "" as the same letter, then the ordering of "resume" and "rsum" would be unpredictable, since your normalize() will return the same value for both.

But if you treat "" as a letter after "e" (and that seems to be many people's first guess at a solution, I've found), that means that "" will be a letter between "e" and "f", and all the "r-" words will come after all the "re-" words -- so that your list will sort as:

rabble    resume    return    rsum    rot

That's wrong. So if you want this to sort right, you need at least two levels in your sorting. And since I've yet to see a case where more than two levels of sorting were needed, that pretty much leaves you with bi-level sorting.

So, like it or not, the only way to get really correct sorting in English is to use bi-level sorting. And this is not just a problem with English having foreign words like "rsum" -- the same problem arises with wanting to sort "Bath" and "bath", say.

Optimizing with Memoization

As Perl evaluates a sort criterion while sorting a list, it will ask that criterion to compare several of the items against each other. To see it at work, you can run:

@stuff = sort { print "$a & $b ; "; $a cmp $b } qw(A B C D E F);

and you'll see something along the lines of:

A & D ; B & D ; C & D ; D & F ; D & E ; E & F ; A & B ; B & C ;

If your criterion, like most of the ones in this article, will have to call normalize() (and maybe normalize2()) for whatever items they're asked to compare, then you can see that you're going to be calling normalize("D") several times. There's no point in re-computing it, since normalize("D") always gives the same answer, so all the calls after the first is just wasted effort. To make your sort criterion more efficient, you can cache the results of the function calls. Caching the results of a function like this is commonly called "memoization". In other words, instead of evaluating the expression normalize($a), you look to see if you computed it earlier and saved the result. otherwise, you compute the value and stow it in the cache for next time. So wherever you have:

function($INPUT)

you would use

     exists($cache{$INPUT}) ? $cache{$INPUT}
                            : ($cache{$INPUT} = 
            function($INPUT))

Worked into our bi-level sort criterion, this would look like:

{
  my(%cache, %cache2);
  @sorted = sort {
     ( exists($cache{$a}) ? $cache{$a}
         : ($cache{$a} = normalize($a)) )
  cmp
     ( exists($cache{$b}) ? $cache{$b}
         : ($cache{$b} = normalize($b)) )
  or
     ( exists($cache2{$a}) ? $cache2{$a}
         : ($cache2{$a} = normalize2($a)) )
  cmp
     ( exists($cache2{$b}) ? $cache2{$b}
         : ($cache2{$b} = normalize2($b)) )
  or
  $a cmp $b
     } @stuff;
}

It's not pretty, but it does avoid having to needlessly recompute normalize(ITEM) several times for each item being sorted. And the only thing better than correct sorting is faster correct sorting.

(Note: I've presented memoization only in the context of sorting. For more general applications, see Mark-Jason Dominus's article on the topic in TPJ #13, or the Memoize module in CPAN.)

Sorting it All Out

Now, I've heard that in the years since I took my last Spanish class, the Spanish Academy has decided to stop giving "ch" special treatment, so that "churro" will be, they decree, under "C", somewhere between "cesto" and "cicatriz". However, I don't know to what degree this has been accepted by the average Spanish speaker, much less the people who make the phone books and dictionaries in all the Spanish-speaking countries.

But even if everyone's idea of Spanish sorting conventions suddenly gets simpler (by doing away with those "ch" and "ll" digraphs), it'll still need bi-level sorting, just like English.

In fact, because implementing the bi-level sorting presented in this article is so common, I've written a module that does it for you. Sort::ArbBiLex (for "arbitrary bi-level lexicographic" sorting) allows you to specify a sort order (possibly including multi-character letters like Spanish "ch") for which it will build a subroutine that sorts that way. The internals of Sort::ArbBiLex (available in CPAN) are frightening, but they're just an elaborate version of the techniques discussed in this article, adapted to the kinds of sorting found in most languages.

__END__


Sean M. Burke (sburke@cpan.org) likes green chile and churros, and has a pet llama named Buuel.


[Back to list of articles]