Time-stamp: "1999-04-01 01:12:40 MST"

[To the editors: do you think I should have a "Optimization: The
Schwartzian Transform" section after the Memoization section?  I think
it might be may too long to stick in here.  What do you think?  Maybe
it should be an article of its own, but I'm not sure it's quite that
long.  Altho maybe if it were an article demonstrating the use of the
debugger and its x EXPR command, which does that neat Datadumper-like
indented printing of ref-rich structures...  But then I'd have to
actually learn to use the debugger.  Oy.  Tell me what you think.]


=head1 International Sorting with Perl's C<sort>

 -- Sean M. Burke

I remember one time in my first semester of Spanish class in high
school, I went to look up an unfamiliar word, "chaleco", in a
Spanish-English dictionary.  I looked under "C", and found that the
dictionary went right from "cetro" to "cía".  Someone had expurgated
all the "ch" words!  I had a brief nightmarish vision of a world
without chorizo, chimichangas, chicharrones, or churros.  After some
frantic page-turning, I discovered that the "ch" words were in a
separate section, "Ch", between "C" and "D".  I asked the teacher
about this, and he explained that it was normal practice for Spanish
alphabetical order to consider "Ch" a letter after "C".  But it seemed
ludicrous to me -- two letters that counted as one.  C<How pointlessly
complicated!,> I thought.  I<Why not just keep it simple, A to Z, like
normal?  Like English.>

I later learned that I<every> language has its own particular idea of
what "alphabetical order" means; the fact that English's conception of
it seems so "normal" is partly because English doesn't use any accents
and partly just because of accidents of history.

But many other languages use accented characters that have to be
sorted somehow in with the 26 letters of the "normal" A-thru-Z
alphabet.  And with other languages, some combinations of characters,
like the "ch" in Spanish, count as letters on their own.  But in
almost every case, if you want to sort according to the conventions of
a particular language, the default behavior of Perl's C<sort> won't
sort that way.  This article is about how to get Perl to sort
according to the conventions of whatever language you have in mind --
even if it's English!

=head2 Default C<sort> Versus "Normal" English Sorting

Let's say you want to sort a list of words (well, word-or-phrases) in
what you think of as normal English alphabetical order.
So you try using normal C<sort>:

 @stuff = ("canine", "cantaloupe",
           "cant",  # as in an underworld jargon
           "Canberra", "can't", "Cantonese", "cannery",
           "Cannery Row", "canonicity",
           "Cañon de Chelly"  # In north-eastern Arizona,
              # also spelled "Canyon de Chelly" and
              # "Cañon de Chelle"
          );
 @sorted = sort @stuff; # the sorting happens here
 print map "[$_] ", @sorted;

That prints:

 [Canberra] [Cannery Row] [Cantonese] [Cañon de Chelly] [can't]
  [canine] [cannery] [canonicity] [cant] [cantaloupe]

Whoa, all the capitals are sorting first.  That's because C<sort>'s
default behavior (i.e., what you get without a "sort criterion" or
without saying "use locale") is ASCIIbetical sorting -- where the
sorting is based on ASCII order.  Since "C" comes before "c" in ASCII,
all the "C" items in C<@stuff> (like "Cantonese") get sorted before
all the "c" items (like "cantaloupe").

So you happen to remember an idiom for case-insensitive sorting, and
you change the line that sets C<@sorted> to:

 @sorted = sort { lc($a) cmp lc($b) } @stuff;

and then you rerun the code.  It prints:

 [can't] [Canberra] [canine] [cannery] [Cannery Row] [canonicity]
  [cant] [cantaloupe] [Cantonese] [Cañon de Chelly]

This is closer to being in alphabetic order.  What you actually
consider proper sorting would look like this:

 [Canberra] [canine] [cannery] [Cannery Row] [Cañon de Chelly]
  [canonicity] [cant] [can't] [cantaloupe] [Cantonese] 

So what's out of place so far is "can't" and "Cañon de Chelly".  Now,
"can't" is out of place because C<sort> plus your criterion C<{ lc($a)
cmp lc($b) }> considers "can't" a five character string that does
indeed sort before anything else in C<@stuff>.  Consider this code:

 print( "can't" cmp "canal" );

That prints "-1", meaning that C<cmp> says that "can't" comes before
"canal".  This is because C<cmp> is doing simple ASCIIbetical
comparison, and when it compares "can't" to "canal" it gets as far as
comparing "can'" to "cana", at which it sees that the apostrophe
character in the first one "comes before" the "a" character in the
second one -- because apostrophe is ASCII 39, and "a" is ASCII 97.

Now, this is also why "Cañon de Chelly" is coming last: because "ñ" is
a character after "n".  For sake of argument, I'll assume that you
are, like me, using Latin-1 (as opposed to UTF8, notably), so I can
say that "ñ" is a single byte, code 241.  (If you're using MacPerl and
therefore probably doing this in MacASCII, it's a different code, but
it's still one byte, with a value over 127, so my point stands.)

So what you want is to sort this list according to your idea of
English alphabetical order -- ignoring apostrophes, treating "ñ" and
"n" as the same letter, and of course ignoring case.

What you need is a function we can call like:

 @sorted = sort { &normalize($a) cmp &normalize($b) } @stuff;

where your C<normalize> function C<lc>s things, turns "ñ"s to "n"s,
and removes apostrophes (in no particular order).  That function could
consist of:

 sub normalize {
   my $in = $_[0];
   $in =~ tr/Ññ/Nn/;
   $in =~ tr/'//d; # d for delete
   return lc($in);
 }

Paste that into our original code, and run it, and it'll say:

 [Canberra] [canine] [cannery] [Cannery Row] [Cañon de Chelly]
  [canonicity] [can't] [cant] [cantaloupe] [Cantonese]

or, lined up vertically just for better perusal:

 [Canberra]
 [canine]
 [cannery]
 [Cannery Row]
 [Cañon de Chelly]
 [canonicity]
 [can't]
 [cant]
 [cantaloupe]
 [Cantonese]

And that's basically right.  Now, the only peculiarity there is "cant"
versus "can't".  It so happens that when you feed both into your
C<normalize> function, you get "cant".  So when your sort criterion
compares them using C<{ &normalize($a) cmp &normalize($b) }>, it's
performing C<"cant" cmp "cant">, which returns 0, meaning that these
two "sort identically".  But since your use of C<sort> does produce a
list were "can't" either comes before or after "cant", having your
sort criterion return a C<0> means that you I<don't care> which of the
two items comes first in the output.  Under Perl's current
implementation of C<sort>, that means it's basically unpredictable
which ends up coming first.  Personally, I don't want my sort
criterion to ever be unpredictable, so I add something to the sort
criterion that kicks in to avoid returning 0 when comparing different
strings:

 @sorted = sort { &normalize($a) cmp &normalize($b)
                  or $a cmp $b
                } @stuff;

In other words, when C<&normalize($a) cmp &normalize($b)> evaluates to
0, the routine falls thru to returning the value of C<$a cmp $b>.
That at least makes this a totally predictable sort criterion, since 
C<$a cmp $b> never returns 0 for different strings.

However, if you wanted something smarter than just C<$a cmp $b>, you
could use some second function C<normalize2> that could be a bit more
fine-grain than C<normalize>, maybe implementing the idea that, in
case of a tie in C<normalize>, words with apostrophes (like "can't")
should always come after words without them (like "cant"), or that
"Chile" (the country) should always be always after "chile" (the hot
sauce), and so on.  You'd call that second criterion in your sort
criterion as:

 @sorted = sort { &normalize($a) cmp &normalize($b)
                  or &normalize2($a) cmp &normalize2($b)
                } @stuff;

or even, to be really thorough:

 @sorted = sort { &normalize($a) cmp &normalize($b)
                  or &normalize2($a) cmp &normalize2($b)
                  or $a cmp $b
                } @stuff;

Incidentally, falling back on a second comparison as a sort of
"tie-breaker" in a sort criterion is basically what people mean when
they refer to "bi-level sorting".  We'll return to this idea later.

=head2 Locale-based Sorting

The idea of being able to sort things according to the conventions of
other languages is not a new one.  C<perllocale> describes how to take
advantage of locales built into many OSes.  Ideally, you'd set the
locale to the language that sorts the way you want to sort, and then
your calls to C<sort> or C<cmp> do the right thing.  So if I set my
locale to C<fr_CA.ISO8859-1> (meaning "French Canadian, using
Latin-1"), "étude" will sort (correctly) with the "e"'s, instead of
after the "z"'s, which is how it'd be sorted ASCIIbetically.

But locales might not be available on all OSes.  Or if they are, as
C<perllocale> points out: "The available locales, the location in
which they are kept, and the manner in which they are installed, vary
from system to system.  Some systems provide only a few, hard-wired,
locales, and do not allow more to be added; others allow you to add
'canned' locales provided by the system supplier; still others allow
you or the system administrator to define and add arbitrary locales."

In other words, if you want to sort a list of French words according
to French sorting conventions, even if you can get a French locale to
work on one system, and even if that locale's idea of French sort
order is the same as I<your> idea of French sort order, there's still
no guarantee that your locale-based sorting will be able to use the
same locale on someone else's system.

Because of these basic problems with locales, I consider locale-based
sorting (even where available) to be fine for one-shot programs, but
these portability problems make it unacceptable for use in code that
I'd actually want to distribute.

So, in short, much of the code in this article basically duplicates
the functionality of some of the sorting you I<might> be able to get
from locales.  But first off, it's more portable than something
locale-based, and second off, it's more flexible in certain ways.

=head2 Spanish: Cana y Caña

Above, you read how to treat "ñ" as just an alternate form for "n",
which is appropriate for English.  But suppose you actually wanted to
sort a list of Spanish words, according to Spanish sorting
conventions.  In that case, you want to treat "ñ" not as an alternate
form for "n", but instead as a letter I<between> "n" and "o".  In that
case, you'd develop a sort criterion, as above, based on a function
C<normalize>, wherein you'd have to move the letters of the alphabet
around like so:

  tr<abcdefghijklmnñopqrstuvwxyz>   # map from this range...
    <abcdefghijklmnopqrstuvwxyz[>;  #    ...to this range

In other words, you want to keep "a" thru "n" as they are, and then
have "ñ" -- and this means bumping "o" thru "z" down by one character
code to make way for the "ñ".  That "[" is there just because it's the
character after "z" in ASCII.

That's one way to do it.  However, I find it a bit confusing, since
that way makes the Spanish alphabet look like some sort of strange
decoder-ring substitution cypher.  What I prefer is this:

  tr<abcdefghijklmnñopqrstuvwxyz>
    <\x01-\x1B>; # 1B is hex for decimal 27, for the 27 letters

and if I add or remove characters from the alphabet on that first
line, all I have to remember to do is change the C<1B> there to
reflect however many characters are in the alphabet of characters that
I'm starting with.  (Since I'm just going to end up feeding the output
of this C<normalize> function to C<cmp>, it doesn't matter whether I'm
mapping the alphabet to the range C<a-[> or to the range
C<\x01-\x1B>.)

The way you'd work this into your C<normalize> function is:

 sub normalize {
   my $in = $_[0];
   
   $in = lc($in);
   $in =~ tr/Ñ/ñ/; # lc probably didn't catch this
   
   $in =~ tr<abcdefghijklmnñopqrstuvwxyz>
            <\x01-\x1B>; # 1B = 27
   
   return $in;
 }

And then you can test it with:

 [...the code for the normalize function, above...]
 
 @stuff = ("cana", "Cantata", "caña", "cantina", "canoso",
           "cañonero", "capa");
 @sorted = sort { &normalize($a) cmp &normalize($b)
                  or $a cmp $b
                } @stuff;
 print map "[$_] ", @sorted;

When run, this program returns:

 [cana] [canoso] [Cantata] [cantina] [caña] [cañonero] [capa] 

...which is right!  

But change the C<@stuff> to these:

 @stuff = ("cana", "caña", "cánula", "cantina",
           "cantó", "canto", "cantor");

and you (re)discover a problem:

 [cana] [cantina] [canto] [cantor] [cantó] [caña] [cánula] 

And that's quite wrong.  Spanish, you see, uses acute accents (like
over the "o" in "cantó") -- but "ó" isn't considered a separate letter
from "o".  This is the same problem you faced in the English data set
from the start of the article, except that here it's not "n" and "ñ"
we want to treat as alternates, but "o" and "ó" -- and, while we're at
it, "á/a", "é/e", "í/i", "ú/u", and the somewhat rare "ü".  So you
change the C<normalize> function:

 sub normalize {
   my $in = $_[0];
   
   $in = lc($in);
   $in =~ tr/Ñ/ñ/; # lc probably didn't catch this
   
   $in =~ tr<áéíóúüÁÉÍÓÚÜ>  # lc probably failed to turn É to é, etc
            <aeiouuaeiouu>;
   
   $in =~ tr<abcdefghijklmnñopqrstuvwxyz>
            <\x01-\x1B>; # 1B = 27
   
   return $in;
 }

Run this code, and you get:

 [cana] [cantina] [canto] [cantó] [cantor] [cánula] [caña] 

Which, ta-daa, is The Right Thing.

=head2 Spanish: Chorizo, Chimichangas, Chicharrones y Churros

So you've now got a sort criterion and an associated function
(C<normalize>) that together implement Spanish sorting conventions as
far as treatment of ñ, Ñ, and the accented vowels.  But recall from
the start of this article that Spanish has a letter "ch" between "c"
and "d".

So far we've been massaging all the data using character-to-character
substitution (using the C<tr> operator), so that C<cmp>'s ASCIIbetical
character-by-character comparison would do what we want.  However,
that all assumes that sorting is about I<single> characters.  But
since "ch" consists of two ASCII characters, it won't fit well into
our plan of using normal C<cmp>.  And "ch" is not entirely alone:
Spanish has one other two-character letter, the double-ell "ll", as in
"llama", "quesadilla", etc.

Now, you could break down and write a function that'd basically do the
same work as Perl's builtin C<cmp> but would take account of whatever
character-clusters like "ch" that you want to consider single
elements.  However, that would be I<very> inefficient compared to the
speed of Perl's builtin C<cmp>.  A more efficient way of doing it
consists of simply turning the clusters into single characters, so
that C<cmp> can be made to work right on them.  So if you simply turn
all occurrences of "ch" to, say, "¢" (which is presumably not to be
found in any of the items we're sorting), you can pretend that
"chimichanga" is really "¢imi¢anga" -- then you can treat "¢" as just
another strange letter, like "ñ" is.  Similarly you could turn "ll" to
"£", say.

This would look like:

 sub normalize {
   my $in = $_[0];
   
   $in = lc($in);

   $in =~ s/ch/¢/g;   # chimichanga => ¢imi¢anga
   $in =~ s/ll/£/g;   # llama => £ama

   $in =~ tr/Ñ/ñ/;
   $in =~ tr<áéíóúüÁÉÍÓÚÜ>
            <aeiouuaeiouu>;
   
   $in =~ tr<abc¢defghijkl£mnñopqrstuvwxyz>
            <\x01-\x1D>;
              # 1D = 29, for the 29 letters we now have
   
   return $in;
 }

And then you can test it with:

 [...the code for the normalize function, above...]
 
 @stuff = ("enchufe", "Enciclopedia de México", "endibia",
           "enchilada",  "encogido", "encanto");

 @sorted = sort { &normalize($a) cmp &normalize($b)
                  or $a cmp $b
                } @stuff;
 print map "[$_] ", @sorted;

And that outputs:

 [encanto] [Enciclopedia de México] [encogido] [enchilada] [enchufe] [endibia] 

which is correct.  Your C<normalize> function now correctly implements
Spanish-style sorting.

=head2 Bi-Level Sorting to the Rescue

There is a problem with our approach so far -- and it might not even
be a real problem for you, depending on what purpose you're putting
this sorted data to.

You may recall the second section of this article, where I pointed out
the problem of controlling the sorting of items that a given
C<normalize> function returns the same value for, like "cant" and
"can't" for an English C<normalize>, or "canto" and "cantó" for a
Spanish C<normalize>, or "Chile" and "chile" with either.

So far we've been sort of cheating and just saying:

 @sorted = sort { &normalize($a) cmp &normalize($b)
                  or $a cmp $b
                } @stuff;

but this has worked because the last bit, C<$a cmp $b>, just happens
to have correctly (so far) resolved ties that arise with
C<&normalize($a) cmp &normalize($b)>.  However, this has mostly been
just dumb luck.  And if, like many dictionaries, you want "Chile" to
come right I<after> "chile", then plain old C<cmp> as a tiebreaker
does the wrong thing.  So we do need a C<normalize2> function, as a
better tiebreaker, for "bi-level sorting":

 @sorted = sort { &normalize($a) cmp &normalize($b)
                  or &normalize2($a) cmp &normalize2($b)
                  or $a cmp $b
                } @stuff;

So let's actually implement a C<normalize2> function that'll correctly
break ties that can arise with C<normalize>.  Let's continue with
Spanish; and let's suppose, for the sake of argument, that given a tie
between variants of the letter "e", the order they should come out in
is:

  e E é É

Now, you could use the same sort of code as in C<normalize>, this
time implementing an alphabet consisting of

  a A á Á b B c C ch Ch CH d D e E é É ... and so on

However, consider that C<normalize2> is I<just> a tie-breaker -- it
doesn't need to distinguish "a" from "b" -- it'd never be called in a
case where an "a" in one position would need to be compared to a "b"
in another, since that would I<not> have resulted in a tie between
C<normalize>'d strings.  In other words, all C<normalize2> needs to be
able to do is distinguish letters that C<normalize> obliterated the
difference between -- letters in the same "family".  In other words
(grouping these letters into families) you need only map...

  from these...  a A á Á   b B   c C   ch Ch CH   d D   e E é É  ...
        onto     | | | |   | |   | |   |  |  |    | |   | | | |
      ...these.  1 2 3 4   1 2   1 2   1  2  3    1 2   1 2 3 4  ...

And you can implement that this way:

 sub normalize2 {
   my $in = $_[0];
   
   # digraph things...
   $in =~ s/ch/¢/g;   # chimichanga => ¢imi¢anga
   $in =~ s/Ch/½/g;   # Chimichanga => ½imi¢anga
   $in =~ s/CH/¼/g;   # CHIMICHANGA => ¼IMI¼ANGA
   $in =~ s/ll/£/g;   # llama => £ama
   $in =~ s/Ll/§/g;   # Llama => §ama
   $in =~ s/LL/¶/g;   # LLAMA => ¶AMA

   # now the big whammy...
   $in =~
    tr<aAbBcC¢½¼dDeEéÉfFgGhHiIíÍjJkKlL£§¶mMnNoOóÓpPqQrRsStTuUúÚüÚvVwWxXyYzZ>
      <12121212312123412121212341212121231212123412121212121234561212121212>;

   return $in;
 }

To get a better feeling for the output of this function, consider:

 &normalize2("chile")             is "1111"
 &normalize2("Chile")             is "2111"

 &normalize2("CHILES RELLENOS!")  is "32222 2232222!"

 &normalize2("cantó")             is "11113"
 &normalize2("Canto")             is "21111"
 &normalize2("CANTÓ")             is "22224"

So consider what happens when sorting "chile" and "Chile"; the sort
criterion considers the expression

   &normalize("chile") cmp &normalize("Chile")
   or &normalize2("chile") cmp &normalize2("Chile")
   or "chile" cmp "Chile"

...which simplifies to...

  "chile" cmp "chile"     # first subexpression
  or "1111" cmp "2111"    # second subexpression
  or "chile" cmp "Chile"  # last subexpression

The first C<cmp> subexpression evaluates to 0, falling thru to the
expression consisting of the two values from C<normalize2>.  Between
them, "1111" (from "chile") comes first ASCIIbetically, so C<"1111"
cmp "2111"> returns -1, to signal that "chile" should come before
"Chile".  (Perl never gets as far as evaluating the last
subexpression, C<"chile" cmp "Chile">.)

=head2 English: Résumé and Resume

Now, this whole business of bi-level sorting may all seem very
abstract and, well, foreign, if pretty much the only thing you've ever
sorted is English.  But consider if you're sorting this list of
I<English> words:

  rot   résumé   resume   rabble   return

and you want it to sort correctly, i.e., like:

  rabble   resume   résumé   return   rot

In other words, you want "resume" to sort always before "résumé".  If
you use a one-level sort, of the form:

 @sorted = sort { &normalize($a) cmp &normalize($b) } @stuff;

...then you have the choice I<either> of treating "e" and "é" as the
same letter (as with "ñ" and "n" in our Canberra/canine/cannery
example), I<or> of treating "é" as a letter after "e".  If you treat
"e" and "é" as the same letter, then the ordering of "resume" and
"résumé" would be unpredictable, since your C<normalize> will return
the same value for both of them.

But if you treat "é" as a letter after "e" (and that seems to be very
many people's first guess at a solution to this, I find), that means
that "é" will be a letter between "e" and "f", and all the "ré-" words
will come after all the "re-" words -- so that your list will sort as:

  rabble   resume   return   résumé   rot

...which is wrong.  So if you want this to sort right, you need at
least two levels in your sorting.  And since I've yet to see a case
where I<more> than two levels of sorting were needed, that pretty much
leaves you with bi-level sorting.

So, like it or not, the only way to get really correct sorting in
English is to use bi-level sorting.  And this is not just a problem
with English having foreign words like "résumé" -- the same problem
arises with wanting to sort "Bath" and "bath", say.

=head2 Optimizing with Memoization

As Perl evalues a sort criterion while sorting a list, it will ask
that criterion to compare several of the items against eachother.

To see it at work, you can run:

 @stuff = sort { print "$a & $b ; "; $a cmp $b } qw(A B C D E F);

and you'll see something along the lines of:

 A & D ; B & D ; C & D ; D & F ; D & E ; E & F ; A & B ; B & C ; 

If your criterion, like most of the ones in this article, will have to
call C<normal> (and maybe C<normal2>) for whatever items they're asked
to compare, then you can see that you're going to be calling
C<&normalize("D")> several times.  There's no point in I<re>-computing
it, since C<&normalize("D")> always gives the same answer, so all the
calls after the first is just wasted effort.  To make your sort
criterion more efficient, you can cache the results of the function
calls.  Caching the results of a function like this is commonly called
"memoization".

In other words, instead of evaluating the expression
C<&normalize($a)>, you look to see if it's in your memoization cache,
and return it if it is, and otherwise actually compute the value and
stow it in the cache for next time.  So wherever you have:

 &function($INPUT)

you would use

 exists($cache{$INPUT}) ? $cache{$INPUT}
                        : ($cache{$INPUT} = &function($INPUT))

Worked into our bi-level sort criterion, this would look like:

 {
   my(%cache, %cache2);
   @sorted = sort {
     ( exists($cache{$a}) ? $cache{$a} : ($cache{$a} = &normalize($a)) )
       cmp
     ( exists($cache{$b}) ? $cache{$b} : ($cache{$b} = &normalize($b)) )
      or
     ( exists($cache2{$a}) ? $cache2{$a} : ($cache2{$a} = &normalize2($a)) )
       cmp
     ( exists($cache2{$b}) ? $cache2{$b} : ($cache2{$b} = &normalize2($b)) )
      or
       $a cmp $b
    } @stuff;
 }

It's not pretty, but it does avoid having to needlessly recompute
C<normalize(ITEM)> several times for each item being sorted.

And the only thing better than correct sorting is I<faster> correct
sorting.

=head2 Sorting it All Out

Now, I've heard that in the years since I took my last Spanish class,
the Spanish Academy has decided to stop giving "ch" special treatment,
so that "churro" will be, they decree, under "C", somewhere between
"cesto" and "cicatriz".  However, I don't know to what degree this has
been accepted by the average Spanish speaker, much less the people who
make the phone books, dictionaries, etc., in all the Spanish-speaking
countries.

But even if everyone's idea of Spanish sorting conventions suddenly
gets simpler (i.e., by doing away with those "ch" and "ll" digraphs),
it'll still need bi-level sorting, just like English.

In fact, because implementing the kinds of bi-level sorting presented
in this article are so common a task, I've written a module,
Sort::ArbBiLex (for "arbitrary bi-level lexicographic" sorting) that
allows you to specify a sort order (possibly including multi-character
letters like Spanish "ch") for which it will build you a function that
sorts that way.  The internals of Sort::ArbBiLex (available in CPAN)
are frightening, but they're just an elaborate version of the
techniques discussed in this article, adapted to the kinds of sorting
found in most languages.

 __END__

Sean M. Burke (C<sburke@netadventure.net>) likes green chile,
churros, has a pet llama named Buñuel.