Time-stamp: "1999-04-01 01:12:40 MST" [To the editors: do you think I should have a "Optimization: The Schwartzian Transform" section after the Memoization section? I think it might be may too long to stick in here. What do you think? Maybe it should be an article of its own, but I'm not sure it's quite that long. Altho maybe if it were an article demonstrating the use of the debugger and its x EXPR command, which does that neat Datadumper-like indented printing of ref-rich structures... But then I'd have to actually learn to use the debugger. Oy. Tell me what you think.] =head1 International Sorting with Perl's C -- Sean M. Burke I remember one time in my first semester of Spanish class in high school, I went to look up an unfamiliar word, "chaleco", in a Spanish-English dictionary. I looked under "C", and found that the dictionary went right from "cetro" to "cía". Someone had expurgated all the "ch" words! I had a brief nightmarish vision of a world without chorizo, chimichangas, chicharrones, or churros. After some frantic page-turning, I discovered that the "ch" words were in a separate section, "Ch", between "C" and "D". I asked the teacher about this, and he explained that it was normal practice for Spanish alphabetical order to consider "Ch" a letter after "C". But it seemed ludicrous to me -- two letters that counted as one. C I thought. I I later learned that I language has its own particular idea of what "alphabetical order" means; the fact that English's conception of it seems so "normal" is partly because English doesn't use any accents and partly just because of accidents of history. But many other languages use accented characters that have to be sorted somehow in with the 26 letters of the "normal" A-thru-Z alphabet. And with other languages, some combinations of characters, like the "ch" in Spanish, count as letters on their own. But in almost every case, if you want to sort according to the conventions of a particular language, the default behavior of Perl's C won't sort that way. This article is about how to get Perl to sort according to the conventions of whatever language you have in mind -- even if it's English! =head2 Default C Versus "Normal" English Sorting Let's say you want to sort a list of words (well, word-or-phrases) in what you think of as normal English alphabetical order. So you try using normal C: @stuff = ("canine", "cantaloupe", "cant", # as in an underworld jargon "Canberra", "can't", "Cantonese", "cannery", "Cannery Row", "canonicity", "Cañon de Chelly" # In north-eastern Arizona, # also spelled "Canyon de Chelly" and # "Cañon de Chelle" ); @sorted = sort @stuff; # the sorting happens here print map "[$_] ", @sorted; That prints: [Canberra] [Cannery Row] [Cantonese] [Cañon de Chelly] [can't] [canine] [cannery] [canonicity] [cant] [cantaloupe] Whoa, all the capitals are sorting first. That's because C's default behavior (i.e., what you get without a "sort criterion" or without saying "use locale") is ASCIIbetical sorting -- where the sorting is based on ASCII order. Since "C" comes before "c" in ASCII, all the "C" items in C<@stuff> (like "Cantonese") get sorted before all the "c" items (like "cantaloupe"). So you happen to remember an idiom for case-insensitive sorting, and you change the line that sets C<@sorted> to: @sorted = sort { lc($a) cmp lc($b) } @stuff; and then you rerun the code. It prints: [can't] [Canberra] [canine] [cannery] [Cannery Row] [canonicity] [cant] [cantaloupe] [Cantonese] [Cañon de Chelly] This is closer to being in alphabetic order. What you actually consider proper sorting would look like this: [Canberra] [canine] [cannery] [Cannery Row] [Cañon de Chelly] [canonicity] [cant] [can't] [cantaloupe] [Cantonese] So what's out of place so far is "can't" and "Cañon de Chelly". Now, "can't" is out of place because C plus your criterion C<{ lc($a) cmp lc($b) }> considers "can't" a five character string that does indeed sort before anything else in C<@stuff>. Consider this code: print( "can't" cmp "canal" ); That prints "-1", meaning that C says that "can't" comes before "canal". This is because C is doing simple ASCIIbetical comparison, and when it compares "can't" to "canal" it gets as far as comparing "can'" to "cana", at which it sees that the apostrophe character in the first one "comes before" the "a" character in the second one -- because apostrophe is ASCII 39, and "a" is ASCII 97. Now, this is also why "Cañon de Chelly" is coming last: because "ñ" is a character after "n". For sake of argument, I'll assume that you are, like me, using Latin-1 (as opposed to UTF8, notably), so I can say that "ñ" is a single byte, code 241. (If you're using MacPerl and therefore probably doing this in MacASCII, it's a different code, but it's still one byte, with a value over 127, so my point stands.) So what you want is to sort this list according to your idea of English alphabetical order -- ignoring apostrophes, treating "ñ" and "n" as the same letter, and of course ignoring case. What you need is a function we can call like: @sorted = sort { &normalize($a) cmp &normalize($b) } @stuff; where your C function Cs things, turns "ñ"s to "n"s, and removes apostrophes (in no particular order). That function could consist of: sub normalize { my $in = $_[0]; $in =~ tr/Ññ/Nn/; $in =~ tr/'//d; # d for delete return lc($in); } Paste that into our original code, and run it, and it'll say: [Canberra] [canine] [cannery] [Cannery Row] [Cañon de Chelly] [canonicity] [can't] [cant] [cantaloupe] [Cantonese] or, lined up vertically just for better perusal: [Canberra] [canine] [cannery] [Cannery Row] [Cañon de Chelly] [canonicity] [can't] [cant] [cantaloupe] [Cantonese] And that's basically right. Now, the only peculiarity there is "cant" versus "can't". It so happens that when you feed both into your C function, you get "cant". So when your sort criterion compares them using C<{ &normalize($a) cmp &normalize($b) }>, it's performing C<"cant" cmp "cant">, which returns 0, meaning that these two "sort identically". But since your use of C does produce a list were "can't" either comes before or after "cant", having your sort criterion return a C<0> means that you I which of the two items comes first in the output. Under Perl's current implementation of C, that means it's basically unpredictable which ends up coming first. Personally, I don't want my sort criterion to ever be unpredictable, so I add something to the sort criterion that kicks in to avoid returning 0 when comparing different strings: @sorted = sort { &normalize($a) cmp &normalize($b) or $a cmp $b } @stuff; In other words, when C<&normalize($a) cmp &normalize($b)> evaluates to 0, the routine falls thru to returning the value of C<$a cmp $b>. That at least makes this a totally predictable sort criterion, since C<$a cmp $b> never returns 0 for different strings. However, if you wanted something smarter than just C<$a cmp $b>, you could use some second function C that could be a bit more fine-grain than C, maybe implementing the idea that, in case of a tie in C, words with apostrophes (like "can't") should always come after words without them (like "cant"), or that "Chile" (the country) should always be always after "chile" (the hot sauce), and so on. You'd call that second criterion in your sort criterion as: @sorted = sort { &normalize($a) cmp &normalize($b) or &normalize2($a) cmp &normalize2($b) } @stuff; or even, to be really thorough: @sorted = sort { &normalize($a) cmp &normalize($b) or &normalize2($a) cmp &normalize2($b) or $a cmp $b } @stuff; Incidentally, falling back on a second comparison as a sort of "tie-breaker" in a sort criterion is basically what people mean when they refer to "bi-level sorting". We'll return to this idea later. =head2 Locale-based Sorting The idea of being able to sort things according to the conventions of other languages is not a new one. C describes how to take advantage of locales built into many OSes. Ideally, you'd set the locale to the language that sorts the way you want to sort, and then your calls to C or C do the right thing. So if I set my locale to C (meaning "French Canadian, using Latin-1"), "étude" will sort (correctly) with the "e"'s, instead of after the "z"'s, which is how it'd be sorted ASCIIbetically. But locales might not be available on all OSes. Or if they are, as C points out: "The available locales, the location in which they are kept, and the manner in which they are installed, vary from system to system. Some systems provide only a few, hard-wired, locales, and do not allow more to be added; others allow you to add 'canned' locales provided by the system supplier; still others allow you or the system administrator to define and add arbitrary locales." In other words, if you want to sort a list of French words according to French sorting conventions, even if you can get a French locale to work on one system, and even if that locale's idea of French sort order is the same as I idea of French sort order, there's still no guarantee that your locale-based sorting will be able to use the same locale on someone else's system. Because of these basic problems with locales, I consider locale-based sorting (even where available) to be fine for one-shot programs, but these portability problems make it unacceptable for use in code that I'd actually want to distribute. So, in short, much of the code in this article basically duplicates the functionality of some of the sorting you I be able to get from locales. But first off, it's more portable than something locale-based, and second off, it's more flexible in certain ways. =head2 Spanish: Cana y Caña Above, you read how to treat "ñ" as just an alternate form for "n", which is appropriate for English. But suppose you actually wanted to sort a list of Spanish words, according to Spanish sorting conventions. In that case, you want to treat "ñ" not as an alternate form for "n", but instead as a letter I "n" and "o". In that case, you'd develop a sort criterion, as above, based on a function C, wherein you'd have to move the letters of the alphabet around like so: tr # map from this range... ; # ...to this range In other words, you want to keep "a" thru "n" as they are, and then have "ñ" -- and this means bumping "o" thru "z" down by one character code to make way for the "ñ". That "[" is there just because it's the character after "z" in ASCII. That's one way to do it. However, I find it a bit confusing, since that way makes the Spanish alphabet look like some sort of strange decoder-ring substitution cypher. What I prefer is this: tr <\x01-\x1B>; # 1B is hex for decimal 27, for the 27 letters and if I add or remove characters from the alphabet on that first line, all I have to remember to do is change the C<1B> there to reflect however many characters are in the alphabet of characters that I'm starting with. (Since I'm just going to end up feeding the output of this C function to C, it doesn't matter whether I'm mapping the alphabet to the range C or to the range C<\x01-\x1B>.) The way you'd work this into your C function is: sub normalize { my $in = $_[0]; $in = lc($in); $in =~ tr/Ñ/ñ/; # lc probably didn't catch this $in =~ tr <\x01-\x1B>; # 1B = 27 return $in; } And then you can test it with: [...the code for the normalize function, above...] @stuff = ("cana", "Cantata", "caña", "cantina", "canoso", "cañonero", "capa"); @sorted = sort { &normalize($a) cmp &normalize($b) or $a cmp $b } @stuff; print map "[$_] ", @sorted; When run, this program returns: [cana] [canoso] [Cantata] [cantina] [caña] [cañonero] [capa] ...which is right! But change the C<@stuff> to these: @stuff = ("cana", "caña", "cánula", "cantina", "cantó", "canto", "cantor"); and you (re)discover a problem: [cana] [cantina] [canto] [cantor] [cantó] [caña] [cánula] And that's quite wrong. Spanish, you see, uses acute accents (like over the "o" in "cantó") -- but "ó" isn't considered a separate letter from "o". This is the same problem you faced in the English data set from the start of the article, except that here it's not "n" and "ñ" we want to treat as alternates, but "o" and "ó" -- and, while we're at it, "á/a", "é/e", "í/i", "ú/u", and the somewhat rare "ü". So you change the C function: sub normalize { my $in = $_[0]; $in = lc($in); $in =~ tr/Ñ/ñ/; # lc probably didn't catch this $in =~ tr<áéíóúüÁÉÍÓÚÜ> # lc probably failed to turn É to é, etc ; $in =~ tr <\x01-\x1B>; # 1B = 27 return $in; } Run this code, and you get: [cana] [cantina] [canto] [cantó] [cantor] [cánula] [caña] Which, ta-daa, is The Right Thing. =head2 Spanish: Chorizo, Chimichangas, Chicharrones y Churros So you've now got a sort criterion and an associated function (C) that together implement Spanish sorting conventions as far as treatment of ñ, Ñ, and the accented vowels. But recall from the start of this article that Spanish has a letter "ch" between "c" and "d". So far we've been massaging all the data using character-to-character substitution (using the C operator), so that C's ASCIIbetical character-by-character comparison would do what we want. However, that all assumes that sorting is about I characters. But since "ch" consists of two ASCII characters, it won't fit well into our plan of using normal C. And "ch" is not entirely alone: Spanish has one other two-character letter, the double-ell "ll", as in "llama", "quesadilla", etc. Now, you could break down and write a function that'd basically do the same work as Perl's builtin C but would take account of whatever character-clusters like "ch" that you want to consider single elements. However, that would be I inefficient compared to the speed of Perl's builtin C. A more efficient way of doing it consists of simply turning the clusters into single characters, so that C can be made to work right on them. So if you simply turn all occurrences of "ch" to, say, "¢" (which is presumably not to be found in any of the items we're sorting), you can pretend that "chimichanga" is really "¢imi¢anga" -- then you can treat "¢" as just another strange letter, like "ñ" is. Similarly you could turn "ll" to "£", say. This would look like: sub normalize { my $in = $_[0]; $in = lc($in); $in =~ s/ch/¢/g; # chimichanga => ¢imi¢anga $in =~ s/ll/£/g; # llama => £ama $in =~ tr/Ñ/ñ/; $in =~ tr<áéíóúüÁÉÍÓÚÜ> ; $in =~ tr <\x01-\x1D>; # 1D = 29, for the 29 letters we now have return $in; } And then you can test it with: [...the code for the normalize function, above...] @stuff = ("enchufe", "Enciclopedia de México", "endibia", "enchilada", "encogido", "encanto"); @sorted = sort { &normalize($a) cmp &normalize($b) or $a cmp $b } @stuff; print map "[$_] ", @sorted; And that outputs: [encanto] [Enciclopedia de México] [encogido] [enchilada] [enchufe] [endibia] which is correct. Your C function now correctly implements Spanish-style sorting. =head2 Bi-Level Sorting to the Rescue There is a problem with our approach so far -- and it might not even be a real problem for you, depending on what purpose you're putting this sorted data to. You may recall the second section of this article, where I pointed out the problem of controlling the sorting of items that a given C function returns the same value for, like "cant" and "can't" for an English C, or "canto" and "cantó" for a Spanish C, or "Chile" and "chile" with either. So far we've been sort of cheating and just saying: @sorted = sort { &normalize($a) cmp &normalize($b) or $a cmp $b } @stuff; but this has worked because the last bit, C<$a cmp $b>, just happens to have correctly (so far) resolved ties that arise with C<&normalize($a) cmp &normalize($b)>. However, this has mostly been just dumb luck. And if, like many dictionaries, you want "Chile" to come right I "chile", then plain old C as a tiebreaker does the wrong thing. So we do need a C function, as a better tiebreaker, for "bi-level sorting": @sorted = sort { &normalize($a) cmp &normalize($b) or &normalize2($a) cmp &normalize2($b) or $a cmp $b } @stuff; So let's actually implement a C function that'll correctly break ties that can arise with C. Let's continue with Spanish; and let's suppose, for the sake of argument, that given a tie between variants of the letter "e", the order they should come out in is: e E é É Now, you could use the same sort of code as in C, this time implementing an alphabet consisting of a A á Á b B c C ch Ch CH d D e E é É ... and so on However, consider that C is I a tie-breaker -- it doesn't need to distinguish "a" from "b" -- it'd never be called in a case where an "a" in one position would need to be compared to a "b" in another, since that would I have resulted in a tie between C'd strings. In other words, all C needs to be able to do is distinguish letters that C obliterated the difference between -- letters in the same "family". In other words (grouping these letters into families) you need only map... from these... a A á Á b B c C ch Ch CH d D e E é É ... onto | | | | | | | | | | | | | | | | | ...these. 1 2 3 4 1 2 1 2 1 2 3 1 2 1 2 3 4 ... And you can implement that this way: sub normalize2 { my $in = $_[0]; # digraph things... $in =~ s/ch/¢/g; # chimichanga => ¢imi¢anga $in =~ s/Ch/½/g; # Chimichanga => ½imi¢anga $in =~ s/CH/¼/g; # CHIMICHANGA => ¼IMI¼ANGA $in =~ s/ll/£/g; # llama => £ama $in =~ s/Ll/§/g; # Llama => §ama $in =~ s/LL/¶/g; # LLAMA => ¶AMA # now the big whammy... $in =~ tr <12121212312123412121212341212121231212123412121212121234561212121212>; return $in; } To get a better feeling for the output of this function, consider: &normalize2("chile") is "1111" &normalize2("Chile") is "2111" &normalize2("CHILES RELLENOS!") is "32222 2232222!" &normalize2("cantó") is "11113" &normalize2("Canto") is "21111" &normalize2("CANTÓ") is "22224" So consider what happens when sorting "chile" and "Chile"; the sort criterion considers the expression &normalize("chile") cmp &normalize("Chile") or &normalize2("chile") cmp &normalize2("Chile") or "chile" cmp "Chile" ...which simplifies to... "chile" cmp "chile" # first subexpression or "1111" cmp "2111" # second subexpression or "chile" cmp "Chile" # last subexpression The first C subexpression evaluates to 0, falling thru to the expression consisting of the two values from C. Between them, "1111" (from "chile") comes first ASCIIbetically, so C<"1111" cmp "2111"> returns -1, to signal that "chile" should come before "Chile". (Perl never gets as far as evaluating the last subexpression, C<"chile" cmp "Chile">.) =head2 English: Résumé and Resume Now, this whole business of bi-level sorting may all seem very abstract and, well, foreign, if pretty much the only thing you've ever sorted is English. But consider if you're sorting this list of I words: rot résumé resume rabble return and you want it to sort correctly, i.e., like: rabble resume résumé return rot In other words, you want "resume" to sort always before "résumé". If you use a one-level sort, of the form: @sorted = sort { &normalize($a) cmp &normalize($b) } @stuff; ...then you have the choice I of treating "e" and "é" as the same letter (as with "ñ" and "n" in our Canberra/canine/cannery example), I of treating "é" as a letter after "e". If you treat "e" and "é" as the same letter, then the ordering of "resume" and "résumé" would be unpredictable, since your C will return the same value for both of them. But if you treat "é" as a letter after "e" (and that seems to be very many people's first guess at a solution to this, I find), that means that "é" will be a letter between "e" and "f", and all the "ré-" words will come after all the "re-" words -- so that your list will sort as: rabble resume return résumé rot ...which is wrong. So if you want this to sort right, you need at least two levels in your sorting. And since I've yet to see a case where I than two levels of sorting were needed, that pretty much leaves you with bi-level sorting. So, like it or not, the only way to get really correct sorting in English is to use bi-level sorting. And this is not just a problem with English having foreign words like "résumé" -- the same problem arises with wanting to sort "Bath" and "bath", say. =head2 Optimizing with Memoization As Perl evalues a sort criterion while sorting a list, it will ask that criterion to compare several of the items against eachother. To see it at work, you can run: @stuff = sort { print "$a & $b ; "; $a cmp $b } qw(A B C D E F); and you'll see something along the lines of: A & D ; B & D ; C & D ; D & F ; D & E ; E & F ; A & B ; B & C ; If your criterion, like most of the ones in this article, will have to call C (and maybe C) for whatever items they're asked to compare, then you can see that you're going to be calling C<&normalize("D")> several times. There's no point in I-computing it, since C<&normalize("D")> always gives the same answer, so all the calls after the first is just wasted effort. To make your sort criterion more efficient, you can cache the results of the function calls. Caching the results of a function like this is commonly called "memoization". In other words, instead of evaluating the expression C<&normalize($a)>, you look to see if it's in your memoization cache, and return it if it is, and otherwise actually compute the value and stow it in the cache for next time. So wherever you have: &function($INPUT) you would use exists($cache{$INPUT}) ? $cache{$INPUT} : ($cache{$INPUT} = &function($INPUT)) Worked into our bi-level sort criterion, this would look like: { my(%cache, %cache2); @sorted = sort { ( exists($cache{$a}) ? $cache{$a} : ($cache{$a} = &normalize($a)) ) cmp ( exists($cache{$b}) ? $cache{$b} : ($cache{$b} = &normalize($b)) ) or ( exists($cache2{$a}) ? $cache2{$a} : ($cache2{$a} = &normalize2($a)) ) cmp ( exists($cache2{$b}) ? $cache2{$b} : ($cache2{$b} = &normalize2($b)) ) or $a cmp $b } @stuff; } It's not pretty, but it does avoid having to needlessly recompute C several times for each item being sorted. And the only thing better than correct sorting is I correct sorting. =head2 Sorting it All Out Now, I've heard that in the years since I took my last Spanish class, the Spanish Academy has decided to stop giving "ch" special treatment, so that "churro" will be, they decree, under "C", somewhere between "cesto" and "cicatriz". However, I don't know to what degree this has been accepted by the average Spanish speaker, much less the people who make the phone books, dictionaries, etc., in all the Spanish-speaking countries. But even if everyone's idea of Spanish sorting conventions suddenly gets simpler (i.e., by doing away with those "ch" and "ll" digraphs), it'll still need bi-level sorting, just like English. In fact, because implementing the kinds of bi-level sorting presented in this article are so common a task, I've written a module, Sort::ArbBiLex (for "arbitrary bi-level lexicographic" sorting) that allows you to specify a sort order (possibly including multi-character letters like Spanish "ch") for which it will build you a function that sorts that way. The internals of Sort::ArbBiLex (available in CPAN) are frightening, but they're just an elaborate version of the techniques discussed in this article, adapted to the kinds of sorting found in most languages. __END__ Sean M. Burke (C) likes green chile, churros, has a pet llama named Buñuel.