# Time-stamp: "1999-10-08 01:11:47 MDT sburke@stonehenge.netadventure.net"

=head1 Searching for Rhymes with Perl

=head2 Sean M. Burke

  La poésie doit être faite par tous.
  Poetry is for everyone to make.
    -- Lautréamont (Isidore Ducasse, 1846-1870)

Wherever I go, people always come up to me and say "Sean, you gotta help me -- I need to find a three-syllable word that rhymes with 'toad'." And my answer is always the same; I always say "well, we're going to have to pull out the Perl for this one!" Because, while articles constantly demonstrate that Perl is good at everything from designing sundials to peppering IRC with Eliza bots, one thing that it's especially good at is making short little programs for searching text. And that's what this article is about -- how to search text (specifically wordlists or pronunciation databases) for rhymes of various kinds.

=head2 Where To Look

If this article were about rhyming in Spanish, Italian, or Finnish, it'd be a whole lot simpler! Because, for the most part, the way something is spelled in these languages tells you pretty well how to pronounce it; ending with the same letters may not be exactly the same thing as rhyming, but often you can start with the spelling and apply some trivial string replacement operations to get a phonetic form that can be searched for the presence of a rhyme. This can work even with French, where (for the most part) spelling tells you pronunciation, even though the pronunciation won't tell you the spelling.

However, English sure isn't that kind of language -- not only does the English pronunciation of a word I<not> tell you how to spell it, its spelling doesn't tell you how to pronounce it. But luckily, lexicons exist that are basically simple databases, mapping the normal written form of a word to some representation of its pronunciation.

One of my favorite lexicons (partly because it's I<free>) is Moby Pronunciator. It consists of about 177,000 entries, one word to a line, that look like this:

  ...
  accipitrine /&/k's/I/p/I/tr/I/n
  Accius '/&/k/S//i//@/s
  acclaim /@/'kl/eI/m
  acclamation ,/&/kl/@/'m/eI//S//@/n
  acclamation_medal ,/&/kl/@/'m/eI//S//@/n_'m/E/d/-/l
  acclamatory /@/'kl/&/m/@/,t/oU/r/i/
  acclimate /@/'kl/aI/m/I/t
  acclimation ,/&/kl/@/'m/eI//S//@/n
  acclimation_fever ,/&/kl/@/'m/eI//S//@/n_'f/i/v/@/r
  acclimatise /@/'kl/aI/m/@/,t/aI/z
  acclimatize /@/'kl/aI/m/@/,t/aI/z
  acclivity /@/'kl/I/v/I/t/i/
  ...

Without bothering just now with the values of these symbols, you can see that (as the README will tell you) the format of each line is the word (or underscore-separated multiword phrase, like "acclamation_medal"), then a space, then the phonetic notation.

What the slashes mean (and why there isn't one between the /k/ and /l/ in "acclaim", etc.) is something I'm unsure of. But I am sure that these slashes are annoying, and get in the way when I'm actually trying to search, since I have to always remember to stick them in my search patterns, always worrying that I stuck in one too many. And the same goes for the commas and apostrophes, which indicate stress, since when I'm looking for a rhyme, I may not care about stress.

=head2 Preparing the Data

So the first thing to do, whether it's for the Moby Pronunciator wordlist in particular, or for any other wordlist you choose to use instead, is to strip out the parts you don't want and to take what's left and format it the way you like. Here we can do that just by deleting certain tokens in the pronunciation part:

  slashes (used to separate phonemes?)
  spaces and underscores (used to separate words)
  apostrophes (used to precede syllables with primary stress)
  commas (used to precede syllables with secondary stress)

Since these tokens are all single characters, we can delete them by just applying a C<tr> operator, with the C<d> switch ("d" for delete), to slashes, spaces, underscores, commas, and apostrophes:

  tr/\/ _,'//d;

Personally, I find it disconcerting to have the backslash-escaped slash in there, so I tend to use different delimiters, like matching wedges, for C<tr> to do just the same thing:

  tr</ _,'><>d;

Either way, you can build this into a program that reads the Moby Pronunciator database:

  # 'moby_pron.txt' is a placeholder; use whatever
  #  your copy of the wordlist file is called
  open(IN, '<moby_pron.txt') or die $!;
  while(<IN>) {
    chomp;
    ($word, $pron) = split(' ', $_);
    $pron =~ tr</ _,'><>d;
    ... then do something with $word and $pron ...
  }

Now, to get to searching this database for rhymes (or any other phonetic information). There's two ways to go about it:

* use the code above, and once you've modified C<$pron>, search it for a pattern; or:

* write C<$word> and the modified C<$pron> to a file, and then use grep on that file.

The benefit of the former is simplicity, but the benefit of the latter is efficiency -- no need to constantly chomp, split, and C<tr> for each line. Now, normally I say that program (as opposed to programmer) efficiency is overvalued in programming; but in this case, since the Moby wordlist is so very large, and since that makes the first approach so wasteful, I say the second approach is the one to take.

So we can save each line's C<$word> and C<$pron> values to a file called "mpron.dat", like so:

  # 'moby_pron.txt' is a placeholder input filename
  open(IN,  '<moby_pron.txt') or die $!;
  open(OUT, '>mpron.dat') or die $!;
  while(<IN>) {
    chomp;
    ($word, $pron) = split(' ', $_);
    $pron =~ tr</ _,'><>d;
    print OUT $word, "\t", $pron, "\n";
     # tab makes a nice delimiter
  }

The resulting file, mpron.dat, looks like:

  ...
  accipitrine &ksIpItrIn
  Accius &kSi@s
  acclaim @kleIm
  acclamation &kl@meIS@n
  acclamation_medal &kl@meIS@nmEd-l
  acclamatory @kl&m@toUri
  acclimate @klaImIt
  acclimation &kl@meIS@n
  acclimation_fever &kl@meIS@nfiv@r
  acclimatise @klaIm@taIz
  acclimatize @klaIm@taIz
  acclivity @klIvIti
  ...

=head2 Searching The Prepared Data

So, with that file prepared, we can grep it for whatever pattern we want in the pronunciation. Suppose we're still after a three-syllable word that rhymes with "toad".

The idea of rhyme in English is a pretty straightforward matter: if two words rhyme, this means they end in the same sounds (generally the last vowel and any consonants following it). If I were quite familiar with the phonetic notation for Moby Pronunciator (or whatever alternate pronunciation database you might use), I could, off the top of my head, say how to represent the sound "-oad" (from "toad"). However, I've never bothered, since it's so easy to just look up the word you want to rhyme with, and see how it's represented:

  % grep '^toad' mpron.dat
  toad toUd
  toad's-mouth toUdzmouT
  toadeater toUdit@r
  toadfish toUdfIS
  toadflax toUdfl&ks
  toadstone toUdstoUn
  toadstool toUdstul
  toadstool_disease toUdstuldIziz
  toady toUdi

So C<oUd$> it is!

  % grep 'oUd$' mpron.dat
  abode @boUd
  access_road &ksEsroUd
  acnode &knoUd
  Aeolian_mode ioUli@nmoUd
  alamode &l@moUd
  Alexis_Claude AlEksikloUd
  all-hallowed Olh&loUd
  anchor_rode &Nk@rroUd

...and 281 other matches, ending with zip_code. But there's so many because we've not limited that to three-syllable words. So how do we do that?

=head2 Counting Syllables

As with most models of syllables in most languages, an English syllable is basically a vowel sound with some number of consonants before and/or after it. Now, actually settling on what consonants go with what vowels is a sticky subject (is "rostrum" ros-trum or rost-rum, or what?), but since all we want to do now is I<count> the syllables, we merely need to count the number of vowel sounds.
You've seen that some vowel sounds, like the long "o" sound in "toad", are represented by a pair of ASCII characters, "oU". That means that we can't simply count the number of vowel characters in the pronunciation string, because then "oU" would count as two. We could count the number of times we find a I<sequence> of some number of vowel characters, but that would match only once in each of these two-syllable words:

  eon i@n      (one sequence: "i@")
  Noah noU@    (one sequence: "oU@")

(The C<@> character here represents the "uh" sound in unstressed syllables.)

However, if we go back to the format that the original Moby Pronunciator file is in (as opposed to our cooked mpron.dat file), we see that those slashes can do us some good:

  eon '/i//@/n
  Noah 'n/oU//@/

One thing that is consistent about slashes in the input file is that they are there (at least one of them) between vowels in different syllables, as above. So where the vowels of the two syllables in "Noah" run together in our prepared file, they are still separate in the original file. This means that if we start with the original form of the pronunciation entry, and then count the number of occurrences of sequences of vowel characters, like so...

  eon '/i//@/n      (two sequences: "i", "@")
  Noah 'n/oU//@/    (two sequences: "oU", "@")

...then we get a correct syllable count. All we need to know now is what "vowel characters" means. The Moby Pronunciator documentation says that it uses all of the following characters (or sequences of them):

  a e i u o A E I O U y Y & @ -

So we can count syllables by seeing how often this matches:

  m/[-\&yYaeiouAEIOU\@]+/g

We can simply write that into our program that produces mpron.dat, by matching it against C<$pron> before we go deleting the slashes.
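Here's a rough sketch of that counting step on its own, just to show the idea. The sub name C<count_syllables> is something I made up for illustration; it isn't part of the converter program. A C</g> match in list context returns every vowel cluster it finds, so the size of that list is the syllable count:

```perl
use strict;
use warnings;

# Hypothetical helper (name invented for this sketch): count the
# vowel-character clusters in a raw pronunciation, slashes intact.
sub count_syllables {
  my ($raw_pron) = @_;
  # In list context m//g returns all the matches; assigning them to
  # an empty list in scalar context gives us how many there were.
  my $count = () = $raw_pron =~ m/[-\&yYaeiouAEIOU\@]+/g;
  return $count;
}

# Raw entries as quoted in this article (q{} avoids quoting trouble
# with the apostrophes and at-signs in the notation)
for my $line (
  q{eon '/i//@/n},
  q{Noah 'n/oU//@/},
  q{acclamatory /@/'kl/&/m/@/,t/oU/r/i/},
) {
  my ($word, $pron) = split(' ', $line);
  print "$word: ", count_syllables($pron), " syllables\n";
}
```

Fed the cooked form C<noU@> instead, the same sub would report just one cluster, which is exactly why the counting has to happen before the slashes are deleted.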
=head2 Coping with (Syllabic) Stress

Let's say this three-syllable word to rhyme with "toad" is needed, not merely for its austere artistic potency, but because we need it to complete our Baudelairean opus magnum which ends:

  I chanced upon a lovely toad,
  It gleamed and danced like _____!

In technical terms, you've got eight-syllable lines, with this metrical pattern (where slash means stressed, and underscore means unstressed):

  I chanced upon a lovely toad,
  _    /    _ /  _  /  _    /

  It gleamed and danced like _______!
  _     /     _     /    _    / _ /

So not only do you want the word you're after to have three syllables, but you want it to have a particular stress pattern. A word like:

  electrode
  _  /   _

has exactly the wrong stress pattern, even though it I<is> three syllables long, and rhymes with "toad". (That's aside from "danced like electrode" being a bit ungrammatical -- but hey, this is I<poetry>.)

While we're about to rebuild mpron.dat to have a field for each entry's syllable count in it, we might as well try to note syllable stress patterns too. Look at how stress is noted in the original data file, with commas and apostrophes:

{Eds: bold the commas and apostrophes in this block}

  acclamatory /@/'kl/&/m/@/,t/oU/r/i/
  acclimate /@/'kl/aI/m/I/t
  acclimation ,/&/kl/@/'m/eI//S//@/n

Unfortunately, the apostrophe or comma that marks the following syllable as stressed (with primary or secondary stress) isn't right before the vowel that we'd match in order to count that syllable. If it I<were>, we could come up with a single regexp that would match any vowel cluster as well as its stress notation:

  m/([,']?)[-\&yYaeiouAEIOU\@]+/g

and each time that matches, we'd just look in C<$1> to see what kind of stress this syllable would have. However, that's not the way the data is. As it is, we have to match the stress marks wherever they are, and then set a flag so that the following syllable will be marked as stressed (and in the absence of the flag, it will be marked unstressed).
We can combine this with the syllable counter that works its way through the word, based around this regexp:

{Eds: bold the comma and apostrophe}

  m/[',]|[-\&yYaeiouAEIOU\@]+/g

...which we can work into part of the main loop for our converter program, so that it can cook up for each line in mpron.dat a field representing the meter of each word.

{Footnote off of "meter" in that last paragraph: Usually "meter" is used for talking about the consistent stress pattern of whole lines of poetry -- but I'm using it here to refer to just the stress pattern of particular words, mostly because C<$meter> is easier to type than C<$metrical_structure> or C<$stress_pattern>! }

  while(<IN>) {
    chomp;
    ($word, $pron) = split(' ', $_);
    $meter = '';
     # This is where we'll stack up a
     #  '0', '1', or '2', one for each
     #  vowel-character-group in this
     #  word, as seen in $pron
    $next_stress_flag = '0'; # initial value
    foreach my $x (
      $pron =~ m/[',]|[-\&yYaeiouAEIOU\@]+/g
       # loop over the vowels and accent marks
       #  in $pron -- before we go changing $pron!
    ) {
      if($x eq ',') {
        $next_stress_flag = '2'; # secondary stress
      } elsif($x eq "'") {
        $next_stress_flag = '1'; # primary stress
      } else {
        # It's a vowel
        $meter .= $next_stress_flag;
         # Note it as another syllable
        $next_stress_flag = '0';
         # Clear flag for next time
      }
    }
    # okay, NOW we can change it
    $pron =~ tr</ _,'><>d;
    # field order: word, meter, pronunciation
    print OUT join("\t", $word, $meter, $pron), "\n";
  }

In case that whole business of C<$next_stress_flag> getting set in one iteration for use in the next doesn't make much sense to you, here's a rough English summary of how C<$meter> is devised for each word:

{Italic blockquote style, or something}

Each time a vowel-character cluster is found in this word's C<$pron>, add a character to C<$meter> representing the stress level of this syllable. If this syllable was preceded by an apostrophe, note this syllable as "1". If it was preceded by a comma, note this syllable as a "2". Otherwise, note it as a "0".
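If you'd like to check that summary against a few entries by hand, here's the same flag-passing logic wrapped up as a standalone sub. Take it as a sketch: the name C<meter_of> is made up for this example, and the three raw pronunciations are the ones quoted earlier in this article.

```perl
use strict;
use warnings;

# Hypothetical wrapper (name invented here) around the flag logic:
# returns one digit per syllable, 1 = primary stress,
# 2 = secondary stress, 0 = unstressed.
sub meter_of {
  my ($raw_pron) = @_;
  my $meter       = '';
  my $next_stress = '0';
  for my $token ($raw_pron =~ m/[',]|[-\&yYaeiouAEIOU\@]+/g) {
    if    ($token eq ',') { $next_stress = '2' }  # secondary stress mark
    elsif ($token eq "'") { $next_stress = '1' }  # primary stress mark
    else {
      $meter .= $next_stress;  # a vowel cluster: one more syllable
      $next_stress = '0';      # clear the flag for the next syllable
    }
  }
  return $meter;
}

# Raw entries as they appear in the original wordlist excerpt
my %raw = (
  accipitrine => q{/&/k's/I/p/I/tr/I/n},
  acclamation => q{,/&/kl/@/'m/eI//S//@/n},
  acclimatize => q{/@/'kl/aI/m/@/,t/aI/z},
);
for my $word (sort keys %raw) {
  printf "%-12s %s\n", $word, meter_of($raw{$word});
}
```

Note that the flag carries across any consonants between the stress mark and the vowel, which is the whole reason for doing it this way instead of with a single capturing regexp.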
What this whole bother then gives us is an mpron.dat file that now looks like this:

  accipitrine 0100 &ksIpItrIn
  Accius 100 &kSi@s
  acclaim 01 @kleIm
  acclamation 2010 &kl@meIS@n
  acclamation_medal 201010 &kl@meIS@nmEd-l
  acclamatory 01020 @kl&m@toUri
  acclimate 010 @klaImIt
  acclimation 2010 &kl@meIS@n
  acclimation_fever 201010 &kl@meIS@nfiv@r
  acclimatise 0102 @klaIm@taIz
  acclimatize 0102 @klaIm@taIz
  acclivity 0100 @klIvIti

...tab-separated, three fields to a line. If we merely want to know the number of syllables in a word, we just count the number of characters in the second field. But if we want to know more (like, to stipulate the stress pattern of those syllables), we have the data to do that, too.

Now, to resume, recall that we're looking for a word that meets these criteria:

* rhymes with "toad",

* is three syllables long,

* and those three syllables have to have the stress pattern B</_/> (stressed, unstressed, stressed).

We figured out that we could formalize "rhymes with 'toad'" as a matter of matching the regexp C<oUd$>. But when it comes to matching the stress pattern of the word, we're thinking in terms of stressed and unstressed -- a two-term distinction -- but the data we've got (here from Moby Pronunciator, but most pronunciation databases do it this way) represents stress in terms of primary stress, secondary stress, and unstressed -- a three-term distinction. After some experimentation, I settled on this as the best way to reconcile these two systems:

When I say "stressed": I mean having primary ("1") or secondary ("2") stress.

When I say "unstressed": I mean having secondary ("2") stress, or no stress ("0").

So we can now formulate "I want the word to go DUM-duh-DUM" as a matter of its meter string matching the regexp C<^[12][02][12]$>.
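To see what that two-way reading of "2" does in practice, here's a small sketch that tries the DUM-duh-DUM regexp against a handful of meter strings taken from entries shown in this article. The variable names are just for illustration.

```perl
use strict;
use warnings;

# DUM-duh-DUM: stressed (1 or 2), unstressed (0 or 2), stressed (1 or 2)
my $dum_duh_dum = qr/^[12][02][12]$/;

# Word => meter pairs as they appear in this article's listings
my %meter_of_word = (
  microcode   => '102',
  discommode  => '201',
  Kozhikode   => '101',
  acclimate   => '010',
  acclimatize => '0102',
);

for my $word (sort keys %meter_of_word) {
  my $meter = $meter_of_word{$word};
  printf "%-12s %-5s %s\n", $word, $meter,
    ($meter =~ $dum_duh_dum ? 'fits' : 'does not fit');
}
```

The first three fit; "acclimate" fails because it starts unstressed, and "acclimatize" fails because the anchors insist on exactly three syllables. Since "2" sits in both character classes, a secondary stress can stand in for either a DUM or a duh, which is the leniency described above.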
Now, to pull off a search with these criteria, we could go back to our command-line grep pattern of

  % grep 'oUd$' mpron.dat

and amend it with:

  % grep 'oUd$' mpron.dat | grep '[12][02][12]' | more

but all this grepping is getting rather cumbersome, and won't work terribly nicely with increasingly complex search patterns. In the end, it'd be so much simpler if we just wrote a custom (and therefore customizable!) search tool in Perl.

=head2 A Simple C<mpron.dat> Searcher

Since there's three fields in our database, it makes sense to be able to give search criteria for each or any of those three fields. And currently, using regular expressions seems the most powerful way to stipulate search patterns. So each of our searches could be thought of as specified by three regular expressions: the first to match the spelling form of the word (probably not your primary interest, but it could be useful), the second to match the meter of the word, and the third to match the representation of the pronunciation of the word.

So I figure this search tool (which we might as well just call C<mpron>) could have the command line syntax:

  mpron word_re meter_re pron_re

...with the assumption that if we stipulate nothing for one or any of these regexps, then we're not imposing any limitation on that field. So "rhymes with toad" would be just a matter of:

  mpron '' '' 'oUd$'

We can implement this simply with a program consisting of:

  ($word_re, $meter_re, $pron_re) = @ARGV[0,1,2];
  open(IN, '<mpron.dat') or die $!;
  # First report what we're searching for
  print "# Word RE: <$word_re> ",
        "Meter RE: <$meter_re> ",
        "Pron RE: <$pron_re>\n";
  # Then looping over every line
  while(<IN>) {
    chomp;
    @bits = split(/\t/, $_);
     # word, meter, pronunciation
    print $_, "\n" # the matching line
     if ...it meets all our criteria...
  }

{Eds: put that line starting "...it" in italics}

Now, how do we formalize "it meets all our criteria"?
We could just say:

  if $bits[0] =~ m/$word_re/oi
      # /o for "compile this regexp once",
      # /i for case insensitive -- I figure
      #  that'd be useful for just $word_re
   && $bits[1] =~ m/$meter_re/o
   && $bits[2] =~ m/$pron_re/o

(where C<@bits> holds the three tab-separated fields of the current line), but that makes sense only if we've provided all three criteria. We don't want to bother trying to match an element of C<@bits> against the contents of a variable like C<$meter_re> if there's nothing in that variable (i.e., if the search criterion it corresponds to is no criterion at all). So what we mean is that for each kind of test, we want the comparison to succeed if either 1) there was no search criterion, or 2) there was a criterion, and it matches.

In terms of logical operators, this is an "or" relationship. Specifically, pass this test if:

  there was no criterion specified
  OR
  it does pass the criterion

And passing each of the three criteria is a matter of matching the appropriate regexp, as with:

  $bits[1] =~ m/$meter_re/o

As to how to express "there was no criterion specified", we can simply test the string length of the variable we'd look for the regexp in:

  !length($meter_re)

...which is true when C<$meter_re> holds a zero-length string. Put it all together and you get:

  !length($meter_re) || $bits[1] =~ m/$meter_re/o

and, for all the tests put together:

  print $_, "\n"
   if (!length($word_re)  || $bits[0] =~ m/$word_re/oi)
   && (!length($meter_re) || $bits[1] =~ m/$meter_re/o)
   && (!length($pron_re)  || $bits[2] =~ m/$pron_re/o );

Incidentally, you can use the C<and> operator (the low-precedence variant of C<&&>) to minimize the number of parentheses there, if you're comfortable doing so:

  print $_, "\n"
   if  !length($word_re)  || $bits[0] =~ m/$word_re/oi
   and !length($meter_re) || $bits[1] =~ m/$meter_re/o
   and !length($pron_re)  || $bits[2] =~ m/$pron_re/o ;

And that's all we've got to do to have a fully featured program to search any of mpron.dat's fields. So let's put it to work.
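If you want to try that test without building the whole data file first, here's a sketch of it as a sub run over a few in-memory lines. The name C<line_matches> and the sample array are made up for this example; the sample entries themselves are ones shown in this article. I've left off the C</o> switch here, since a sub like this might get called again with different patterns, and C</o> would freeze the first ones it saw.

```perl
use strict;
use warnings;

# Hypothetical helper (not the mpron program itself): true if a
# "word<TAB>meter<TAB>pron" line passes all three criteria, where an
# empty criterion means "no restriction on that field".
sub line_matches {
  my ($line, $word_re, $meter_re, $pron_re) = @_;
  my @bits = split(/\t/, $line, 3);
  my $ok = (!length($word_re)  || $bits[0] =~ m/$word_re/i)
        && (!length($meter_re) || $bits[1] =~ m/$meter_re/)
        && (!length($pron_re)  || $bits[2] =~ m/$pron_re/);
  return $ok ? 1 : 0;
}

# A few sample lines in mpron.dat's three-field format
my @lines = map { join("\t", @$_) } (
  ['microcode',  '102', 'maIkroUkoUd'],
  ['discommode', '201', 'dIsk@moUd'],
  ['acclimate',  '010', '@klaImIt'],
  ['acclaim',    '01',  '@kleIm'],
);

# "Three syllables, DUM-duh-DUM, rhymes with toad"
for my $line (@lines) {
  print $line, "\n"
    if line_matches($line, '', '^[12][02][12]$', 'oUd$');
}
```

Only the "microcode" and "discommode" lines get printed. The C<!length(...)> guard earns its keep in another way, too: in Perl, a pattern that evaluates to the empty string doesn't mean "match anything", it means "reuse the last successfully matched regexp", so skipping the match entirely for an empty criterion sidesteps that surprise.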
Our command line for "find a three-syllable word, rhyming with 'toad', and having a DUM-duh-DUM stress pattern" would be simply:

  mpron '' '^[12][02][12]$' 'oUd$'

The ^ and $ in C<^[12][02][12]$> are there so that the stress pattern string must consist entirely of that stress pattern, instead of merely having that stress pattern in the word somewhere.

So here we go!

  % mpron '' '^[12][02][12]$' 'oUd$'
  # Word RE: <> Meter RE: <^[12][02][12]$> Pron RE: <oUd$>
  alamode 102 &l@moUd
  antinode 102 &ntInoUd
  antipode 102 &ntIpoUd
  arillode 102 &r@loUd
  autocode 102 At@koUd
  a_la_mode 201 &l@moUd
  calicoed 102 k&l@koUd
  discommode 201 dIsk@moUd
  episode 102 EpIsoUd
  hemipode 102 hEmIpoUd
  incommode 201 Ink@moUd
  internode 102 Int@rnoUd
  keratode 102 kEr@toUd
  Kozhikode 101 koUZIkoUd
  manucode 102 m&nj@koUd
  megapode 102 mEg@poUd
  microcode 102 maIkroUkoUd
  nematode 102 nEm@toUd
  Nesselrode 102 nEs@lroUd
  overstowed 201 oUv@rstoUd
  palinode 102 p&lInoUd
  pigeon-toed 102 pIdZ@ntoUd
  porticoed 102 poUrt@koUd
  staminode 102 st&m@noUd
  superload 102 sup@rloUd
  trematode 102 trEm@toUd
  waggonload 102 w&g@nloUd

Poetry in motion -- or rather, in automation!

=head2 Accommodating Another Notation

One minor quibble, though: it's a bit cumbersome converting our B</_/> (DUM-duh-DUM) notation into the regexp C<^[12][02][12]$>. We should be able to have our program accept that notation. We can do that by just adding, very early in our program, some code that would convert from that notation (if that's what it sees) into regexp notation. Namely:

  if($meter_re =~ m<^[/_]+$>) {
    # If the string consists entirely of
    #  slashes and underscores...
    $meter_re =~ s</><[12]>g;
    $meter_re =~ s<_><[20]>g;
    $meter_re = '^' . $meter_re . '$';
  }

So this would translate B</_/> to C<^[12][20][12]$> (which matches just the same strings as C<^[12][02][12]$>) as the second argument, the one for matching the meter, as you can see here:

  % mpron '' '/_/' 'oUd$' | less
  # Word RE: <> Meter RE: <^[12][20][12]$> Pron RE: <oUd$>
  alamode 102 &l@moUd
  ...and so on...
{Eds: bold the ^[12][20][12]$ there}

By the way, if you want B</_/> to mean "ends in DUM-duh-DUM" instead of specifically "consists entirely of DUM-duh-DUM", then you could change that last line to this instead:

  $meter_re = $meter_re . '$';
   # no '^' at the beginning

The only question left to answer is: what exactly did our poetic toad gleam and dance like? No program can tell you which of the twenty-seven matching words (three-syllable, B</_/>, rhyming with "toad") that we found is best, but given the circumstances, the choice is clear:

  I chanced upon a lovely toad,
  It gleamed and danced like microcode!

__END__

Sean M. Burke uses Perl and the principles of Vogon poetics to develop haiku of immense destructive power. Information on downloading a copy of the free Moby Pronunciator database is available at http://www.netadventure.net/~sburke/bounce.cgi/mpron/ along with the text of the programs described here.

=cut

__END__