# Time-stamp: "1999-10-08 01:11:47 MDT sburke@stonehenge.netadventure.net"

=head1 Searching for Rhymes with Perl

=head2 Sean M. Burke

  La poésie doit être faite par tous.
  Poetry is for I<everyone> to make.
   -- Lautréamont (Isidore Ducasse, 1846-1870)

Wherever I go, people always come up to me and say "Sean, you gotta
help me -- I need to find a three-syllable word that rhymes with
'toad'."  And my answer is always the same; I always say "well, we're
going to have to pull out the Perl for this one!"

Because, while I<The Perl Journal> articles constantly demonstrate
that Perl is good at everything from designing sundials to peppering
IRC with Eliza bots, one thing that it's I<really really> good at is
making short little programs for searching text.  And that's what this
article is about -- how to search text (specifically wordlists or
pronunciation databases) for rhymes of various kinds.

=head2 Where To Look

If this article were about rhyming in Spanish or Italian, or Finnish,
it'd be a whole lot simpler!  Because, for the most part, the way
something is spelled in these languages tells you pretty well how to
pronounce it; ending with the same letters may not be exactly the same
thing as rhyming, but often you can start with the spelling and apply
some trivial string replacement operations to get a phonetic form that
can be searched for the presence of a rhyme.  This can work even with
French, where (for the most part) spelling tells you
pronunciation, even though the pronunciation won't tell you the
spelling.

However, English sure isn't that kind of language -- not only does the
English pronunciation of a word I<not> tell you how to spell it, its
spelling doesn't tell you how to pronounce it.  But luckily, lexicons
exist that are basically simple databases, mapping the
normal written form of a word to some representation of its
pronunciation.  One of my favorite lexicons (partly because it's
I<free!>) is Moby Pronunciator.  It consists of about 177,000 entries,
one word to a line, that look like this:

  ...
  accipitrine /&/k's/I/p/I/tr/I/n
  Accius '/&/k/S//i//@/s
  acclaim /@/'kl/eI/m
  acclamation ,/&/kl/@/'m/eI//S//@/n
  acclamation_medal ,/&/kl/@/'m/eI//S//@/n_'m/E/d/-/l
  acclamatory /@/'kl/&/m/@/,t/oU/r/i/
  acclimate /@/'kl/aI/m/I/t
  acclimation ,/&/kl/@/'m/eI//S//@/n
  acclimation_fever ,/&/kl/@/'m/eI//S//@/n_'f/i/v/@/r
  acclimatise /@/'kl/aI/m/@/,t/aI/z
  acclimatize /@/'kl/aI/m/@/,t/aI/z
  acclivity /@/'kl/I/v/I/t/i/
  ...

Without bothering just now with the values of these symbols, you can
see that (as the README will tell you), the format of each line is the
word (or underscore-separated multiword phrase, like
"acclamation_medal"), then a space, then the phonetic notation.  What
the slashes mean (and why there isn't one between the /k/ and /l/ in
"acclaim", etc.) is something I'm unsure of.  But I am sure that these
slashes are annoying, and get in the way of me trying to actually
search, since I have to always remember to stick them in my search
patterns, always worrying that I stuck in one too many.  And the same
goes for the commas and apostrophes, which indicate stress, since when
I'm looking for a rhyme, I may not care about stress.

=head2 Preparing the Data

So the first thing to do, whether it's for the Moby Pronunciator
wordlist in specific, or for any other wordlist you choose to use
instead, is to strip out the parts you don't want and to take what's
left and format it the way you like.  Here we can just do that by
deleting certain tokens in the pronunciation part:

  slashes (used to separate phonemes?)
  spaces and underscores (used to separate words)
  apostrophes (used to precede syllables with primary stress)
  commas (used to precede syllables with secondary stress)

Since these tokens are all single characters, we can delete them by
just applying a C<tr> operator, with the C<d> switch ("d" for delete),
to slashes, spaces, underscores, commas, and apostrophes:

  tr/\/ _,'//d;

Personally, I find it disconcerting to have the backslash-escaped
slash in there, so I tend to use different delimiters, like matching
wedges, for C<tr> to do just the same thing:

  tr</ _,'><>d;
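If you'd like to convince yourself that the two spellings really do
the same job, here's a minimal sketch that runs both over one of the
entries shown earlier.  The variable names are made up for this
illustration only.

```perl
use strict;
use warnings;

# One raw pronunciation, copied from the excerpt above
# (the entry for "acclamation_medal").
my $raw = q{,/&/kl/@/'m/eI//S//@/n_'m/E/d/-/l};

# Copy the string, then delete slashes, spaces, underscores,
# commas, and apostrophes, using the backslash-escaped form.
(my $escaped = $raw) =~ tr/\/ _,'//d;

# The same deletion, with matching wedges as delimiters.
(my $wedged = $raw) =~ tr</ _,'><>d;

print "$escaped\n$wedged\n";
```

Both lines of output should be the same cooked string, with only the
phoneme symbols left.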

Either way, you can build this into a program that reads the Moby
Pronunciator database:

  open(IN, '<mobypron.unc') or die $!;
  while(<IN>) {
    chomp;
    ($word, $pron) = split(' ', $_);
    $pron =~ tr</ _,'><>d;
    # ... then do something with $word and $pron ...
  }

Now, to get to searching this database for rhymes (or any other
phonetic information).  There's two ways to go about it:

* use the code above, and once you've modified C<$pron>, search it for
a pattern; or:

* write C<$word> and the modified C<$pron> to a file, and then use
grep on that file.
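For comparison, the first approach might look something like the
sketch below.  To keep it self-contained, it loops over a handful of
entries copied from the excerpt above instead of opening mobypron.unc,
and the pattern it searches for (words ending in the sounds of
"-ation") is just an example I picked.

```perl
use strict;
use warnings;

# A few raw entries from the excerpt above, standing in
# for the lines of the real database file.
my @lines = (
  q{acclaim /@/'kl/eI/m},
  q{acclamation ,/&/kl/@/'m/eI//S//@/n},
  q{acclimate /@/'kl/aI/m/I/t},
  q{acclimation ,/&/kl/@/'m/eI//S//@/n},
);

my @hits;
foreach my $line (@lines) {
  my ($word, $pron) = split(' ', $line);
  $pron =~ tr</ _,'><>d;    # cook the pronunciation
  # The @ has to be escaped in a regexp, or Perl will try
  # to interpolate an array called @n.
  push @hits, $word if $pron =~ m/eIS\@n$/;
}

print "$_\n" for @hits;
```

Every search repeats the split and the C<tr> for every line, which is
the cost to keep in mind when the input is the whole database.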

The benefit of the former is simplicity, but the benefit of the latter
is efficiency -- no need to constantly chomp, split, and C<tr> for
each line.  Now, normally I say that program (as opposed to
programmer) efficiency is overvalued in programming; but in this case,
since the Moby wordlist is so very large, and since that makes the
first approach so wasteful, I say the second approach is the one to take.
So we can save each line's C<$word> and C<$pron> values to a file
called "mpron.dat", like so:

  open(IN, '<mobypron.unc') or die $!;
  open(OUT, '>mpron.dat') or die $!;
  while(<IN>) {
    chomp;
    ($word, $pron) = split(' ', $_);
    $pron =~ tr</ _,'><>d;
    print OUT $word, "\t", $pron, "\n";
     # tab makes a nice delimiter
  }

The resulting file, mpron.dat, looks like:

  ...
  accipitrine         &ksIpItrIn
  Accius              &kSi@s
  acclaim             @kleIm
  acclamation         &kl@meIS@n
  acclamation_medal   &kl@meIS@nmEd-l
  acclamatory         @kl&m@toUri
  acclimate           @klaImIt
  acclimation         &kl@meIS@n
  acclimation_fever   &kl@meIS@nfiv@r
  acclimatise         @klaIm@taIz
  acclimatize         @klaIm@taIz
  acclivity           @klIvIti
  ...

=head2 Searching The Prepared Data

So, with that file prepared, we can grep it for whatever pattern we
want in the pronunciation.  Suppose we're still after a three-syllable
word that rhymes with "toad".  The idea of rhyme in English is a
pretty straightforward matter: if two words rhyme, this means they end
in the same sounds (generally the last vowel and any consonants
following it).  If I were quite familiar with the phonetic notation
for Moby Pronunciator (or whatever alternate pronunciation database
you might use), I could, off the top of my head, say how to represent
the sound "-oad" (from "toad").  However, I've never bothered, since
it's so easy to just look up the word you want to rhyme with, and see how
it's represented:

  % grep '^toad' mpron.dat
  toad                toUd
  toad's-mouth        toUdzmouT
  toadeater           toUdit@r
  toadfish            toUdfIS
  toadflax            toUdfl&ks
  toadstone           toUdstoUn
  toadstool           toUdstul
  toadstool_disease   toUdstuldIziz
  toady               toUdi

So C<oUd> it is!
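If you'd rather not eyeball the ending yourself, "the last vowel and
any consonants following it" can be roughly approximated in code.
This is only a sketch: C<rhyme_tail> is a name I made up, the vowel
characters are the ones the Moby Pronunciator documentation lists
(C<a e i u o A E I O U y Y & @ ->), and since the cooked form lets
adjacent vowels run together, a word like "Noah" (C<noU@>) would come
out with both of its vowels in the tail.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last run of vowel characters
# in a cooked pronunciation, plus any consonants after it.
sub rhyme_tail {
  my ($pron) = @_;
  return $pron =~ m/([-\&yYaeiouAEIOU\@]+[^-\&yYaeiouAEIOU\@]*)$/
    ? $1 : '';
}

# Cooked pronunciations copied from the listings in this article.
my %pron_of = (
  abode     => '@boUd',
  acnode    => '&knoUd',
  toady     => 'toUdi',
  toadstool => 'toUdstul',
);

my $tail = rhyme_tail('toUd');    # the entry for "toad"

# \Q...\E keeps characters like @ or - in the tail from
# being treated as regexp syntax.
my @rhymes = grep { $pron_of{$_} =~ m/\Q$tail\E$/ } sort keys %pron_of;

print "tail: $tail\n";
print "$_\n" for @rhymes;
```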

  % grep 'oUd$' mpron.dat
  abode           @boUd
  access_road     &ksEsroUd
  acnode          &knoUd
  Aeolian_mode    ioUli@nmoUd
  alamode         &l@moUd
  Alexis_Claude   AlEksikloUd
  all-hallowed    Olh&loUd
  anchor_rode     &Nk@rroUd

...and 281 other matches, ending with zip_code (C<zIpkoUd>).  But
there's so many because we've not limited that to three-syllable
words.  So how do we do that?

=head2 Counting Syllables

As with most models of syllables in most languages, an English
syllable is basically a vowel sound with some number of consonants
before and/or after it.  Now, actually settling on what consonants go
with what vowels is a sticky subject (is rostrum C<rAs-tr@m> or
C<rA-str@m>, or what?), but since all we want to do now is I<count> the
syllables, we merely need to count the number of vowel sounds.

You've seen that some vowel sounds, like the long "o" sound in "toad"
are represented by a pair of ASCII characters, "oU".  That means that
we can't simply count the number of vowel characters in the
pronunciation string, because then "oU" would count as two.  We could
count the number of times we find a I<sequence> of some number of
vowel characters, but that would match only once in each of these
two-syllable words:

  eon   i@n   (one sequence: "i@")
  Noah  noU@  (one sequence: "oU@")

(The C<@> character here represents the "uh" sound in unstressed
syllables.)  However, if we go back to the format that the original
Moby Pronunciator file is in (as opposed to our cooked mpron.dat
file), we see that those slashes can do us some good:

  eon '/i//@/n
  Noah 'n/oU//@/

One thing that is consistent with slashes in the input file is that
they are there (at least one of them) between vowels in different
syllables, as above.  So where the vowels in the two syllables in "Noah"
in our prepared file run together, they are still separate in the
original file.  This means that if we start with the original form of
the pronunciation entry, and then count the number of occurrences of
sequences of vowel characters, like so...

  eon '/i//@/n    (two sequences: "i", "@")
  Noah 'n/oU//@/  (two sequences: "oU", "@")

...then we get a correct syllable count.  All we need to know now is
what "vowel characters" means.  The Moby Pronunciator documentation
says that it uses all of the following characters (or sequences of
them):

  a e i u o  A E I O U  y Y  & @ - 
 
So we can count syllables by seeing how often this matches:

  m/[-\&yYaeiouAEIOU\@]+/g

We can simply write that into our program that produces mpron.dat, by
matching it against C<$pron> before we go deleting the slashes.
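Here's a small sketch of that count on its own, so you can see the
difference the slashes make.  The subroutine name C<count_syllables>
is just something I've made up for the illustration.

```perl
use strict;
use warnings;

# Count the vowel-character clusters in a pronunciation string.
sub count_syllables {
  my ($pron) = @_;
  # Assigning to an empty list forces list context, so the
  # scalar on the left ends up holding the number of matches.
  my $count = () = $pron =~ m/[-\&yYaeiouAEIOU\@]+/g;
  return $count;
}

# Raw forms, slashes intact, as in the original file:
print "eon (raw):     ", count_syllables(q{'/i//@/n}),  "\n";
print "Noah (raw):    ", count_syllables(q{'n/oU//@/}), "\n";

# Cooked form of "eon", where the two vowels have run together:
print "eon (cooked):  ", count_syllables(q{i@n}),       "\n";
```

The raw forms each count as two syllables; the cooked form of "eon"
counts as only one, which is exactly the mistake we're avoiding by
counting first and deleting slashes afterward.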

=head2 Coping with (Syllabic) Stress

Let's say this three-syllable word to rhyme with "toad" is needed, not
merely for its austere artistic potency I<as such>, but because we
need it to complete our Baudelairean opus magnum which ends:

  I chanced upon a lovely toad,
  It gleamed and danced like _____!

...I<DUM-duh-DUM>.  In technical terms, you've got eight-syllable
lines, with this metrical pattern (where slash means stressed, and
underscore means unstressed):

  I chanced upon a lovely toad,
  _   /     _ /  _  /   _  /

  It gleamed and danced like _______!
  _    /     _    /      _   / _ /

So not only do you want the word you're after to have three syllables,
but you want it to have a particular stress pattern.  A word like:

  electrode
  _ /   _

has exactly the wrong stress pattern, even though it I<is> three
syllables long, and rhymes with "toad".  (That's aside from "danced
like electrode" being a bit ungrammatical -- but hey, this is
I<poetry!>).  While we're about to rebuild mpron.dat to have a field
for each entry's syllable count in it, we might as well try to note
syllable stress patterns too.

Look at how stress is noted in the original data file, with commas and
apostrophes:
{Eds: bold the commas and apostrophes in this block}

  acclamatory /@/'kl/&/m/@/,t/oU/r/i/
  acclimate /@/'kl/aI/m/I/t
  acclimation ,/&/kl/@/'m/eI//S//@/n

Unfortunately, the apostrophe or comma that marks the following
syllable as stressed (with primary or secondary stress) isn't right
before the vowel that we'd match in order to count that syllable.  If
it I<were,> we could come up with a single regexp that would match any
vowel cluster as well as its stress notation:

  m/([,']?)[-\&yYaeiouAEIOU\@]+/g

and each time that matches, we just look in C<$1> to see what kind of
stress this syllable would have.  However, that's not the way the data
is.

As it is, we have to match the stress marks wherever they are, and
then set a flag so that the following syllable will be marked as
stressed (and in the absence of the flag, it will be marked
unstressed).  We can combine this with the syllable counter that works
its way thru the word, based around this regexp:

{Eds: bold the comma and apostrophe}

  m/[',]|[-\&yYaeiouAEIOU\@]+/g

...which we can work into part of the main loop for our converter
program, so that it can cook up for each line in mpron.dat a field
representing the meter of each word.

{Footnote off of "meter" in that last paragraph:
Usually "meter" is used for talking about the consistent stress
pattern of whole lines of poetry -- but I'm using it here to refer to
just the stress pattern of particular words, mostly because C<$meter>
is I<much> easier to type than C<$metrical_structure> or
C<$stress_pattern>!
}

  while(<IN>) {
    chomp;
    ($word, $pron) = split(' ', $_);

    $meter = '';
     # This is where we'll stack up a
     # '0', '1', or '2', one for each
     # vowel-character-group in this
     # word, as seen in $pron
    
    $next_stress_flag = '0'; # initial value
    foreach my $x (
     $pron =~ m/[',]|[-\&yYaeiouAEIOU\@]+/g
      # loop over the vowels and accent marks
      # in $pron -- before we go changing $pron!
    ) {
      if($x eq ',') {
        $next_stress_flag = '2'; # secondary stress
      } elsif($x eq "'") {
        $next_stress_flag = '1'; # primary stress
      } else {
        # It's a vowel
        $meter .= $next_stress_flag;
         # Note it as another syllable
        $next_stress_flag = '0';
         # Clear flag for next time
      }
    }

    # okay, NOW we can change it
    $pron =~ tr</ _,'><>d;

    print OUT
     join("\t", $word, $meter, $pron), "\n";
  }

In case that whole business of C<$next_stress_flag> getting set in one
iteration for use in the next doesn't make much sense to you, here's a
rough English summary of how C<$meter> is devised for each word:

{Italic blockquote style, or something}

  Each time a vowel-character cluster is found in this word's
  C<$pron>, add a character to C<$meter> representing the stress level
  of this syllable.  If this syllable was preceded by an apostrophe,
  note this syllable as "1".  If it was preceded by a comma, note this
  syllable as a "2".  Otherwise, note it as a "0".
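As a sanity check on that summary, the same logic can be wrapped up in
a subroutine and run against a few of the raw entries shown earlier.
This is just a sketch; C<meter_of> is a name I've invented for it.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the meter string from a raw
# (uncooked) Moby Pronunciator pronunciation.
sub meter_of {
  my ($raw_pron) = @_;
  my $meter = '';
  my $next_stress_flag = '0';
  foreach my $x ($raw_pron =~ m/[',]|[-\&yYaeiouAEIOU\@]+/g) {
    if    ($x eq ',') { $next_stress_flag = '2' }  # secondary
    elsif ($x eq "'") { $next_stress_flag = '1' }  # primary
    else {
      # A vowel cluster: record the pending stress level,
      # then reset the flag to "unstressed".
      $meter .= $next_stress_flag;
      $next_stress_flag = '0';
    }
  }
  return $meter;
}

# Raw entries copied from the excerpt of the original file:
print "accipitrine  ", meter_of(q{/&/k's/I/p/I/tr/I/n}),     "\n";
print "acclamatory  ", meter_of(q{/@/'kl/&/m/@/,t/oU/r/i/}), "\n";
print "acclimation  ", meter_of(q{,/&/kl/@/'m/eI//S//@/n}),  "\n";
```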

What this whole bother then gives us is a mpron.dat file that now
looks like this:

  accipitrine        0100    &ksIpItrIn
  Accius             100     &kSi@s
  acclaim            01      @kleIm
  acclamation        2010    &kl@meIS@n
  acclamation_medal  201010  &kl@meIS@nmEd-l
  acclamatory        01020   @kl&m@toUri
  acclimate          010     @klaImIt
  acclimation        2010    &kl@meIS@n
  acclimation_fever  201010  &kl@meIS@nfiv@r
  acclimatise        0102    @klaIm@taIz
  acclimatize        0102    @klaIm@taIz
  acclivity          0100    @klIvIti

...tab-separated, three fields to a line.  If we merely want to know
the number of syllables in a word, we just count the number of
characters in the second field.  But if we want to know more (like, to
stipulate the stress pattern of those syllables), we have the data to
do that, too.
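To make that concrete, here's a minimal sketch that takes one line in
the three-field layout and pulls a couple of facts out of the meter
field.  The line is built by hand from the "acclamatory" entry in the
listing, and the variable names are only for illustration.

```perl
use strict;
use warnings;

# One line in the layout shown above: word, meter, and
# pronunciation, separated by tabs.
my $line = join("\t", 'acclamatory', '01020', '@kl&m@toUri');

my ($word, $meter, $pron) = split(/\t/, $line, 3);

# One meter character per syllable, so length() is the count.
my $syllables = length($meter);

# tr with an empty replacement list and no /d changes
# nothing; it just counts the characters it would match.
my $stressed = ($meter =~ tr/12//);

print "$word: $syllables syllables, $stressed stressed\n";
```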

Now, to resume, recall that we're looking for a word that meets these
criteria:

* rhymes with "toad",
* is three syllables long,
* and those three syllables have to have the stress pattern B</_/>
(stressed, unstressed, stressed).

We figured out that we could formalize "rhymes with 'toad'" as a
matter of matching the regexp C<m/oUd$/>.  But when it comes to
matching the stress pattern of the word, we're thinking in terms of
stressed and unstressed -- a two-term distinction -- but the data
we've got (here from Moby Pronunciator, but most pronunciation
databases do it this way) represents stress in terms of primary
stress, secondary stress, and unstressed -- a three-term distinction.

After some experimentation, I settled on this as the best way to
reconcile these two systems:

 When I say "stressed":
   I mean having primary ("1") or secondary ("2") stress.
 When I say "unstressed":
   I mean having secondary ("2") stress, or no stress ("0").

So we can now formulate "I want the word to go DUM-duh-DUM" as a
matter of its meter string matching the regexp C</[12][02][12]/>.
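Here's a quick sketch of that regexp at work on some meter strings
taken from the listings in this article.  I've anchored it with C<^>
and C<$> for this example, so that the whole meter string has to fit
the pattern; the hash name is made up.

```perl
use strict;
use warnings;

# Meter strings copied from the listings in this article.
my %meter_of = (
  Accius     => '100',
  acclimate  => '010',
  alamode    => '102',
  discommode => '201',
  Kozhikode  => '101',
);

# "Stressed" is 1 or 2; "unstressed" is 0 or 2.  Note that a
# secondary stress ("2") is allowed to play either role.
my @fits = grep { $meter_of{$_} =~ m/^[12][02][12]$/ }
           sort keys %meter_of;

print "$_ ($meter_of{$_})\n" for @fits;
```

Since "2" sits in both character classes, both C<102> and C<201> pass,
while C<100> and C<010> don't.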

Now, to pull off a search with these criteria, we could go back to our
command-line grep pattern of

  % grep 'oUd$' mpron.dat

and amend it with:

 % grep 'oUd$' mpron.dat | grep '[12][02][12]' | more

but all this grepping is getting rather cumbersome, and won't work
terribly nicely with increasingly complex search patterns.  In the
end, it'd be so much simpler if we just wrote a custom (and therefore
customizable!) search tool in Perl.

=head2 A Simple C<mpron> Searcher

Since there's three fields in our database, it makes sense to be able
to give search criteria for each or any of those three fields.  And
currently, using regular expressions seems the most powerful way to
stipulate search patterns.  So each of our searches could be
thought of as specified by three regular expressions: the first to
match the spelling form of the word (probably not your primary
interest, but it could be useful), the second to match the meter of
the word, and the third to match the representation of the
pronunciation of the word.

So I figure this search tool (which we might as well just call
C<mpron>) could have the command line syntax:

mpron I<spelling_re> I<stress_re> I<pron_re>

...with the assumption that if we stipulate nothing for one or any of
these regexps, then we're not imposing any limitation on that field.
So "rhymes with toad" would be just a matter of:

  mpron  ''  ''  'oUd$'

We can implement this simply with a program consisting of:

  ($word_re, $meter_re, $pron_re) = @ARGV[0,1,2];
  open(IN, '<mpron.dat')
   or die "Can't read-open mpron.dat: $!";

  print  # For the record, note our input
    "# Word RE: <$word_re>  ",
    "Meter RE: <$meter_re>  ",
    "Pron RE: <$pron_re>\n";

  # Then looping over every line
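  while(<IN>) {
    chomp;
    @bits = split("\t", $_, 3);
     # $bits[0] is the word, $bits[1] the
     # meter, $bits[2] the pronunciation
    print $_, "\n" # the matching line
     if
      ...it meets all our criteria...
  }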

{Eds: put that line starting "...it" in italics}

Now, how do we formalize "it meets all our criteria"?  We could just
say:

  if
      $bits[0] =~ m/$word_re/oi
       # /o for "compile this regexp once",
       # /i for case insensitive -- I figure
       #    that'd be useful for just $word_re
   && $bits[1] =~ m/$meter_re/o
   && $bits[2] =~ m/$pron_re/o

but that makes sense only if we've provided all three criteria.  We
don't want to bother trying to match an element of C<@bits> against
the contents of a variable like C<$meter_re> if there's nothing in
that variable (i.e., if the search criterion it corresponds to is no
criterion at all).

So what we mean is that for each kind of test, we want the comparison to
succeed if either 1) there was no search criterion, or 2) there was a
criterion, and it matches.  In terms of logical operators, this is an
"or" relationship.  Specifically,

  pass this test if:
    there was no criterion specified
    OR
    I pass the criterion

Passing each of the three criteria is a matter of matching the
appropriate regexp, as with:

  $bits[1] =~ m/$meter_re/o

As to how to express "there was no criterion specified", we can simply
test the string length of the variable we'd look for the regexp in:

  !length($meter_re)

...which is true when C<$meter_re> holds a zero-length string.  Put it
all together and you get:

  !length($meter_re) || $bits[1] =~ m/$meter_re/o

and, for all the tests put together:

  print $_, "\n" if
     (!length($word_re)  || $bits[0] =~ m/$word_re/oi)
  && (!length($meter_re) || $bits[1] =~ m/$meter_re/o)
  && (!length($pron_re)  || $bits[2] =~ m/$pron_re/o );

Incidentally, you can use the C<and> operator (the low-precedence
variant of C<&&>) to minimize the number of parentheses there, if
you're comfortable doing so:

  print $_, "\n" if
      !length($word_re)  || $bits[0] =~ m/$word_re/oi
  and !length($meter_re) || $bits[1] =~ m/$meter_re/o
  and !length($pron_re)  || $bits[2] =~ m/$pron_re/o ;

And that's all we've got to do to have a fully featured program to
search any of mpron.dat's fields.
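If you want to try the matching logic without having mpron.dat on
hand, here's a self-contained sketch of it.  It's an approximation
under a few assumptions of mine: the test is wrapped in a subroutine
I've called C<line_matches>, the data is a few lines built by hand
from the listings in this article, and I've left off the C</o> switch,
since a subroutine may well get called with a different pattern each
time.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a tab-separated line passes all
# three criteria.  An empty criterion means "no restriction".
sub line_matches {
  my ($line, $word_re, $meter_re, $pron_re) = @_;
  my @bits = split(/\t/, $line, 3);
  my $ok =
       (!length($word_re)  || $bits[0] =~ m/$word_re/i)
    && (!length($meter_re) || $bits[1] =~ m/$meter_re/)
    && (!length($pron_re)  || $bits[2] =~ m/$pron_re/);
  return $ok;
}

# Word, meter, pronunciation: entries from this article.
my @data = map { join("\t", @$_) } (
  ['acclamation', '2010', '&kl@meIS@n'],
  ['acclimate',   '010',  '@klaImIt'],
  ['alamode',     '102',  '&l@moUd'],
  ['episode',     '102',  'EpIsoUd'],
  ['overstowed',  '201',  'oUv@rstoUd'],
);

# Three syllables, DUM-duh-DUM, rhyming with "toad":
my @rhymes = map  { (split /\t/)[0] }
             grep { line_matches($_, '', '^[12][02][12]$', 'oUd$') }
             @data;

print "$_\n" for @rhymes;
```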

So let's put it to work.  Our command line for "find a three-syllable
word, rhyming with 'toad', and having a DUM-duh-DUM stress pattern"
would be simply:

  mpron  ''  '^[12][02][12]$'  'oUd$'

The ^ and $ in C<^[12][02][12]$> are there so that the stress pattern string
must consist I<entirely> of that stress pattern, instead of merely
having that stress pattern in the word somewhere.  So here we go!

  % mpron  ''  '^[12][02][12]$'  'oUd$'
  # Word RE: <>  Meter RE: <^[12][02][12]$>  Pron RE: <oUd$>
  alamode         102     &l@moUd
  antinode        102     &ntInoUd
  antipode        102     &ntIpoUd
  arillode        102     &r@loUd
  autocode        102     At@koUd
  a_la_mode       201     &l@moUd
  calicoed        102     k&l@koUd
  discommode      201     dIsk@moUd
  episode         102     EpIsoUd
  hemipode        102     hEmIpoUd
  incommode       201     Ink@moUd
  internode       102     Int@rnoUd
  keratode        102     kEr@toUd
  Kozhikode       101     koUZIkoUd
  manucode        102     m&nj@koUd
  megapode        102     mEg@poUd
  microcode       102     maIkroUkoUd
  nematode        102     nEm@toUd
  Nesselrode      102     nEs@lroUd
  overstowed      201     oUv@rstoUd
  palinode        102     p&lInoUd
  pigeon-toed     102     pIdZ@ntoUd
  porticoed       102     poUrt@koUd
  staminode       102     st&m@noUd
  superload       102     sup@rloUd
  trematode       102     trEm@toUd
  waggonload      102     w&g@nloUd

Poetry in motion -- or rather, in automation!

=head2 Accommodating Another Notation

One minor quibble, though: it's a bit cumbersome converting our B</_/>
(DUM-duh-DUM) notation into the regexp C<^[12][02][12]$>.  We should
be able to have our program accept that notation.  We can do that by
just adding, very early in our program, some code that would convert
from that notation (if that's what it sees) into regexp notation.
Namely:

  if($meter_re =~ m<^[/_]+$>) {
    # If the string consists entirely of
    #  slashes and underscores...
    $meter_re =~ s</><[12]>g;
    $meter_re =~ s<_><[20]>g;
    $meter_re = '^' . $meter_re . '$';
  }

So this would translate B</_/> to C<^[12][20][12]$> (which matches just
what C<^[12][02][12]$> does) as the second argument, the one for
matching the meter, as you can see here:

  % mpron '' '/_/' 'oUd$' | less
  # Word RE: <>  Meter RE: <^[12][20][12]$>  Pron RE: <oUd$>
  alamode         102     &l@moUd
  ...and so on...

{Eds: bold the ^[12][20][12]$ there}

By the way, if you want B</_/> to mean "ends in DUM-duh-DUM" instead
of specifically "consists entirely of DUM-duh-DUM", then you could
change that last line to this instead:

    $meter_re = $meter_re . '$';
     # no '^' at the beginning
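Both variants can be folded into one small subroutine if you'd like to
experiment with them side by side.  This is only a sketch: the name
C<meter_to_re> and its optional second argument (true for "ends in",
false or missing for "consists entirely of") are my own inventions,
not part of the program above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn slash-and-underscore notation into
# a regexp string; anything else is passed through untouched.
sub meter_to_re {
  my ($meter_re, $ends_only) = @_;
  if ($meter_re =~ m<^[/_]+$>) {
    $meter_re =~ s</><[12]>g;    # stressed
    $meter_re =~ s<_><[20]>g;    # unstressed
    # Leave off the leading '^' for the "ends in" reading.
    $meter_re = ($ends_only ? '' : '^') . $meter_re . '$';
  }
  return $meter_re;
}

print meter_to_re('/_/'),    "\n";   # whole-word reading
print meter_to_re('/_/', 1), "\n";   # "ends in" reading
print meter_to_re('^1'),     "\n";   # already a regexp
```

With the "ends in" reading, C<_/_> would match the meter C<201010>
(the one for "acclamation_medal"), where the whole-word reading would
not.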

The only question left to answer is: what exactly did our poetic toad
gleam and dance like?  No program can tell you which of the twenty-seven
matching words (three-syllable, B</_/>, rhyming with "toad") that we
found is I<le mot juste,> but given the circumstances, the choice is
clear:

  I chanced upon a lovely toad,
  It gleamed and danced like microcode!

 __END__

Sean M. Burke uses Perl and the principles of Vogon poetics to develop
haiku of immense destructive power.

Information on downloading a copy of the free Moby Pronunciator
database is available at
http://www.netadventure.net/~sburke/bounce.cgi/mpron/ along with the
text of the programs described here.

=cut

__END__