Copyright 1999 The Perl Journal. Reprinted with permission.

=head1 Braille Contractions and Regular Expressions

by Sean M. Burke with Sheri Wells-Jensen

This article is about how I used regular expressions to solve a particular problem: how to take common English text and turn it into Braille, correctly using all the possible shorthand-like contractions and abbreviations that Braille provides. This is basically a case study in how a messy problem got solved straightaway with Perl -- but along the way I'll mention some of the new features to be found in Perl 5.005's regular expressions, which I used; and in the end, I find a surprising commonality between regexps and natural language writing systems.

=head2 Braille, and Braille Contractions

When I was a little kid, I had a children's book about Helen Keller. I don't remember reading it; I just remember the back cover, which had the Braille alphabet printed on it -- well, embossed, actually. They had the Roman letter "a" in ink, and then below that the embossed dot pattern for Braille "a", and so on up to "z". So I got the idea that Braille printing is just like a letter-for-letter substitution cipher for the Roman alphabet. Then I started noticing on men's room doors that below "MEN" in big Roman letters, there'd be presumably the same word in Braille -- but sometimes the word would have three Braille characters, and sometimes just two. And I<that> I found perplexing. I couldn't imagine how the word "men" could end up only two characters long. So I asked my friend Sheri, who's been reading and writing Braille since she was a kid, how "men" could be just two characters long. She explained that Braille has contractions: the first character in "men" is "m", and the second character is a contraction for "en".
(For the rest of this article, I'll use wedges to denote contractions, so "<en>" will mean the I<single> Braille character {{ Footnote: Braille readers typically use the term "sign" to mean a Braille character. }} that is a contraction for the two Roman letters "en".) Moreover, some contractions are context-specific -- there's a contraction for "com", but it can apply only at the start of words, as in "computer". These shorthand-like contractions make reading Braille faster, and they make books in Braille less expensive and less bulky. There are contractions for "en", "st", "ea", "er", "the" (the letter sequence, not just the word "the"), and lots more.

It occurred to me that there were cases where I could imagine more than one way to apply contractions. Consider the two ways to encode the word "leather":

  l e a t h e r
  .  \_/ \_/ \_/    -- contract "ther" as "<th><er>"

or

  l e a t h e r
  .  \_/ \___/ .    -- contract "ther" as "<the>r"

In each case, you end up with a word four characters long, so brevity can't be a tie-breaker. I asked Sheri whether both were acceptable spellings -- she said no; the second spelling, with "<the>r", is the only correct way to encode "leather". This reminded me of English hyphenation. For example, you I<could> hyphenate "boggling" as "bogg-ling" (sort of like "gos-ling", maybe), but you could also hyphenate it as "bog-gling", and that seems the I<better> way to do it, so that gets to be the I<correct> way. So I was curious how this would be implemented as an encoding algorithm. Sheri explained that programs do exist for taking conventional English text, and applying all the contractions possible, in the right places -- what I'll just call "Braille encoding". Braille encoding programs can be embedded in tactile displays for computers, so that when I send Sheri email (unencoded), it shows on her display in encoded Braille.
Or a Braille encoder can be a program that encodes a whole file at a time, typically for getting an unencoded text file ready for sending to a Braille embosser, a kind of impact printer. One encoder program is from the National Federation of the Blind, and is called NFBTRANS. {{ Footnote: NFBTRANS is by no means the only such program around. Another widely used program is DBT, the Duxbury Braille Translator. DBT, however, is a commercial product, with no source available. NFBTRANS is free, and comes with source. }} I hunted down a copy of NFBTRANS, and looked at the source code that came with it, and discovered that it was not just a Braille encoder, but that it also had many features like those of a typesetting system like troff. It was about seven thousand lines of C, and in the guts of it, I did manage to find the encoding routines -- and I couldn't make heads or tails of the source for them. NFBTRANS reads the list of contractions from an external data file that comes with the program, but I couldn't figure out exactly how the rules were applied -- namely, in what order or priority. In other words, the rules file would include rules like:

  "the" contracts in any context
  "er" contracts in any context
  "th" contracts in any context

...but how or why would NFBTRANS manage to always contract the end of the word "leather" to "<the>r" and not "<th><er>"? Were there other "-ther-" words that could get "<th><er>" instead of "<the>r"? To fill in the blanks in my understanding of Braille, and to try my hand at modeling a linguistic system in Perl, I decided to try to develop an algorithm in Perl to correctly apply these rules, and to check them against NFBTRANS's output.

=head2 Kinds of Linguistic Rule Systems

I'd had some experience with formal linguistic models of the subsystems that make up natural languages.
Every time I'd seen a linguist create a formal model of a subsystem of a language (whether English phonology, or Hindi syntax, or something less abstract, like hyphenation), the rules that model the system have been in one of the two kinds of frameworks I'll explain below -- either the kind I call a "generativity system", or the kind I call an "optimality system". So I expected Braille encoding to be one of these two sorts of rule systems, and the first task was to figure out which it was.

=head2 Generativity Systems

In a generativity system, there's a list of rules, to be applied in a particular order. Linguists often use this kind of rule system to model phonology, the study of the way the sounds of language are represented in the brain. Phonological models assume that words are stored in a somewhat simplified, abstracted form, and that to work out the details of how these are actually pronounced, we apply a series of rules. For example, a phonological model of my dialect of American English would assume that the "long i" sound is always stored as simply that. But I, like most Americans, pronounce it differently in "kite" than in "fly". To account for this difference, phonological models say that this sound starts out as just itself (here written "/aj/"), but that rules may apply to change it in some contexts. This isn't just about vowels -- the two different-sounding t's in "write" and in "writing" supposedly both start out as just /t/, but a rule happens to change the latter one to sound like a /d/. The rules describing these two phonological changes could look like:

* the diphthong /aj/ (as in "fly") changes to /Uj/ (as in "kite") when it's before /p/, /t/, or /k/.

* the consonant /t/ changes to /d/ when it's between two vowels.
These rules, when applied in that order, are supposed to take words in a representation that doesn't reflect the finer distinctions (like the difference between /aj/ and /Uj/), and change those representations, to flesh them out with just those distinctions. Now, translating rules like these into regular expressions used to be hard, but now, with Perl 5.005 regexps, it's a snap, with "lookbehind" and "lookahead":

  s/aj          # target : the vowel aj
    (?=[ptk])   # following context : p, t, or k
   /Uj/gx;

  s/(?<=[aeiouIUj])   # preceding context : any vowel
    t                 # target : t
    (?=[aeiouIUj])    # following context : any vowel
   /d/gx;

What this gets you over /aj[ptk]/ or /[aeiouIUj]t[aeiouIUj]/ is that with lookbehind (as in "(?<=[aeiouIUj])") or lookahead (as in "(?=[aeiouIUj])"), what matches the lookahead and lookbehind parts of the regexp isn't part of what gets matched by the regexp as a whole. That is, text matching the lookbehind expression is free to precede the "cursor" (the part of the regexp mechanism that pos() reports the position of), and text matching the lookahead expression doesn't move the cursor forward, as normal matching does. (Also, the lookbehind text doesn't end up in C<$&>, which you probably know as the "what matched?" regexp variable, but which you can better think of as "now that we've matched something, what was between where the point started, and where it ended up?".)

Now, the above phonetic rules are greatly simplified for sake of discussion, but you can see their application to "write + ing", starting from its simplified abstract form /rajt + IN/:

  $_ = 'rajtIN';
  print "Input to system: $_\n";
  s/aj (?=[ptk]) /Uj/gx
    && print "R1 applies. Output: $_\n";
  s/(?<=[aeiouIUj]) t (?=[aeiouIUj]) /d/gx
    && print "R2 applies. Output: $_\n";
  print "Output of system: $_\n";

This prints:

  Input to system: rajtIN
  R1 applies. Output: rUjtIN
  R2 applies. Output: rUjdIN
  Output of system: rUjdIN

And this gives the correct pronunciation for "writing" in my dialect, /rUjdIN/. Change the first line to "rajdIN" ("riding"), and neither rule applies, and you get "rajdIN" out. (So, yes, in my dialect, "writing" and "riding" sound different.) And, importantly, if you swap the rules so that R2 applies I<before> R1, you get:

  Input to system: rajtIN
  R2 applies. Output: rajdIN
  Output of system: rajdIN

(So, if I<your> dialect has "writing" and "riding" sounding the same, it might be that rule-ordering is the only difference between your dialect and mine, on this point.) The ordering of the rules in generativity systems is crucial; if you have the right rules in the wrong order, you get the wrong answer. If Braille encoding were a generativity system, I'd need to figure out how to order the rules from the NFBTRANS data table.

=head2 Optimality Systems

If a generativity system is one that gives rules that get you from input to correct output, then an optimality system is one that takes all kinds of even remotely conceivable possible output, and ranks them in order of desirability ("optimality"); then the highest-ranked one is the "correct form" that's the output of the system. (And if it seems to you that generativity is like imperative programming, and optimality is like logical programming -- à la Prolog -- then you're basically right.) The algorithms I've seen that implement English hyphenation are basically optimality systems. I encourage interested readers to look at the hyphenation algorithm in TeX (which you can see reimplemented in Perl in CPAN's TeX::Hyphen), but for sake of discussion, suppose you can model English hyphenation with these rules for ranking candidate forms:

* Hyphenating between consonant letters is good. (As in "gos-ling".)

* Hyphenating between a double consonant is good. (As in "bit-ter".)

* Hyphenating between a consonant and a vowel is bad. (As in "gosl-ing".)
* If hyphenating leaves a word-fragment of just consonants, that's I<really> bad. (As in "g-osling" or "gosli-ng".)

In Perl, you could implement this as:

  use strict;
  my $in = 'boggling';
  my $best = $in;
  my $best_score = 0;
  my $Cons = 'bcdfghjklmnpqrstvwxz';
  my $Vowel = 'aeiouy';
  foreach my $i (1 .. (length($in) - 1)) {
    $_ = $in;
    my $score = 0;
    substr($_, $i, 0) = '-';
    ++$score if /[$Cons]-[$Cons]/oi;
    ++$score if /([$Cons])-\1/oi;
    --$score if /[$Cons]-[$Vowel]/oi;
    $score -= 10 if /^[$Cons]+-/oi || /-[$Cons]+$/oi;
    print " \"$_\" : score: $score\n";
    if($score > $best_score) { $best_score = $score; $best = $_; }
  }
  print "Best: \"$best\" with score of $best_score\n";

The output of this is:

   "b-oggling" : score: -11
   "bo-ggling" : score: 0
   "bog-gling" : score: 2
   "bogg-ling" : score: 1
   "boggl-ing" : score: -1
   "boggli-ng" : score: -10
   "bogglin-g" : score: -9
  Best: "bog-gling" with score of 2

...so these rules seem to work right -- the highest-ranked form ("bog-gling") is the best; and incidentally, the second best ("bogg-ling") is not too bad, and from there on out it's all quite bad ("bo-ggling", "boggli-ng", etc.). Note that it doesn't matter here what I<order> you apply the rules in; but it does matter what I<weight> gets attached to each rule. In the above example, I've kept it simple, but suppose we now add a rule that means "the closer to the middle of the word you hyphenate, the better", such as:

  $score += .5 * (length($in) - abs($i - length($in) / 2));

If we leave that weighting constant at .5, you still get "bog-gling" coming out on top. Change it to a 1, and we have a tie between "bog-gling" and "bogg-ling", since the point that "bog-gling" gets for hyphenating between a double consonant is offset by the point that it loses for the hyphen not being at the exact middle of the word. And if we change the constant to 1.5, we get "bogg-ling" coming out on top.
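To see the effect of that weighting constant without editing the program each time, here's a small variation (my own sketch, not part of the original scorer) that takes the constant as a parameter and returns every candidate tied for the top score -- so the 0.5 / 1 / 1.5 behavior described above can be checked directly:

```perl
use strict;
use warnings;

# A parameterized version of the hyphenation scorer above (my own
# sketch): same four ranking rules, plus the midpoint-weighting rule
# with its constant passed in as an argument.
my $Cons  = 'bcdfghjklmnpqrstvwxz';
my $Vowel = 'aeiouy';

sub best_hyphenations {
    my ($in, $mid_weight) = @_;
    my %score_for;
    foreach my $i (1 .. length($in) - 1) {
        local $_ = $in;
        substr($_, $i, 0) = '-';
        my $score = 0;
        ++$score if /[$Cons]-[$Cons]/i;       # between consonants: good
        ++$score if /([$Cons])-\1/i;          # between a double consonant: good
        --$score if /[$Cons]-[$Vowel]/i;      # consonant-vowel: bad
        $score -= 10 if /^[$Cons]+-/i || /-[$Cons]+$/i;  # all-consonant fragment: very bad
        # "the closer to the middle of the word, the better":
        $score += $mid_weight * (length($in) - abs($i - length($in) / 2));
        $score_for{$_} = $score;
    }
    my ($top) = sort { $b <=> $a } values %score_for;
    # Return every candidate tied for the top score.
    return sort grep { $score_for{$_} == $top } keys %score_for;
}

print join(', ', best_hyphenations('boggling', 0.5)), "\n";  # bog-gling
print join(', ', best_hyphenations('boggling', 1)),   "\n";  # bog-gling, bogg-ling
print join(', ', best_hyphenations('boggling', 1.5)), "\n";  # bogg-ling
```

With the constant at 1, "bog-gling" (base score 2, midpoint bonus 7) and "bogg-ling" (base score 1, midpoint bonus 8) both come out at 9 -- the tie described above.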
If NFBTRANS's rules somehow interacted with an optimality system with ranking-rules like "give K points for every letter this contraction saves", or possibly "give (or subtract?) L points if this word ends up with contractions next to each other", or "subtract a half-point if the word ends in a contraction", then I'd need to first puzzle out what these ranking-rules were, and then figure out what their values for K and L were.

=head2 Regexp Replacement as a First Hack

I did intuitively feel that Braille encoding somehow had a lot in common with hyphenation (except that it was about contracting letters instead of sticking hyphens in between them), suggesting that optimality was at least part of the system. On the other hand, optimality systems often do have some generative component to them (since you need to generate I<some> candidates to apply the ranking-rules to), so I figured that whether the real answer would eventually be in a generativity system or an optimality system, I'd probably have to work up something generativity-based. So I decided to take the rules from NFBTRANS's rules file, and cook them up into a regular expression, and use that to perform all the substitutions in one pass with a single C<s///eg> replacement, an approach I borrowed from Kevin Lenzo's article in TPJ #12. I figured that this approach would by no means behave correctly, but that the cases where it didn't behave correctly would give me some kind of hint as to whether a generativity system or an optimality system was called for, and then from there I could start worrying about the ordering or weighting of rules. Granted, next to either generativity systems or optimality systems, a big one-pass regular expression replacement seems pretty strange, but Perl makes regexps so easy that it seemed the path of least resistance for a first hack.
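Here's a minimal sketch of that first hack (my own toy rule set of six contractions -- the parenthesized replacements like "(the)" are placeholders standing in for the single Braille characters written as wedges in this article; the real table comes from NFBTRANS's rules file):

```perl
use strict;
use warnings;

# Toy contraction table; "(the)" etc. are placeholders for single
# Braille characters (assumption for illustration -- not NFBTRANS data).
my %contraction_for = (
    'the' => '(the)',
    'ear' => 'e(ar)',
    'ea'  => '(ea)',
    'er'  => '(er)',
    'th'  => '(th)',
    'ar'  => '(ar)',
);

# Longest literals first, so "the" gets a chance before "th",
# and "ear" before "ea".
my $alternation = join '|',
    sort { length($b) <=> length($a) } keys %contraction_for;
my $re = qr/($alternation)/;

sub encode_word {
    my $word = shift;
    $word =~ s/$re/$contraction_for{$1}/eg;  # the whole job, in one pass
    return $word;
}

print encode_word('leather'), "\n";  # l(ea)(the)r
print encode_word('heart'),   "\n";  # he(ar)t
```

Even this toy version gets "leather" right -- it produces the "<ea>" + "<the>" spelling, not "<th>" + "<er>" -- purely because of how a one-pass, leftmost, first-alternate-wins scan works.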
=head2 Contexts in Regular Expressions

Suppose the rule file consists of:

  "the" contracts to "<the>" in any context
  "er" contracts to "<er>" in any context
  "th" contracts to "<th>" in any context
  "ea" contracts to "<ea>" only in the middle of words
  "ar" contracts to "<ar>" in any context
  "ear" contracts to "e<ar>" in any context

Now, a regexp for "'the' in any context" would be a simple C<the>. But how to express "'ea' only in the middle of words"? You could do it with something like C<\wea\w> to assert that there be a word character before and after the "ea", but Perl already provides a metacharacter for matching a word boundary (C<\b>), or the absence of one (C<\B>). In other words, what's needed here is the absence of word boundaries before and after the "ea", and you can do that with simply C<\Bea\B>. Translating the above mini-list of Braille contraction rules into a regexp gives:

  %contraction_for = (
    'the' => '<the>',  # Remember: I'm using "<the>" here
                       # in place of the Braille ASCII, or Unicode,
                       # code for the single Braille character
                       # that contracts "the"
    'er'  => '<er>',
    'th'  => '<th>',
    'ea'  => '<ea>',
    'ar'  => '<ar>',
    'ear' => 'e<ar>',
  );
  s/(the|ear|ar|\Bea\B|er|th)/$contraction_for{$1}/eg;

Now, notice that I moved the longer strings to the start of the regexp. This is a crucial point in understanding Perl's implementation of regular expressions: given alternates, Perl will match the first one it can match, regardless of whether there may be other, later alternates that are longer. So if "C<th>" came before "C<the>" in that regexp, the "C<th>" would always match, never letting "C<the>" match. Moreover, if there were a "C<\Bea\B>" and an "C<ea>" in the regexp, and the "C<ea>" came first, it would block the "C<\Bea\B>" from ever matching. So, in producing this regular expression, I had to make sure that the alternates started out with the longest match strings, in the most specific contexts (like "C<\Bfoo\B>"), and worked their way down to the shortest match strings, in the most general contexts (like just plain "C<foo>").
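That first-alternate-wins behavior is easy to demonstrate in a couple of lines (a sketch, using "then" as the test string):

```perl
use strict;
use warnings;

# Perl tries alternates left to right and takes the first one that can
# match at a given position -- not the longest one.
my $word = 'then';

my ($short_first) = $word =~ /(th|the)/;  # "th" is tried first, and matches
my ($long_first)  = $word =~ /(the|th)/;  # "the" is tried first, and matches

print "$short_first\n";  # prints "th"
print "$long_first\n";   # prints "the"
```

So the same string matched against the same two alternates captures a different substring depending purely on the order of the alternates.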
This was a simple matter of a Schwartzian transform, where the sorter function looked like:

  ... sort {
        length($b->[0]) <=> length($a->[0])
          # sort first by length of the literal
        or
        $b->[1] <=> $a->[1]
          # then by context "precedence", where "\B_" gets
          # a higher "precedence" score than plain "_"
      } ...

By this time, my code was about 180 lines for reading the contraction rules file and making a regexp out of it, and only about 20 lines to perform the contractions, basically as a wrapper around this:

  $word =~ s/($my_re)/$contraction_for{$1}/oeg;

=head2 Embedding Code in Regular Expressions

I'd forgotten about one thing -- there could be rules that say something like:

  "foo" contracts to "<1>" at word-start
  "foo" contracts to "<2>" elsewhere

Now, you could model this pair of (fictitious!) rules with a regexp that contains C<(...|\bfoo|foo|...)>, but then whether C<\bfoo> or C<foo> matches, you end up simply with C<$1> holding C<"foo">. In other words, while you can use "C<\b>" or real lookbehind/lookahead to select for certain contexts, you don't end up knowing which context actually matched. However, a new feature in Perl 5.005 is exactly what I needed here. The feature is still somewhat experimental, and a full discussion of it would require an article of its own; but to make a long story short, having C<(?{ CODE })> in a regular expression will cause I<CODE> (any Perl code) to be evaluated whenever the regular expression engine hits that point in trying to match the regular expression.
So, to differentiate between "C<\bfoo>" and "C<foo>" matching, I add to the regexp a bit of code to be executed at the end of each:

  $word =~ s/
    (
      \bfoo (?{ $x = 1 })
      |
      foo   (?{ $x = 2 })
    )
  /$contract[$x]{$1}/xeg;

Here, if "C<\bfoo>" is the bit of the regexp that matches, the last thing that the match on it does is to execute the code C<$x=1> (I'm just using $x as a flag), so then the string that got matched (say, the "foo" in "foolish") gets replaced with the value of C<$contract[1]{'foo'}>. But if it's the second bit of the regexp ("C<foo>") that matches (as in the "foo" in "buffoon"), C<$x=2> will get executed, and so "foo" will get replaced with the value of C<$contract[2]{'foo'}>. This array-of-hashes C<@contract> (accessed like C<$contract[I<context>]{I<literal>}>) will have been filled by the same part of the program that read the rules in and created the regular expression.

=head2 Rules as Exceptions

Most linguistic models deal with exceptional cases (like, say, irregular verbs) by simply exempting certain forms from the rules. However, NFBTRANS's rules table knows only rules and more rules, where more specific rules stop more general rules from applying. For example, given:

  "the" contracts to "<the>" in any context
  "er" contracts to "<er>" in any context
  "th" contracts to "<th>" in any context
  "ea" contracts to "<ea>" in the middle of words
  "ar" contracts to "<ar>" in any context
  "ear" contracts to "e<ar>" in any context

the rule that replaces "the" with the single character "<the>" blocks the more general rule of replacing "th" with "<th>" from applying. Likewise, the rule that targets "ear" blocks the rule that targets "ea" from applying -- which is why "heart" contracts to "he<ar>t", not "h<ea>rt". Now, for some reason, it's incorrect to contract the "ear" in "tearoom" at all. This I<could> be modelled by saying that "tearoom" is on a list of exception words that are exempt from contractions, but then you have I<two> kinds of data in the system -- the rules table, and the list of exceptions.
The way NFBTRANS models this is to simply add another rule:

  "tearoom" contracts to "tearoom" in any context

A rule that replaces a word with itself seems entirely pointless, but the point of it is to block the more general rules. {{ Footnote: Actually, the rule says "replace 'tearo' with 'tearo' in any context", not specifying the whole word -- but I think the people who thought up that rule thought of "tearoom" as the only word it could apply to, and were not aware of the rarer "-tearo-" words: "Aotearoa" or "stearolactone". }} The NFBTRANS rules file basically consists of:

* a few dozen general rules that implement the regular contractions, like "ea",

* a few dozen more rules that implement more specific (yet still pretty general) contractions like "ear",

* about a hundred rules for whole-word abbreviations like "qk" for "quick", and then

* about a I<thousand> rules that are there simply to block the more general rules.

The above "tearoom" rule is an example that happens to block I<all> contractions for that word, but the average rule is meant to stop only I<some> of the contractions from applying. For example, applying the normal contraction rules to "pineapple" would contract it as "p<in><ea>pple". However, this spelling is considered incorrect, so there's a rule specifically to fix this with the correct spelling:

  "pineapple" contracts to "p<in>eapple" in any context

And so on for "phoenix" (to block the "en" from contracting), "coworker" (to block the "ow" from contracting, lest they become cow orkers), and all the way up to the longest rule, which replaces "psychedelic" with "psy<ch>edelic" (blocking the "<ed>"). Incidentally, as with exceptions in hyphenation, most of these exceptions to the normal Braille contraction rules are motivated by wanting not to contract (or not to hyphenate) between letters that belong to different syllables or to different morphemes (roots, prefixes, or suffixes, as in "tea-room", "co-worker", etc.).
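You can watch a self-replacing rule do its blocking in miniature (again a sketch, with toy rules and parenthesized placeholders for the Braille characters): because the alternates are sorted longest-first, "tearo" matches before "ear" ever gets a chance, replaces itself with itself, and the scan moves on past it.

```perl
use strict;
use warnings;

# Build an encoder closure from a rule table, longest literals first --
# the same one-pass approach as earlier, with a toy rule set (placeholders,
# not real NFBTRANS data).
sub make_encoder {
    my %rule = @_;
    my $alternation = join '|',
        sort { length($b) <=> length($a) } keys %rule;
    return sub {
        my $word = shift;
        $word =~ s/($alternation)/$rule{$1}/eg;
        return $word;
    };
}

my %general = ('ear' => 'e(ar)', 'ea' => '(ea)', 'ar' => '(ar)');

my $without_block = make_encoder(%general);
my $with_block    = make_encoder(%general, 'tearo' => 'tearo');

print $without_block->('tearoom'), "\n";  # te(ar)oom
print $with_block->('tearoom'),    "\n";  # tearoom -- left alone
```

So the "exception" needs no separate mechanism at all: it's just one more alternate, and the longest-first ordering does the rest.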
But the "ea" in "create" I<does> contract to the single "<ea>" character, even though it's clearly across two syllables. Since the conventions for Braille contractions evolved naturally -- just as general English spelling did -- it's full of patterns, and messy exceptions, and messy exceptions to the messy exceptions.

=head2 Testing It

When I wrote the code that generated the big regular expression to match everything that could be contracted, I initially fed it just a few rules, then a few dozen, and when I did feed it the whole rule file, it ended up making a single regular expression 14KB long. The longest regexp I'd ever heard of anyone ever compiling was the 6.5KB regexp for matching RFC822-valid email addresses (in Appendix B of Jeffrey Friedl's I<Mastering Regular Expressions>), and this was more than twice as long -- worrisomely long. I anticipated a screaming "OUT OF MEMORY" error, or that the regexp would take two minutes of swapping to compile. But it compiled imperceptibly fast, and without errors. I took the "UNIX wordlist" (a motley wordlist of 25,000 English words, one per line), and encoded it with a copy of NFBTRANS, so that I could compare my algorithm's output to it. I wrote a small program to use my algorithm to encode the wordlist, print all cases where my algorithm's output disagreed with the output of the inscrutable NFBTRANS algorithm, and then report my algorithm's accuracy. Guessing wildly, I expected no better than 50% similarity between the output of my algorithm and the output of NFBTRANS.

=head2 It's Alive -- Alive!

My first-hack one-pass regexp algorithm turned out to agree with NFBTRANS 99.59% of the time. I was boggled that it was over 50%, much less over 99%. Most of the cases of disagreement were words like "R&D" or "O'Leary" -- since this was a first hack, I had not bothered to properly deal with words with internal punctuation or complex capitalization, which require somewhat special treatment in Braille encoding.
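The comparison program boils down to something like this sketch (the two word lists here are toy stand-ins with placeholder encodings; the real run compared the two encoders' output over the whole 25,000-word list):

```perl
use strict;
use warnings;

# Given parallel lists of my encoder's output and NFBTRANS's output for
# the same word list, collect the disagreements and compute percent
# agreement. (The sample data below is invented for illustration.)
sub agreement {
    my ($mine, $theirs) = @_;  # two refs to parallel arrays
    my $same = 0;
    my @disagreements;
    for my $i (0 .. $#$mine) {
        if ($mine->[$i] eq $theirs->[$i]) {
            ++$same;
        } else {
            push @disagreements, "$mine->[$i] vs $theirs->[$i]";
        }
    }
    return ($same / @$mine * 100, \@disagreements);
}

my @mine   = ('l(ea)(the)r', 'he(ar)t', 'r&d');
my @theirs = ('l(ea)(the)r', 'he(ar)t', ',r&,d');  # capitalization handled differently

my ($percent, $diffs) = agreement(\@mine, \@theirs);
printf "%.2f%% agreement\n", $percent;
print "  $_\n" for @$diffs;
```

Run over the real wordlist, the reported figure was the 99.59% mentioned above, with the disagreement list dominated by exactly these punctuation-and-capitalization words.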
The remaining dozen words where my algorithm encoded differently than NFBTRANS were because I'd misimplemented the regexp context for one of the more obscure rules. (And incidentally, I've only implemented my encoder for words I<in isolation> -- in encoding of actual sentences, there's punctuation to deal with; and a few Braille contractions in words are sensitive to the context of the word, like whether there's punctuation after the word, which makes things a bit messier.) This one-pass regexp approach was meant to be just a stab in the dark, and I expected to use the list of its failures to refine the algorithm -- say, developing a generativity system that would make one pass to handle word-initial contractions, then another to handle word-final contractions, then a final pass for everything left over in between. Or maybe the answer would lie in an optimality system that would consider all I<possible> ways to contract a word, and select the one that had the fewest characters, maybe the fewest dots total, maybe the most contractions closest to the beginning or end of the word, or the like. But it was nothing so complicated -- this big regexp, in one pass, did the whole job. Not only did this seem much too easy (although this is a not unfamiliar sensation in programming Perl), but it was quite unprecedented -- I'd never heard of a phenomenon in natural language that could be treated with a regular expression. I'd always considered regular expressions to be essentially abstract and mathematical. After all, the formalism of regular expressions was devised just this century, coming out of the work of mathematician Stephen Kleene; and the workings of the regexp engine (as in Mark-Jason Dominus's article in TPJ #9) are explained in terms of automata.
And whatever an automaton is (a big scary robot?), it sure doesn't sound like anything to do with something that people do unconsciously when they write Braille. (And, incidentally, people have been writing English in Braille this way for decades longer than regexps have been around.) But in spite of the very formalist origins of regular expressions, I think that the basic mechanism underlying regexp replacement -- scanning left to right, matching as much as you can, context permitting -- is not so unnatural and abstract. That's what we do when we use systems of contraction or abbreviation, whether in Braille, Gregg shorthand, or our own idiosyncratic abbreviations: we start at the beginning of the word (as opposed to inking in the end, and working backwards from there), and we abbreviate as much, and as early, as possible. Granted, Perl regexps have plenty of features that have no parallel in this process of Braille contraction (backtracking, notably). But there's the commonality, in that what's called "greedy matching" for regexps is effectively the same as what we do when we abbreviate or contract, because of a mix of economy (wanting to save space) and laziness (not having to operate on the whole word at a time, or to consider lots of alternatives, but just contracting in one pass, as you write, starting at the beginning of the word). Perl programmers know from experience that regexps are immensely useful, and my experience with Braille encoding is just another case of that. But the fact that regexps have a basic commonality with a linguistic phenomenon like Braille encoding suggests that not only are regexps a useful tool, but that they're also linguistically natural and, hopefully, more learnable and valuable because of it.

__END__

Sean M. Burke (C) likes making linguists think he's a programmer, and programmers think he's a linguist.
Sheri Wells-Jensen (C) is a linguist who used to be a programmer, way back when everyone thought FORTH was a pretty neat idea.

(Sean adds: thanks to Tack-Tiles (tack-tiles.com) for lending me a set of the Braille blocks they make, and thanks to David Ondrik for taking the picture of them. And I'd also like to thank the conferees at this year's YAPC for their encouraging response to my talk on Braille encoding with regular expressions.)

The final Braille-encoding code is available, along with some other Braille links, at C

----

Caption for the photo:

Alphabet blocks for Braille! The text spelled out is: "The PerlEnumber>5 pilEer> r lEin>e 850 diEed>." [I think that's what it says... I'll have to check.]

----

Text for the box:

=head2 Sample Text in Braille

The first paragraph of Chapter 3 of Sinclair Lewis's I<Babbitt>:

It was the best of nationally advertised and quantitatively produced alarm-clocks, with all modern attachments, including cathedral chime, intermittent alarm, and a phosphorescent dial. Babbitt was proud of being awakened by such a rich device. Socially it was almost as creditable as buying expensive cord tires.

[GIF HERE: from http://www.speech.cs.cmu.edu/~sburke/braille/babbit.gif ]

Under each Braille line is the "Braille ASCII" version -- simply the Braille characters mapped onto 7-bit ASCII codes. You'll notice that the character mapped to "5" is the contraction "<en>"; the character mapped to "/" is the contraction "<st>"; and the character mapped to "," means "the next word is capitalized" when it precedes a word, but on the inside of words, it's a prefix for particular abbreviations (e.g., it plus "y" means "-ally"). The e-text of all of I<Babbitt> is 717K unencoded (ASCII), and 581K encoded (grade-2 Braille, in Braille ASCII) -- a size savings of 19%.