Simulating Typos with Perl

Sean M. Burke
TPJ Issue #20

Quoth the raven, "Nwvermpre!"

About two years ago, I switched to typing on the Dvorak keymap. That meant going from the Sholes "QWERTY" keymap:

 ` 1 2 3 4 5 6       7 8 9 0 - = \
    q w e r t       y u i o p [ ]
     a s d f g       h j k l ; '
      z x c v b       n m , . /

to August Dvorak's more efficiency-minded keymap:

 ` 1 2 3 4 5 6       7 8 9 0 [ ] \
    ' , . p y       f g c r l / =
     a o e u i       d h t n s -
      ; q j k x       b m w v z

It was just a matter of switching the keymap preferences on whatever computers I had to type on, and then a few days of acclimating to all the keys having moved. This had the two desired effects: my hands would no longer ache after marathon coding sessions, and no one ever touched my computer again.

But there was one side effect I hadn't anticipated: a different keymap means different kinds of typos. This became evident to me first on IRC -- since IRC is a medium characterized by people typing faster than they can think, typos abound:

  <Wuglife> I hear it's out on video now
  me> I know, I sow it a wook age.
  <Wuglife> sow?
  <Koolmodey> wook age?
  me> I mean I sAw it a wEEk agO.
  <Koolmodey> guh, how do you manage to aim for
     'e' and hit 'o' instead?  they're on
     different sides of the keyboard
  me> They're right next to eachother on mine.
     I use a Dvorak keyboard.  The middle row
     goes: "aoeuidhtns".
  <Koolmodey> that's because you're a communist
  me> columnist
  <Koolmodey> yea like dvorak
  me> different Dvorak.  August, not John.
  <Mugsy> whatEVERRRR
  <Wuglife> i like pie

Over time I did get the feeling that typos on a Dvorak keyboard were consistently different. At least for me, the typos I'd made on QWERTY keyboards were either transposition ("hte" for "the") or hitting a key adjacent to the intended one. On a Dvorak, transposition errors are more or less the same, but adjacent-key errors are, naturally, rather different -- if you miss to the left or right of a QWERTY "e", you hit "w" or "r", but to the left and right of a Dvorak "e" are "o" and "u". So the equivalents of "fwlt" or "frlt" on a QWERTY become "folt" or "fult" on a Dvorak.

I had the feeling that Dvorak typos were, on the whole, much less likely to "look like typos", compared to QWERTY typos. Whereas "fwlt" and "frlt" couldn't possibly be words, "folt" and "fult" look like plausible words that happen not to exist. And sometimes the typo does make for an existing word -- one off from "seen" is "soon", one off from "be" is "me", and so on. This isn't something completely exclusive to a Dvorak -- on a QWERTY, "fear" and "dear" are just one key off -- but I had a feeling it was happening much more frequently with the Dvorak.

Now, looking at the keymap, it stands to reason -- but then, lots of things stand to reason that don't actually happen (like, say, everyone abandoning the QWERTY keymap, or having done so decades ago). So I decided that the best way to test this would be to write some sort of program to simulate typos on a Dvorak and on a QWERTY, have it generate lots and lots of typos, and see what the results would be.

First off, this might tell me whether I was just imagining things, or whether this was a measurable (and simulatable) property of typing on a Dvorak versus typing on a QWERTY. Moreover, the code developed could be of use in catching common typos -- a capability important in spelling-correction algorithms, whether in actual spellcheckers or programs that, given a failed URL or email address, can suggest to the user an alternative. More perversely, one could use typo-simulating code to lend a hint of authenticity to a chatbot (see the TPJ #9 article "Chatbot::Eliza" by John Nolan and the TPJ #10 article "Infobots and Purl" by Kevin Lenzo).

Simulating the Typos

For the sake of simplicity, I figured I'd model the kind of typo I make most: trying to hit one key, but hitting the key to its left or right instead. And since most of the keys I hit are letters, I decided to ignore typos on other keys, like hitting "%" instead of "$", and even shift typos such as typing "THe" for "The".

The first thing any typo-simulating program needs to know is which keys are next to which. So the first thing I wrote was a data table for the keys, @rows, and then a bit of code to expand that into two hashes, %Left and %Right:

  use strict;
  my @rows;
  if (1) {   # change to 0 to get qwerty.
      @rows = (
        # Yes, I use a split keyboard...
        "    py  fgcrl ",
        " aoeui  dhtns ",
        "  qjkx  bmwvz ",
      );
  } else {
      @rows = (
        " qwert  yuiop ",
        " asdfg  hjkl  ",
        " zxcvb  nm    ",
      );
  }

  # To simulate an un-split keyboard:
  #  for (@rows) { substr($_,6,2) = '' }

  my (%Left, %Right);
   # So $Left{$x} is what letter, if any,
   # to the left of the letter $x.
  foreach my $r (@rows) {
      for (my $i = 1; $i < length $r; ++$i) {
           my $x = substr($r,$i,1);
           next unless $x =~ m/[a-z]/;
           $Left{$x}  = substr($r,$i - 1,1)
             unless substr($r,$i - 1,1) eq ' ';
           $Right{$x} = substr($r,$i + 1,1)
             unless substr($r,$i + 1,1) eq ' ';
      }
  }
  # And add the uppercase letters:
  %Left  = (%Left,  map uc($_), %Left);
  %Right = (%Right, map uc($_), %Right);

Then, after some tinkering, I came up with a function that, given a word, would try to think of some way to make a typo in it:

  sub typo_on_word {
      my $word = $_[0];
      my $typo_word;
      my $tries = 0;
    Make_typo: {
      if (++$tries > 4) {
          # after too many do-overs, give up
          $typo_word = $word;
          last Make_typo;
      }
      my @strokes = stroke_groups($word);
      my $where = int rand @strokes;
      my $char = substr($strokes[$where],0,1);
      my $instead = (rand(1) < .5)
        ? ($Left{$char}  || $Right{$char} || redo)
        : ($Right{$char} || $Left{$char}  || redo);
      $strokes[$where] = $instead
                         x length $strokes[$where];
       # So 'e' => 'r' or 'w', 'ee' => 'rr' or 'ww'
      $typo_word = join '', @strokes;
      redo Make_typo unless rep_pattern($word)
        eq rep_pattern($typo_word);

      # That's so that we don't create any stroke
      # groups that weren't there before, as in
      # turning "soar" into "soor", which is a
      # kind of mistake that I rarely if ever make.
    }
    return $typo_word;
  }

  sub stroke_groups {
      #  'eat'  => qw(e a t)
      #  'eel'  => qw(ee l)
      #  'fool' => qw(f oo l)
      my @out;
      while ($_[0] =~ m<(.)(\1*)>g) {
          push @out, $1 . $2;
      }
      return @out;
  }

  sub rep_pattern {
      #  'eat'  => '1_1_1'
      #  'eel'  => '2_1'
      #  'fool' => '1_2_1'
      join '_',
        map length($_),
          stroke_groups($_[0]);
  }

Now, there's a lot going on here, so I'll break it down: every word is seen as an array of stroke groups -- where each stroke group is a character plus any immediately following repetitions of itself. So "cat" is three stroke groups, "c", "a", and "t"; and "food", despite its four letters, is also three: "f", "oo", and "d".

Modeling things based on stroke groups captures the fact that if I miss the first "o" in "food", I'm also going to miss the following "o" the same way. And it also captures the fact that I wouldn't make a typo that would create a new stroke group -- while I could mistype "pen" as "pes", I would not mistype "pens" as "pess" or "penn". So if the typo-generating code tried doing exactly that, turning "pens" into "pess", then rep_pattern($word) eq rep_pattern($typo_word) would be false (rep_pattern of "pens" is "1_1_1_1" but rep_pattern of "pess" is "1_1_2"), and the redo would start the block over. (Yes, you can have redos and lasts in non-loop blocks!)
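To see the stroke-group check in isolation, here is a small self-contained sketch. It restates the two helper routines from the listing above, slightly reworded so that it runs on its own, and prints the groups and pattern for a few of the words just discussed.

```perl
use strict;
use warnings;

# Split a word into stroke groups: a character plus its immediate repeats.
sub stroke_groups {
    my ($word) = @_;
    my @out;
    while ($word =~ m/((.)\2*)/g) {
        push @out, $1;
    }
    return @out;
}

# The lengths of the stroke groups, joined with underscores.
sub rep_pattern {
    return join '_', map { length } stroke_groups($_[0]);
}

for my $w (qw(cat food pens pess)) {
    printf "%-5s => %-8s %s\n", $w,
        join(' ', stroke_groups($w)), rep_pattern($w);
}
```

Since "pens" gives 1_1_1_1 and "pess" gives 1_1_2, the two patterns differ, and that mismatch is what sends typo_on_word back for another try.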

So if we use the above subroutines and then try:

  for (1..15) {
      print typo_on_word("nevermore"), " ";
  }

Run with the Dvorak keymap, you'll get output like this:

  nevecmore nevelmore nuvermore nevermoro severmore
  neverbore nevermole nevurmore nevurmore novermore
  nevecmore nevermare nevermoru nevormore nevermare

And with a QWERTY keymap,

  mevermore nevermote nwvermore nevermorw nevernore
  nevermpre nevermorw nebermore nevwrmore nevwrmore
  nrvermore nwvermore mevermore nevermorw nevermire

Now, these look to me like plausible typos of the sort I've made on Dvoraks and QWERTYs. This is not to say that every possible typo I'd make would be generated by the above typo_on_word function. For example, typo_on_word doesn't attempt to simulate transposition, as in "hten" for "then". Moreover, it fails to account for the fact that I now and then make typos like "moro" for "mere" -- where, in effect, "e-e" functions as a sort of stroke group, because the left hand never leaves its key, regardless of the fact that the right hand is meanwhile off hitting the "r".
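For comparison, transposition is easy enough to sketch on its own. The following is only an illustration and not part of the program described in this article: transpose_at and transpose_typo are hypothetical names, and the choice of a uniformly random position is an assumption rather than a model of how I actually type.

```perl
use strict;
use warnings;

# Hypothetical helper: swap the characters at positions $i and $i + 1.
sub transpose_at {
    my ($word, $i) = @_;
    return $word if $i < 0 || $i > length($word) - 2;
    my $typo = $word;
    substr($typo, $i, 2) = scalar reverse substr($word, $i, 2);
    return $typo;
}

# Hypothetical helper: transpose one randomly chosen adjacent pair.
# If the pair happens to be a doubled letter, the word comes back unchanged.
sub transpose_typo {
    my ($word) = @_;
    return $word if length($word) < 2;
    return transpose_at($word, int rand(length($word) - 1));
}

print transpose_at("then", 0), "\n";   # hten
print transpose_at("the",  0), "\n";   # hte
print transpose_typo("nevermore"), "\n";
```

Note that the result does not depend on the keymap at all, which is why this kind of error says nothing about Dvorak versus QWERTY.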

But there are diminishing returns to this. I think that if I wrote a function that modeled every kind of typo I make, with the appropriate frequency, it alone would be longer than this article, if not the entire issue, but wouldn't be vastly more realistic than what I hacked together. The exhaustive and exhausting detail that Dvorak's book Typewriting Behavior goes into certainly convinced me of the fact that errors are not simple things. However, typo_on_word does simulate most of the sorts of typos I do make, on each kind of keyboard.

And notice that most of the simulated Dvorak typos for "nevermore" look more or less like plausible (if not actually existing) English words to me, whereas most of the QWERTY typos contain character sequences that no English word could contain, like "nwv", "vwrm", and so on.

How to Tell a Word

Being able to say that the string "tevermore" could be an existing word but "nevwrmore" couldn't be (and maybe that "nevecmore" and "nevermoru" sort-of could be) is something we can do intuitively based on some pretty complex implicit knowledge about how letters (and, at another level, sounds) can co-occur in English. Expressing that knowledge and then teaching it to a computer would be pretty difficult.

However, it's possible to teach the computer to acquire, on its own, a simple model of letter co-occurrence.

Consider the word "nevermore" as a sequence of overlapping three-character sequences, including, for good measure, enclosing brackets to stand for the word boundaries:

  [ne nev eve ver erm rmo mor ore re]

If we scan a large amount of existing and presumably typo-free text (a corpus), and look at all such three-character clusters (trigraphs), then we'll be able to scrutinize the simulated typo "nevwrmore", and we'll see that it consists of never-before-seen clusters like "evw", "vwr", and "wrm". Then we can note that it's got three things wrong with it, which makes it rather implausible as a word.
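As a minimal sketch of that idea, the snippet below splits a bracketed word into its clusters and then lists which clusters of "nevwrmore" are new. The sub name clusters is made up for this illustration, and the clusters of the single word "nevermore" stand in for a real corpus.

```perl
use strict;
use warnings;

# Return the overlapping three-character clusters of a word,
# with "[" and "]" marking the word boundaries.
sub clusters {
    my $w = lc "[$_[0]]";
    return map { substr($w, $_, 3) } 0 .. length($w) - 3;
}

print join(' ', clusters('nevermore')), "\n";
# [ne nev eve ver erm rmo mor ore re]

# Stand-in for a corpus: the clusters of the correctly typed word.
my %seen = map { $_ => 1 } clusters('nevermore');
my @new  = grep { !$seen{$_} } clusters('nevwrmore');
print "@new\n";
# evw vwr wrm
```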

First, to build the frequency table:

  my $text = 'babbitt.txt';
  open(TEXT, "<$text")
    or die "Can't read-open $text: $!";
  my %Known_clusters;
  while (<TEXT>) {
      my @words = words_in($_);
      foreach my $w (@words) {
          $w = lc "[$w]";
          for (my $i = 0; $i < length($w) - 2; ++$i) {
              ++$Known_clusters{substr $w, $i, 3};
          }
      }
  }
  close(TEXT);

  sub words_in {
      # The leading space lets \s match before the first word
      return " $_[0]" =~
        m/\s([a-zA-Z]+[a-zA-Z']*)(?=[\s,.])/g ;
      # See perlfaq6 for more on matching words
  }

This builds a hash, %Known_clusters, where the keys are all the three-letter clusters in all the words in a file, and each value is the number of times that cluster was seen. The file I happen to be using is a 700K text file comprising Sinclair Lewis's novel Babbitt, available from Project Gutenberg.

We can test whether a cluster occurred in the text by just testing exists $Known_clusters{$cluster} -- and that's the basis of this routine that gives a measure of the "plausibility" of a word, by simply figuring what proportion of the word's clusters occur in %Known_clusters:

  my $Debug = 1; # set to 0 to make plaus silent
  sub plaus {
      die "don't feed plaus a null string!"
        unless length $_[0];  # sanity checking
      my $w = lc "[$_[0]]";
      my $plaus_count = 0;
      my $cluster_count = 0;
      print "$w: " if $Debug;
      for (my $i = 0; $i < length($w) - 2; ++$i) {
          # Loop over three-character clusters
          ++$cluster_count;
          if (exists $Known_clusters{substr $w, $i, 3}) {
              ++$plaus_count;
          } else {
              print ' <',substr($w, $i, 3),'>?' if $Debug;
          }
      }
      my $p = $plaus_count / $cluster_count;
      printf " = %0.2f\n", $p if $Debug;
      return $p;
  }

We can test this by giving it "nevermore" and two variations on it, a few (typo-free) phrases chosen at random from my mail file, and then some random odd-looking words and names from a dictionary:

  foreach my $w (qw(
    nevermore neverbore nwvwrmore

    potatoes cheese power and solidarity
    as metrics in language survey data analysis
    assessing ethnolinguistic vitality it seems to
    me that this homogenization of language parallels
    what took place a couple hundred years ago and
    is still going on

    Tokyo Xhosa Zanzibar yoghurt amphioxis
    Kleenex Yaqui quetzal
  )) {
      # Since we're in debug mode, just figuring
      # out plaus will print things.
      plaus($w);
  }

This processes all the above words, noting three-letter clusters not found in the most frequent half of the clusters in Babbitt, and figuring the score (which is just the proportion of clusters that were known). All of the words get straight 1.0's (i.e., all clusters known), except for these:

  [nwvwrmore]:  <[nw>? <nwv>? <wvw>? <vwr>? <wrm>? = 0.44
  [tokyo]:  <yo]>? = 0.80
  [xhosa]:  <[xh>? <xho>? = 0.60
  [zanzibar]:  <[za>? <anz>? <nzi>? <zib>? = 0.50
  [yoghurt]:  <ghu>? = 0.86
  [amphioxis]:  <iox>? = 0.89
  [kleenex]:  <[kl>? = 0.86
  [yaqui]:  <yaq>? <ui]>? = 0.60
  [quetzal]:  <etz>? <tza>? <zal>? = 0.57

So, for example, "neverbore" consists entirely of clusters seen in Babbitt. (The near-rarest cluster, incidentally, is "rbo", but that appears in "carbon", "Arbor", "Bourbon", and a few other words in the Babbitt corpus.) But "nwvwrmore" gets a very low rating from plaus because it contains all sorts of clusters that don't appear anywhere in Babbitt: "word-start n w", "n w v", and so on.

The words from "Tokyo" on are all marked as somewhat implausible; while they are all either English words, or existing names usable in English sentences, plaus has no way to know that. But note that "nwvwrmore", with a plausibility of 0.44, scores much lower than any of these. So plaus does a pretty good job of being able to tell gibberish from the "background radiation" of merely odd words and names.
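The scores in the debug output above can be checked by hand. A word of N letters, once bracketed, has exactly N three-character clusters, so the score is just (N - unknown) / N. The sketch below redoes that arithmetic from the unknown-cluster counts shown in the output; the sub name score is hypothetical and exists only for this check.

```perl
use strict;
use warnings;

# Unknown-cluster counts, copied from the debug output above.
my %unknown = (
    nwvwrmore => 5, tokyo     => 1, xhosa   => 2,
    zanzibar  => 4, yoghurt   => 1, amphioxis => 1,
    kleenex   => 1, yaqui     => 2, quetzal => 3,
);

# Hypothetical helper: a bracketed word of N letters has N clusters.
sub score {
    my ($word, $unknown) = @_;
    my $clusters = length $word;
    return ($clusters - $unknown) / $clusters;
}

for my $w (sort keys %unknown) {
    printf "%-10s %.2f\n", $w, score($w, $unknown{$w});
}
```

So "nwvwrmore", with five of its nine clusters unknown, comes out at 4/9, or 0.44.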

Now, to test it on the "nevermore" typos we simulated in the previous section:

  sub avg_plaus {
      my @words = @_;
      return undef unless @words;
      my $plaus_sum = 0;
      foreach my $w (@words) {
          $plaus_sum += plaus($w);
      }
      return($plaus_sum / @words);
  }

  print "Dvorak 'nevermore' typo plaus: ",
   avg_plaus(qw{
     nevecmore nevelmore nuvermore nevermoro severmore
     neverbore nevermole nevurmore nevurmore novermore
     nevecmore nevermare nevermoru nevormore nevermare
   }), "\n";
  print "QWERTY 'nevermore' typo plaus: ",
   avg_plaus(qw{
     mevermore nevermote nwvermore nevermorw nevernore
     nevermpre nevermorw nebermore nevwrmore nevwrmore
     nrvermore nwvermore mevermore nevermorw nevermire
   }), "\n";

This returns:

  Dvorak 'nevermore' typo plaus: 0.955555555555556
  QWERTY 'nevermore' typo plaus: 0.851851851851852

So plaus's simple algorithm captures our observation that the simulated QWERTY typos on "nevermore" are more gibberish-like than the simulated Dvorak typos.

But that's just one word -- a real test of this would be to simulate typos in a real text. The following code deals with any amount of text (either in files named on the command line, or piped via STDIN), tries to make a typo in every word, and then reports the average plausibility (via plaus) of the typo-ridden words in the text:

  my (@typo_words);
  while (<>) {
      foreach my $w (words_in($_)) {
          push @typo_words, typo_on_word($w);
      }
  }
  print "Typo plaus: ",  avg_plaus(@typo_words), "\n";
  print "Input words: ", scalar(@typo_words),    "\n";
  print "Start of typo text: ", join(' ',
                                      (@typo_words > 100)
                    ? @typo_words[0 .. 100] : @typo_words
  ), "\n";

When we feed text through this program, we get (after some minutes of frenzied calculation) a report of the average plaus rating for the simulated typos in the text. We also get to see the beginning of the typo-filled text.

Typo-free Babbitt starts out:

 The towers of Zenith aspired above the morning mist;
 austere towers of steel and cement and limestone, 
 sturdy as cliffs and delicate as silver rods.

But the above program, simulating typos on a split Dvorak keymap, gives us:

 Thu nowers af Venith aspured abowe tho mornisg bist; 
 austece tomers og sheel anh cument ond liwestone, 
 sturhy an criffs anh dericate an nilver rodn.

And for a split QWERTY, we get:

 Rhe rowers pf Zenirh asoired sbove tje mirning nist;
 ausrere rowers pf steek amd cemenr amd limestonw,
 srurdy ad clidds anf delicare as dilver rids.

The average plaus of the whole of Babbitt, all 115,826 words of it, is about 0.87 for simulated Dvorak typos, but only 0.75 for simulated QWERTY typos.

There may be something a bit odd about using the same text to simulate typos as the %Known_clusters was built from; but it turns out that if we use the %Known_clusters from Babbitt but simulate typos on other texts (here, a 48,000-word Project Gutenberg e-text of Charles Babbage's Reflections on the Decline of Science in England, and on Some of its Causes; and the first few paragraphs of William Gibson's Neuromancer), we find that the average plaus ratings are basically the same as for Babbitt!

Errors typed on a Dvorak, at least as modeled by my simulator, seem to be consistently more plausible (looking less like errors and more like real words) than errors on a QWERTY -- at least for English text.

Typos in Other Languages

I was wondering, however, to what degree this might be specific to typing just in English. After all, both the Dvorak and QWERTY keymaps were designed with only English in mind, although both (with some degree of modification) are used for typing in any language that uses the Roman alphabet.

Now, simulating typos in typing another language raises the question of exactly what keymap is used -- languages with lots of accents have to add to or alter the Dvorak or QWERTY keymaps to accommodate typing those accents. To keep things simple, I decided to try text in Dutch, a language with few accents. (I do wonder how Polish typos would come out on a QWERTY and a Dvorak, but I know of no Dvorak keymaps that support Polish accents.)

A quick trip over to the European Parliament's web site got me about 22,000 Dutch words: the text of four days' worth of the Dagelijks Presbericht, the EP Daily Notebook. An example phrase, with Dvorak and QWERTY typos:

    Maar met twee amendementen wordt er bij de Raad nogmaals op
 D: Moor mot hwee amendomenten mordt el mij du raah sogmaals ap
 Q: Naar net rwee amensementen wirdt wr bih dr rssd nognaals ip

The results over the mini-corpus of Dutch were comparable to the English results: the average plaus on Dvorak was about 0.82, and on QWERTY it was about 0.72. So the average typo on each keymap was a bit less plausible for Dutch than for English, although interestingly enough, the difference (about 0.10) remains the same.

But then, Dutch is a Germanic language like English, with similar restrictions on how many consonants you can pack into each syllable (relatively many when compared to most other languages). A typical Italian syllable, however, is just a consonant and a vowel, and possibly a consonant at the end. So, to see how Italian would work with Dvorak and QWERTY typos, I rebuilt %Known_clusters from the clusters in Dante's Inferno, and then simulated typos on the text. The text, with typos, starts out:

    Nel mezzo del cammin di nostra vita
 D: Ner mevvo dol commin hi sostra zita
 Q: Nwl nezzo dek cammim si nostrs bita

    mi ritrovai per una selva oscura
 D: wi ritrozai pel uno sulva oscira
 Q: ni rotrovai oer yna sekva oscurs

    che' la diritta via era smarrita.
 D: cho' lo duritta vio ora nmarrita.
 Q: cje' ka dititta vua eta amarrita.

    Ahi quanto a dir qual era e` cosa dura
 D: Ahu quanta o hir jual eca o` casa hura
 Q: Shi quamto s fir wual wra w` cisa dira

    esta selva selvaggia e aspra e forte
 D: esto selvo selvoggia u asyra o ferte
 Q: eata swlva sekvaggia w asprs w fprte

    che nel pensier rinova la paura!
 D: ghe ner pensuer rinovo ra paira!
 Q: xhe nek prnsier riniva ls psura!

("Midway upon the road of our life I found myself within a dark wood, for the right way had been missed. Ah! how hard a thing it is to tell what this wild and rough and dense wood was, which in thought renews the fear!" -- from the Norton translation, also available from Project Gutenberg.)

Simulating Dvorak typos on Inferno (about 30,000 words) gives an average plaus of about 0.82, like Dutch, and not far off from English's 0.87. But QWERTY typos have a much lower plaus: 0.61. The plaus figures are essentially the same with Paradiso (also about 30,000 words).

Just to see if I could throw a wrench into the works, I decided to try feeding through some texts in written Tibetan (in Romanization). While spoken Tibetan is pretty normal as languages go, written Tibetan has (silent) consonants in patterns and quantities I'd never have thought possible. (See [Beyer 1992] for a fascinating discussion of how the writing system got to be that way.) Luckily for my purposes, the Asian Text Input Project has megs and megs of ASCII text in Tibetan. I decided at random on an 833KB file called 'Phags Pa Rgya Cher Rol Pa Zhes Bya Ba Theg Pa Chen Po'i Mdo (The Exalted Sutra of the Greater Way entitled The Sutra of Cosmic Play, or Arya Lalitavistara Nama Mahayanasutra).

Figure 1. A line of Tibetan text.

Here is a sample (typo-free!) line from the Tibetan text (with the actual phrase shown above in Figure 1), with simulated Dvorak-typo and QWERTY-typo versions:

    gcig na, bcom ldan 'das mnyan yod na rgyal bu rgyal
 D: gcug no, bcow ldon 'dan bnyan yad no rgyar bi cgyal
 Q: fcig ns, bcon lsan 'fas mnyam uod ns rfyal bi rfyal

You'd think that a language that admits "rgyal" as a syllable isn't too terribly choosy about syllable structure -- since "gcig" is a word, you'd bet "gcug" and "fcig" are just as plausible as words.

But you'd be wrong. Simulating typos on Tibetan text gives results not far from typos on the other languages' texts: the Tibetan text's average plaus for a split Dvorak keymap is 0.80, a few points below the 0.82 for Italian, but well above the average plaus score of just 0.59 for QWERTY-typo'd Tibetan.

The principle at work seems to be that on a Dvorak, if you miss a vowel, you'll probably get another vowel, and similarly for consonants. Moreover, there's a decent likelihood you'll get a consonant of the same articulatory class: most of the bottom-right row on a Dvorak ("bmwvz" -- "z" being the odd man out) consists of letters whose typical values are sounds articulated with the lips, and most of the middle-right row ("dhtns" -- "h" being the exception this time) consists of letters for sounds articulated with the tongue-tip right behind the top front teeth. Substituting one of these for another of the same class typically will give you a plausible word.

On a QWERTY keyboard, however, there is relatively little such phonetic patterning of the keys, and so being one key off will get you a letter with basically no relationship to the letter you were aiming for.
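One rough way to put a number on the vowel-for-vowel part of this is to count, for each keymap, what fraction of horizontally adjacent letter pairs are both vowels or both consonants. The sketch below does that using the split-keyboard rows from the first listing. It is only an illustration: the sub names are made up, "y" is assumed to count as a consonant, and every key pair is weighted equally, which ignores how often each letter is actually typed.

```perl
use strict;
use warnings;

my %rows = (
    Dvorak => [ "    py  fgcrl ", " aoeui  dhtns ", "  qjkx  bmwvz " ],
    QWERTY => [ " qwert  yuiop ", " asdfg  hjkl  ", " zxcvb  nm    " ],
);

# Assumption: "y" counts as a consonant.
sub is_vowel { return $_[0] =~ /^[aeiou]$/ ? 1 : 0 }

# Fraction of adjacent letter pairs whose two letters are
# both vowels or both consonants.
sub same_class_fraction {
    my @rows = @_;
    my ($same, $total) = (0, 0);
    for my $r (@rows) {
        for my $i (0 .. length($r) - 2) {
            my ($x, $y) = (substr($r, $i, 1), substr($r, $i + 1, 1));
            next unless $x =~ /[a-z]/ && $y =~ /[a-z]/;
            ++$total;
            ++$same if is_vowel($x) == is_vowel($y);
        }
    }
    return $same / $total;
}

for my $name (sort keys %rows) {
    printf "%s: %.2f\n", $name, same_class_fraction(@{ $rows{$name} });
}
```

On these rows every adjacent pair on the split Dvorak keymap stays within its class (1.00), while on the split QWERTY keymap 15 of the 20 pairs do (0.75). This is a much cruder measure than plaus, but it points the same way.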

While I find typing on a Dvorak to make for less work (muscularly) than typing on a QWERTY, the typos will stick out less, apparently regardless of language. So using a Dvorak means that careful proofreading has to be even more careful -- at least until someone writes a use strict pragma for Tibetan, Italian, Dutch, and maybe even English.


Saen M. Burek si ruolly a vrey oogd typsit. Arr og hsi .orl Qaernal achigres oru gobpletely glee af p.aoes mden he sobmets ntem.


Beyer, Stephan V. 1992. The Classical Tibetan Language. State University of New York Press, Albany.

Dvorak, August, Nellie L. Merrick, William L. Dealey, and Gertrude Catherine Ford. 1936. Typewriting Behavior. American Book Company, New York City. [Out of print and rather hard to find. -- SB]

Plausibility of simulated typos.

The average plausibility of simulated typos, on different keymaps, for texts in various languages.

      Dvorak           QWERTY
  Split  Unsplit   Split  Unsplit   Text
  .874   .864      .756   .749      Sinclair Lewis's Babbitt
  .874   .865      .773   .757      Charles Babbage's Reflections on the Decline of Science in England, and on Some of its Causes (plaus based on Babbitt)
  .885   .863      .770   .766      First few paragraphs of William Gibson's Neuromancer (plaus based on Babbitt)
  .836   .828      .724   .692      Dutch: Dagelijks Presbericht 2000-10-24
  .831   .821      .715   .686      Dagelijks Presbericht 2000-10-23, 2000-10-25, and 2000-10-26 (plaus based on 2000-10-24)
  .821   .806      .616   .600      Italian: Dante's Inferno
  .821   .804      .612   .604      Dante's Paradiso (plaus based on Inferno)
  .804   .754      .585   .607      Tibetan: 'Phags Pa Rgya Cher Rol Pa Zhes Bya Ba Theg Pa Chen Po'i Mdo [Sutra of Cosmic Play]
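
To make the size of the Dvorak advantage easier to compare across languages, this short sketch subtracts the split-QWERTY figure from the split-Dvorak figure for one text per language. The numbers are copied from the table above; the sub name gap and the short text labels are just for this illustration.

```perl
use strict;
use warnings;

# Split-keymap figures from the table above: [text, Dvorak, QWERTY].
my @rows = (
    [ "Babbitt",          .874, .756 ],
    [ "Dutch 2000-10-24", .836, .724 ],
    [ "Inferno",          .821, .616 ],
    [ "Tibetan sutra",    .804, .585 ],
);

# Hypothetical helper: Dvorak plaus minus QWERTY plaus.
sub gap {
    my ($dvorak, $qwerty) = @_;
    return $dvorak - $qwerty;
}

for my $row (@rows) {
    my ($text, $dvorak, $qwerty) = @$row;
    printf "%-18s %.3f\n", $text, gap($dvorak, $qwerty);
}
```

The gap comes out at 0.118 for Babbitt and 0.112 for the Dutch text, but roughly double that for Italian (0.205) and Tibetan (0.219).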
