# Time-stamp: "2000-10-31 02:07:46 MST"

=head1 Simulating Typos with Perl I

=head2 Sean M. Burke

[Eds: This article seems a bit longish, but much of it is bits of text, like the Dante quote, that one looks at rather than reads. And, incidentally, those can be trimmed if you think it's overlong. I figure that, if nothing else, the reader will learn that the Dvorak keymap exists, Tibetan is strange, you can get lots of odd text on the Net, and that one can "redo" even a (named) non-loop block.]

About two years ago, I switched to typing with a Dvorak keymap. That meant going from the Sholes "QWERTY" keymap:

  ` 1 2 3 4 5 6 7 8 9 0 - = \
     q w e r t y u i o p [ ]
      a s d f g h j k l ; '
       z x c v b n m , . /

to August Dvorak's more efficiency-minded keymap:

  ` 1 2 3 4 5 6 7 8 9 0 [ ] \
     ' , . p y f g c r l / =
      a o e u i d h t n s -
       ; q j k x b m w v z

It was just a matter of switching the keymap preferences on whatever computers I had to type on, and then a few days of acclimatizing to all the keys having moved. This had the two desired effects: my hands no longer ached after marathon coding sessions, and no one else ever touched my computer again.

But there was one side effect I hadn't anticipated: a different keymap means different kinds of typos. This became evident to me first on IRC -- since IRC is a medium characterized by people typing faster than they can think, typos abound:

  I hear it's out on video now
  me> I know, I sow it a wook age.
  sow? wook age?
  GWAWRR! BEWARE THE AGE OF THE WOOK!
  me> I mean I sAw it a wEEk agO.
  guh, how do you manage to aim for 'e' and hit 'o' instead?
    they're on different sides of the keyboard
  me> They're right next to eachother on mine.  I use a Dvorak
    keyboard.  The middle row goes: "aoeuidhtns".
  that's because you're a communist
  me> columnist
  yea like dvorak
  me> different Dvorak.  August, not John.
  whatEVERRRR
  i like pie

[Re-wrap as needed, of course.]

Over time I did get the feeling that typos on a Dvorak really were consistently different. At least for me, the typos I'd made on a QWERTY were either transpositions ("hte" for "the") or hits on a key adjacent to the one meant. On a Dvorak, transposition errors are more or less the same, but adjacent-key errors are, naturally, rather different -- if you miss to the left or right of a QWERTY "e", you hit "w" or "r", but to the left or right of a Dvorak "e" are "o" and "u". So the equivalents of "fwlt" or "frlt" on a QWERTY become "folt" or "fult" on a Dvorak.

I had the feeling that Dvorak typos were, on the whole, much less likely to "look like typos", compared to QWERTY typos. Whereas "fwlt" and "frlt" couldn't possibly be words, "folt" and "fult" look like plausible words that happen not to exist. And sometimes the typo does make for an existing word -- one off from "seen" is "soon", one off from "be" is "me", and so on. This isn't something completely exclusive to a Dvorak -- on a QWERTY, "fear" and "dear" are just one key off -- but I had a feeling it was happening much more frequently with the Dvorak.

Now, looking at the keymap, it sort of stands to reason; but then, lots of things stand to reason that don't actually happen (like, say, everyone abandoning the QWERTY keymap, or having done so decades ago). So I decided that the best way to test this would be to write some sort of program to simulate typos on a Dvorak and on a QWERTY, have it generate lots and lots and lots of typos, and see what the results would be.

=head2 Simulating the Typos

For the sake of simplicity, I figured I'd model the kind of typo I make most: trying to hit one key, but hitting a key either to the left or to the right. And since most of the keys I hit are letters, I decided to ignore typos on other keys, like hitting "%" instead of "$", or even shift typos -- typing "THe" for "The".

The first thing any typo-simulating program needs to know is what keys are next to what.
So the first thing I wrote was a data table for the keyboard, C<@rows>, and then a bit of code to expand that into two hashes, C<%Left> and C<%Right>:

  use strict;
  my @rows;
  if(1) { # change to 0 to get qwerty.
    @rows =(
     # Yes, I use a split keyboard...
     # (The columns matter: the split is the two
     #  spaces at offsets 6 and 7 of each row.)
     "    py  fgcrl ",
     " aoeui  dhtns ",
     "  qjkx  bmwvz ",
    );
  } else {
    @rows =(
     " qwert  yuiop ",
     " asdfg  hjkl  ",
     " zxcvb  nm    ",
    );
  }
  # To simulate an un-split keyboard:
  # for(@rows) { substr($_,6,2) = '' }

  my(%Left, %Right);
  # So $Left{$x} is what letter, if any, is
  # to the left of the letter $x.
  foreach my $r (@rows) {
    for(my $i = 1; $i < length $r; ++$i) {
      my $x = substr($r,$i,1);
      next unless $x =~ m/[a-z]/;
      $Left{$x} = substr($r,$i - 1,1)
       unless substr($r,$i - 1,1) eq ' ';
      $Right{$x} = substr($r,$i + 1,1)
       unless substr($r,$i + 1,1) eq ' ';
    }
  }
  # And add the uppercase letters:
  %Left  = (%Left,  map uc($_), %Left);
  %Right = (%Right, map uc($_), %Right);

Then, after some tinkering, I came up with a function that, given a word, would try to think of some way to make a typo in it:

  sub typo_on_word {
    my $word = $_[0];
    my $typo_word;
    my $tries = 0;
   Make_typo:
    {
      if(++$tries > 4) {
        # after too many do-overs, give up
        $typo_word = $word;
        last Make_typo;
      }
      my @strokes = stroke_groups($word);
      my $where = int rand @strokes;
      my $char = substr($strokes[$where],0,1);
      my $instead = (rand(1) < .5)
        ? ($Left{$char}  || $Right{$char} || redo)
        : ($Right{$char} || $Left{$char}  || redo);
      $strokes[$where] =
        $instead x length $strokes[$where];
      # So 'e' => 'r' or 'w', 'ee' => 'rr' or 'ww'
      $typo_word = join '', @strokes;
      redo Make_typo
       unless rep_pattern($word) eq rep_pattern($typo_word);
      # That's so that we don't create any stroke
      # groups that weren't there before, as in
      # turning "soar" into "soor", which is a
      # kind of mistake that I rarely if ever make.
    }
    return $typo_word;
  }

  sub stroke_groups {
    # 'eat'  => qw(e a t)
    # 'eel'  => qw(ee l)
    # 'fool' => qw(f oo l)
    my @out;
    while($_[0] =~ m<(.)(\1*)>g) {
      push @out, $1 . $2;
    }
    return @out;
  }

  sub rep_pattern {
    # 'eat'  => '1_1_1'
    # 'eel'  => '2_1'
    # 'fool' => '1_2_1'
    join '_', map length($_), stroke_groups($_[0]);
  }

Now, there's a lot going on here, so I'll break it down: every word is seen as an array of stroke groups -- where each stroke group is a character plus any immediately following repetitions of itself. So "cat" is three stroke groups, "c", "a", "t"; and "food" is also three: "f", "oo", "d". Modeling things based on stroke groups captures the fact that if I miss the first "o" in "food", I'm also going to miss the following "o" the same way.

And it also captures the fact that I wouldn't make a typo that would create a new stroke group -- while I could mistype "pen" as "pes", I would I<not> mistype "pens" as "pess" or "penn". So if the typo-generating code tried doing exactly that, turning "pens" into "pess", then C<rep_pattern($word) eq rep_pattern($typo_word)> would be false (C<rep_pattern> of "pens" is "1_1_1_1" but C<rep_pattern> of "pess" is "1_1_2"), and the C<redo> would start the block over. (Yes, you can have C<redo>s and C<last>s in non-loop blocks!)

[Eds: Some of my explanation in the above two paragraphs duplicates explanation in the comments in the code block. Trim if you like.]

So if we use the above subs and then try:

  for(1 .. 15) {
    print typo_on_word("nevermore"), " ";
  }

Run with the Dvorak keymap, you'll get output like this:

  nevecmore nevelmore nuvermore nevermoro severmore
  neverbore nevermole nevurmore nevurmore novermore
  nevecmore nevermare nevermoru nevormore nevermare

And with a QWERTY keymap,

  mevermore nevermote nwvermore nevermorw nevernore
  nevermpre nevermorw nebermore nevwrmore nevwrmore
  nrvermore nwvermore mevermore nevermorw nevermire

First off, these look to me like plausible typos of the sort I've made on Dvoraks and QWERTYs. This is not to say that every possible typo I'd make would be generated by the above C<typo_on_word> function. For example, C<typo_on_word> doesn't attempt to simulate transposition, as in "hten" for "then".
Moreover, it fails to account for the fact that I now and then make typos like "moro" for "mere" -- where, in effect, "e-e" functions as a sort of stroke group, because the left hand never leaves its key, regardless of the fact that the right hand is meanwhile off hitting the "r". But there are diminishing returns to this; I think that if I wrote a function that modeled every kind of typo I make, with the appropriate frequency, it alone would be longer than this article (if not this whole issue), but wouldn't be much more realistic than what I hacked together. The exhaustive and exhausting detail that Dvorak's book I goes into certainly convinced me of the fact that errors are not simple things.

However, C<typo_on_word> does simulate some of the sorts of typos I do make, on each kind of keyboard. And notice that most of the simulated Dvorak typos for "nevermore" look more or less like plausible (if not actually existing) English words to me, whereas most of the QWERTY typos contain character sequences that no English word could contain, like "nwv", "vwrm", etc.

=head2 How to Tell a Word

Being able to say that the string "tevermore" I<could> be an existing word but "nevwrmore" I<couldn't> be (and maybe that "nevecmore" and "nevermoru" sort-of could be) is something we can do intuitively, based on some pretty complex implicit knowledge about how letters (and, at another level, sounds) can co-occur in English. Managing to express that knowledge and then teaching it to a computer would be pretty difficult. However, it's possible to teach the computer to acquire, on its own, a simple model of letter co-occurrence.

Consider the word "nevermore" as a sequence of overlapping three-character sequences, including, for good measure, enclosing brackets to stand for the word boundaries:

  [nevermore]
  [ne nev eve ver erm rmo mor ore re]

If we scan a large amount of existing and presumably typo-free text (a corpus), and look at all such three-character clusters (trigraphs), then we'll be able to scrutinize the simulated typo "nevwrmore", and we'll see that it consists of never-before-seen clusters like "evw", "vwr", and "wrm". Then we can note that it's got three things wrong with it, which makes it rather implausible as a word.

First, to build the frequency table:

  my $text = 'babbitt.txt';
  open(TEXT, "<$text")
   or die "Can't read-open $text: $!";
  my %Known_clusters;
  while(<TEXT>) {
    my @words = words_in($_);
    foreach my $w (@words) {
      $w = lc "[$w]";
      for(my $i = 0; $i < length($w) - 2; ++$i) {
        ++$Known_clusters{substr $w, $i, 3};
      }
    }
  }
  close(TEXT);

  sub words_in {
    return " $_[0]" =~
      m/\s([a-zA-Z]+[a-zA-Z']*)(?=[\s,.`?!;])/g;
    #return $_[0] =~
    #  m/\s([a-zA-Z]+[a-zA-Z']*)(?=[\s,.])/g ;
    #
    # See perlfaq6 for more on matching words
  }

This builds a hash, C<%Known_clusters>, where the keys are all the three-letter clusters in all the words in a file, and each value is the number of times that cluster was seen. The file I happen to be using is a 700K text file comprising Sinclair Lewis's novel I<Babbitt>, available from Project Gutenberg (gutenberg.net).

We can test whether a cluster occurred in the text by just testing C<exists $Known_clusters{$cluster}> -- and that's the basis of this routine that gives a measure of the "plausibility" of a word, by simply figuring what proportion of the word's clusters occur in C<%Known_clusters>:

  my $Debug = 1; # set to 0 to make plaus silent
  sub plaus {
    die "don't feed plaus a null string!"
     unless length $_[0]; # sanity checking
    my $w = lc "[$_[0]]";
    my $plaus_count = 0;
    my $cluster_count = 0;
    print "$w: " if $Debug;
    for(my $i = 0;
        $i < length($w) - 2;
        ++$i
        # Loop over three-character clusters
    ) {
      ++$cluster_count;
      if(exists $Known_clusters{substr $w, $i, 3}) {
        ++$plaus_count;
      } else {
        print ' <', substr($w, $i, 3), '>?' if $Debug;
      }
    }
    my $p = $plaus_count / $cluster_count;
    printf " = %0.2f\n", $p if $Debug;
    return $p;
  }

We can test this by giving it a few variations on "nevermore", and a few (typo-free) phrases chosen at random from my mail file, and then some random odd-looking words and names from a dictionary:

  foreach my $w (qw(
    nevermore neverbore nwvwrmore
    potatoes cheese
    power and solidarity as metrics in language
    survey data analysis assessing ethnolinguistic
    vitality it seems to me that this
    homogenization of language parallels what
    took place a couple hundred years ago and is
    still going on
    Tokyo Xhosa Zanzibar yoghurt amphioxis
    Kleenex Yaqui quetzal
  )) {
    plaus($w);
    # Since we're in debug mode, just figuring
    # out plaus will print things.
  }
  exit;

This processes all the above words, noting three-letter clusters not found in the most frequent half of the clusters in I<Babbitt>, and figuring the score (which is just the proportion of clusters which were known). All of the words get straight 1.0's (i.e., all clusters known), except for these:

  [nwvwrmore]:  <[nw>? <nwv>? <wvw>? <vwr>? <wrm>? = 0.44
  [tokyo]:  ? = 0.80
  [xhosa]:  <[xh>? ? = 0.60
  [zanzibar]:  <[za>? ? ? ? = 0.50
  [yoghurt]:  ? = 0.86
  [amphioxis]:  ? = 0.89
  [kleenex]:  <[kl>? = 0.86
  [yaqui]:  ? ? = 0.60
  [quetzal]:  ? ? ? = 0.57

So, for example, "neverbore" consists entirely of clusters seen in I<Babbitt>. (The near-rarest cluster, incidentally, is "rbo", but that appears in "caB<rbo>n", "AB<rbo>r", "BouB<rbo>n", and a few other words in the I<Babbitt> corpus.) But "nwvwrmore" gets a very low rating from C<plaus> because it contains all sorts of clusters that don't appear anywhere in I<Babbitt>: "[nw", "nwv", etc.
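
The cluster-splitting step that both the table-building loop and C<plaus> perform inline can also be pulled out into a helper of its own. What follows is only a minimal sketch, not part of the program described in this article: the names C<clusters_of>, C<unknown_clusters>, and C<%toy_known> are made up for illustration, and the toy table is built from the single word "nevermore" rather than from a real corpus.

```perl
use strict;
use warnings;

# Hypothetical helper: return the overlapping three-character
# clusters of a word, with "[" and "]" marking the word boundaries.
sub clusters_of {
  my $w = lc "[$_[0]]";
  my @clusters;
  for(my $i = 0; $i < length($w) - 2; ++$i) {
    push @clusters, substr($w, $i, 3);
  }
  return @clusters;
}

# Hypothetical helper: the clusters of a word that are missing
# from a table of known clusters (passed as a hash reference).
sub unknown_clusters {
  my($word, $known) = @_;
  return grep { !exists $known->{$_} } clusters_of($word);
}

# A toy table, standing in for a corpus-built %Known_clusters.
my %toy_known = map { $_ => 1 } clusters_of("nevermore");

print join(' ', clusters_of("nevermore")), "\n";
# prints: [ne nev eve ver erm rmo mor ore re]

print join(' ', unknown_clusters("nevwrmore", \%toy_known)), "\n";
# prints: evw vwr wrm
```

Passing the table as a reference keeps the helper independent of any one global hash, which is handy when the same check has to be run against tables built from different texts.
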

The words from "Tokyo" on are all marked as somewhat implausible; while they I<are> all either English words, or existing names usable in English sentences, C<plaus> has no way to know that. But note that "nwvwrmore", with a plausibility of .44, scores much lower than any of these. So C<plaus> does a pretty good job of being able to tell gibberish from the "background radiation" of merely odd words and names.

Now, to test it on the "nevermore" typos we simulated in the previous section:

  sub avg_plaus {
    my @words = @_;
    return undef unless @words;
    my $plaus_sum = 0;
    foreach my $w (@words) {
      $plaus_sum += plaus($w);
    }
    return($plaus_sum / @words);
  }

  print "Dvorak 'nevermore' typo plaus: ",
   avg_plaus(qw{
    nevecmore nevelmore nuvermore nevermoro
    severmore neverbore nevermole nevurmore
    nevurmore novermore nevecmore nevermare
    nevermoru nevormore nevermare
   }), "\n";

  print "QWERTY 'nevermore' typo plaus: ",
   avg_plaus(qw{
    mevermore nevermote nwvermore nevermorw
    nevernore nevermpre nevermorw nebermore
    nevwrmore nevwrmore nrvermore nwvermore
    mevermore nevermorw nevermire
   }), "\n";

This prints:

  Dvorak 'nevermore' typo plaus: 0.955555555555556
  QWERTY 'nevermore' typo plaus: 0.851851851851852

So C<plaus>'s simple algorithm does capture our observation that the simulated QWERTY typos on "nevermore" are more gibberish-like than the simulated Dvorak typos.

[ FOOTNOTE: The only unknown clusters in the Dvorak nevermores were: nevBre Bermore nevermoB. However, in the QWERTY nevermores, there were: Bermore Brmore nevermoB neveBre nBore Bermore. ]

But that's just one word -- a real test of this would be to simulate typos in a real text.
The following code deals with any amount of text (either in files named on the command line, or piped on STDIN), tries to make a typo in every word, and then reports the average plausibility (via C<plaus>) of the typo-ridden words in the text:

  my(@typo_words);
  while(<>) {
    foreach my $w (words_in($_)) {
      push @typo_words, typo_on_word($w);
    }
  }
  print "Typo plaus: ", avg_plaus(@typo_words), "\n";
  print "Input words: ", scalar(@typo_words), "\n";
  print "Start of typo text: ",
    join(' ',
      (@typo_words > 100) ? @typo_words[0 .. 100]
                          : @typo_words
    ), "\n";

When we feed text through this program, we get (after some minutes of frenzied calculation) a report of the average C<plaus> rating for the simulated typos in the text. We also get to see the beginning of the typo'd text. Typo-free I<Babbitt> starts out:

I

But the above program, simulating typos on a split Dvorak keymap, gives us:

I

And for a split QWERTY, we get:

I

The average C<plaus> of the whole of I<Babbitt>, all 115,826 words of it, is about .87 for simulated Dvorak typos, but only .75 for simulated QWERTY typos.

There may be something a bit odd about simulating typos on the same text that C<%Known_clusters> was built from; but it turns out that if we use the C<%Known_clusters> from I<Babbitt> but simulate typos on other texts (here, a 48,000-word Project Gutenberg e-text of Charles Babbage's I; and the first few paragraphs of William Gibson's I), we find that the average C<plaus> ratings are basically the same as for I<Babbitt>! Errors typed on a Dvorak, at least as modeled by my simulator, seem to be consistently more plausible (looking less like errors and more like real words) than errors on a QWERTY -- at least for English text.

=head2 Typos in Other Languages

I was wondering, however, to what degree this might be specific to typing just in English.
After all, both the Dvorak and QWERTY keymaps were designed with only English in mind, although both (with some degree of modification) are used for typing in any language that uses the Roman alphabet.

Now, simulating typos in typing another language raises the question of exactly what keymap is used -- languages with lots of accents have to add to or alter the Dvorak or QWERTY keymaps to accommodate typing those accents. To keep things simple, I decided to try text in Dutch, a language that uses relatively few accents. (I do wonder how Polish typos would come out on a QWERTY and a Dvorak, but I know of no Dvorak keymaps that support Polish accents.) A quick trip over to the European Parliament's web site (www.europarl.eu.int) got me about 22,000 words of text in Dutch (the text of four days' worth of the I, the EP Daily Notebook). An example phrase, with Dvorak and QWERTY typos:

     Maar met twee amendementen wordt er bij de Raad nogmaals op
  D: Moor mot hwee amendomenten mordt el mij du raah sogmaals ap
  Q: Naar net rwee amensementen wirdt wr bih dr rssd nognaals ip

The results over the mini-corpus of Dutch were comparable to the English results: the average C<plaus> on Dvorak was about .83, and on QWERTY it was about .72. So the average typo on each keymap was a bit less plausible for Dutch than for English, although interestingly enough, the difference between the two keymaps (about .10) remains the same.

But then, Dutch is a Germanic language like English, with similar restrictions on how many consonants you can pack into each syllable (i.e., relatively a lot, compared to most other languages). A typical Italian syllable, however, is just a consonant and a vowel, and possibly a consonant at the end. So, to see how Italian would work with Dvorak and QWERTY typos, I rebuilt C<%Known_clusters> from the clusters in Dante's I, and then simulated typos on the text.
The text, with typos, starts out:

     Nel mezzo del cammin di nostra vita
  D  Ner mevvo dol commin hi sostra zita
  Q  Nwl nezzo dek cammim si nostrs bita

     mi ritrovai per una selva oscura
  D  wi ritrozai pel uno sulva oscira
  Q  ni rotrovai oer yna sekva oscurs

     che' la diritta via era smarrita.
  D  cho' lo duritta vio ora nmarrita.
  Q  cje' ka dititta vua eta amarrita.

     Ahi quanto a dir qual era e` cosa dura
  D  Ahu quanta o hir jual eca o` casa hura
  Q  Shi quamto s fir wual wra w` cisa dira

     esta selva selvaggia e aspra e forte
  D  esto selvo selvoggia u asyra o ferte
  Q  eata swlva sekvaggia w asprs w fprte

     che nel pensier rinova la paura!
  D  ghe ner pensuer rinovo ra paira!
  Q  xhe nek prnsier riniva ls psura!

[next paragraph in small type, I think:]

("Midway upon the road of our life I found myself within a dark wood, for the right way had been missed. Ah! how hard a thing it is to tell what this wild and rough and dense wood was, which in thought renews the fear!" -- from the Norton translation, also available from Project Gutenberg.)

Simulating Dvorak typos on I (about 30,000 words) gives an average C<plaus> of about .82, close to Dutch, and not far off from English's .87. But QWERTY typos have a much lower C<plaus>: .61. The C<plaus> figures are the same with I (also about 30,000 words).

Just to see if I could throw a wrench into the works, I decided to try feeding through some texts in written Tibetan (in Romanization). While spoken Tibetan is pretty normal as languages go, written Tibetan has (silent) consonants in patterns and quantities I'd never have thought possible. (See Beyer 1992 for a fascinating discussion of how the writing system got to be that way.) Luckily for my purposes, the Asian Classics Input Project (asianclassics.org) has megs and megs of ASCII text in Tibetan. I decided at random on an 833KB file called I<'Phags Pa Rgya Cher Rol Pa Zhes Bya Ba Theg Pa Chen Po'i Mdo> (i.e., I, or I).

Here is a sample (typo-free!)
line from the Tibetan text, with simulated Dvorak-typo and QWERTY-typo versions:

     gcig na, bcom ldan 'das mnyan yod na rgyal bu rgyal byed kyi tsal
  D: gcug no, bcow ldon 'dan bnyan yad no rgyar bi cgyal byud kpi tsol
  Q: fcig ns, bcon lsan 'fas mnyam uod ns rfyal bi rfyal bued lyi rsal

[If that's too wide, trim after each line's first "rgyal"]

[And BTW, Tibetan is so visually interesting that I /am/ looking for a version of the above line rendered in Tibetan script. If I find it, I'll send a GIF or something. I'll see what I can do -- no promises, though.]

You'd think that a language that admits "rgyal" as a syllable is just not too terribly choosy about syllable structure -- since "gcig" is a word, you'd bet that "gcug" and "fcig" are just as plausible as words. But you'd be wrong. Simulating typos on Tibetan text gives results not far from typos on the other languages' texts: the Tibetan text's average C<plaus> for a split Dvorak keymap is .80, a few points below the .82 for Italian, but well above the average C<plaus> score of just .59 for QWERTY-typo'd Tibetan.

The principle at work seems to be that on a Dvorak, if you miss while going to type a vowel, you'll probably get another vowel, and similarly for consonants. Moreover, there's a decent likelihood you'll get a consonant of the same articulatory class: most of the bottom-right row on a Dvorak ("bmwvz" -- "z" being the odd man out) is letters whose typical values are sounds articulated with the lips, and most of the middle-right row ("dhtns" -- "h" being the exception this time) is letters for sounds articulated with the tongue-tip right behind the top front teeth. Substituting one of these for another of the same class typically will give you a plausible word. On a QWERTY, however, there is relatively little such phonetic patterning of the keys, and so missing and being one key off will get you a letter with basically no relationship to the letter you were aiming for.

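
The vowel-for-vowel, consonant-for-consonant part of that principle can be checked directly against the two keymaps. The following is a rough sketch under some assumptions of my own: it looks only at the letter rows of a split keyboard, it counts "y" as a consonant, and it counts every pair of adjacent keys once without weighting by how often each letter is actually typed. The names C<%rows_for>, C<is_vowel>, and C<same_class_pairs> are invented for this illustration.

```perl
use strict;
use warnings;

# Letter rows of each keymap; a space marks the gap of a split
# keyboard, so the keys on either side of it are not neighbours.
my %rows_for = (
  Dvorak => [ 'py fgcrl',    'aoeui dhtns', 'qjkx bmwvz' ],
  QWERTY => [ 'qwert yuiop', 'asdfg hjkl',  'zxcvb nm'   ],
);

# Assumption: "y" is treated as a consonant.
sub is_vowel { return $_[0] =~ m/^[aeiou]$/ ? 1 : 0 }

# Returns (same-class pairs, total pairs) for a list of rows,
# where a pair is two letter keys side by side in a row.
sub same_class_pairs {
  my @rows = @_;
  my($same, $total) = (0, 0);
  foreach my $r (@rows) {
    for(my $i = 0; $i < length($r) - 1; ++$i) {
      my $k1 = substr($r, $i,     1);
      my $k2 = substr($r, $i + 1, 1);
      next unless $k1 =~ m/[a-z]/ and $k2 =~ m/[a-z]/;
      ++$total;
      ++$same if is_vowel($k1) == is_vowel($k2);
    }
  }
  return($same, $total);
}

foreach my $map (sort keys %rows_for) {
  my($same, $total) = same_class_pairs(@{ $rows_for{$map} });
  printf "%s: %d of %d adjacent pairs stay in the same class (%.2f)\n",
    $map, $same, $total, $same / $total;
}
# Dvorak: 20 of 20 adjacent pairs stay in the same class (1.00)
# QWERTY: 15 of 20 adjacent pairs stay in the same class (0.75)
```

This is a count over keys, not over typos, so it says nothing about frequency; it only shows that on a split Dvorak no letter key has a neighbour of the other class, while on a split QWERTY a quarter of the adjacent pairs mix a vowel with a consonant.
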
While I find typing on a Dvorak to make for less work (muscularly) than typing on a QWERTY, the typos will stick out less, apparently regardless of language. So using a Dvorak means that careful proofreading has to be even more careful -- at least until someone writes a C pragma for Tibetan, Italian, Dutch, and maybe even English.

__END__

Saen M. Burek si ruolly a vrey oogd typsit. Arr og hsi I<.orl Qaernal> achigres oru gobpletely glee af p.aoes mden he sobmets ntem.

=head2 References

Beyer, Stephan V. 1992. I State University of New York Press, Albany.

Dvorak, August, Nellie L. Merrick, William L. Dealey, and Gertrude Catherine Ford. 1936. I American Book Company, New York City. [Out of print and rather hard to find. -- SB]

And, for a sidebox, the following table sums things up. Presumably you'll want to lay it out in a real table, instead of ASCII art:

  The average plausibility of simulated typos, on different keymaps,
  for texts in various languages.

       Dvorak     |      QWERTY
   Split | Unsplit|  Split | Unsplit
  ========|========|========|=========
    .874  |  .864  |  .756  |  .749    Sinclair Lewis's Babbitt
    .874  |  .865  |  .773  |  .757    Charles Babbage's I
          |        |        |            (plaus based on Babbitt)
    .885  |  .863  |  .770  |  .766    First few paragraphs of
          |        |        |            William Gibson's I
          |        |        |            (plaus based on Babbitt)
  --------|--------|--------|---------
    .836  |  .828  |  .724  |  .692    Dutch: I 2000-10-24
    .831  |  .821  |  .715  |  .686    I 2000-10-23, 2000-10-25, and
          |        |        |            2000-10-26
          |        |        |            (plaus based on 2000-10-24)
  --------|--------|--------|---------
    .821  |  .806  |  .616  |  .600    Italian: Dante's I
    .821  |  .804  |  .612  |  .604    Dante's I
          |        |        |            (plaus based on I)
  --------|--------|--------|---------
    .804  |  .754  |  .585  |  .607    Tibetan: 'Phags Pa Rgya Cher
          |        |        |            Rol Pa Zhes Bya Ba Theg Pa
          |        |        |            Chen Po'i Mdo
          |        |        |            [Sutra of Cosmic Play]
  ========|========|========|=========

=cut