Wednesday, August 29, 2007

Spelling Corrector in 21 Lines of Python

A friend recently pointed out a very interesting article from Peter Norvig, the Director of Research at Google: How to Write a Spelling Corrector. The point of the article was to illustrate some spell-correction theory, but of course what I took away from it was that he had written a working spelling corrector in just 21 lines of Python code. Ammo for the language wars!

My friend also mentioned that he often mistypes things because his hands are offset one position on the keyboard. So he’ll type “je;;p” instead of “hello”. Since spelling correction is so easy, why do none of them offer a way to correct offset typing, he wondered? I told him he was probably just in a very small minority. I doubt that many web users touch type at all, let alone have the offset typing problem. But more importantly, here was an opportunity to ram home the point that so much is possible in so very few lines of Python.

So I took a swing at a solution, and five lines of code is what it took me to add offset-typing correction to Norvig’s spelling corrector:

offsets_right = {'j' : 'h', 'i' : 'u', ';' : 'l', 'p' : 'o'}
offsets_left = {'g' : 'h', 'y' : 'u', 'k' : 'l', 'i' : 'o'}
def offsets(word):
return set([''.join(map(lambda x: offsets_right.get(x, 'a'), word)),
''.join(map(lambda x: offsets_left.get(x, 'a'), word))])

And then you just have to tweak the final function:

def correct(word):
return max(known([word]) or known(edits1(word)) or known_edits2(word)
or known(offsets(word)) or [word],
key=lambda w: NWORDS[w])

This solution checks for words typed with a left or right offset. For example:

>>> correct('hillo')
'hullo'
>>> correct('ji;;p')
'hullo'
>>> correct('gykki')
'hullo'

Notes:
1. I use ‘hullo’ instead of ‘hello’ since the latter never appears in the Sherlock Holmes stories used as a training corpus.
2. The offset dictionaries only have 4 entries, to cover the four letters in the example. The complete solution would need to have entries for every non-meta key on the keyboard (except for the ones on the left or right edge).
3. My solution is a bit more complicated than I’d like. What I wanted to write was:
map(offsets_right.get, word)
Which is nice and clear but for various reasons I had to use:
''.join(map(lambda x: offsets_right.get(x, 'a'), word))
There’s probably a nice Python solution that would be as terse but more readable.

Here’s the complete solution:

import re, string, collections

def words(text): return re.findall('[a-z]+', text.lower())

def train(features):
model = collections.defaultdict(lambda: 1)
for f in features:
model[f] += 1
return model

NWORDS = train(words(file('holmes.txt').read()))

def edits1(word):
n = len(word)
return set([word[0:i]+word[i+1:] for i in range(n)] + ## deletion
[word[0:i]+word[i+1]+word[i]+word[i+2:] for i in range(n-1)] + ## transposition
[word[0:i]+c+word[i+1:] for i in range(n) for c in string.lowercase] + ## alteration
[word[0:i]+c+word[i:] for i in range(n+1) for c in string.lowercase]) ## insertion

def known_edits2(word):
return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words): return set(w for w in words if w in NWORDS)

offsets_right = {'j' : 'h', 'i' : 'u', ';' : 'l', 'p' : 'o'}
offsets_left = {'g' : 'h', 'y' : 'u', 'k' : 'l', 'i' : 'o'}

def offsets(word):
return set([''.join(map(lambda x: offsets_right.get(x, 'a'), word)),
''.join(map(lambda x: offsets_left.get(x, 'a'), word))])

def correct(word):
return max(known([word]) or known(edits1(word)) or known_edits2(word)
or known(offsets(word)) or [word],
key=lambda w: NWORDS[w])

No comments: