Spell Checking and Correction
March 24, 2018
Automatic spelling correction is an important and interesting problem in NLP. This article walks you through the approaches and algorithms that can be used to understand automatic spell checking and correction.
Companies like Microsoft and Google have achieved high accuracy in spelling correction, which forms an integral part of their editors, email clients, word processors and search engines. In addition, spell checking is also often used as a pre-processing step in NLP to clean up text data before feeding it to machine learning algorithms for topic classification, search engine ranking, etc.
Spell Checking Illustration in Python
Here is a small piece of Python code that does spell checking and suggests the closest correction.
from autocorrect import spell
spell('somthing')
'Something'
Ensure that you have the Python package installed:
pip install autocorrect
Spell Check Implementation
The concept behind a spell checker is fairly straightforward. We start with Bayes’ theorem. The probability of a correction c out of all possible candidate corrections, given the original word w, is:
$$ P(c \mid w) = \frac{P(c) \times P(w \mid c)}{P(w)} $$

Since $P(w)$ is the same for every possible candidate c, we can factor it out. Thus, we get
$$ P(c \mid w) \propto P(c) \times P(w \mid c) $$

where
- $P(c)$, the language model, is the probability that c appears as a word of the language. For example, occurrences of “the” make up about 7% of English text, so we should have P(the) = 0.07.
- $P(w \mid c)$, the error model, is the probability that w would be typed when the author meant c. For example, P(teh|the) is relatively high, but P(theeexyz|the) would be very low.
Intuitively, $P(c \mid w)$ is affected by two factors: how often the word c occurs overall, and how close the spellings of w and c are.
Example: Consider the misspelled word w=“thew” and the two candidate corrections c=“the” and c=“thaw”. P(c|w) has to consider both the probability of c and the probability of the change from c to w. For example,
P(the|thew) = P(the) * P(thew|the)
and
P(thaw|thew) = P(thaw) * P(thew|thaw)
For the correction “thaw”, the only change is “a” to “e”, so the error model P(thew|thaw) would be high. On the other hand, since “the” is a very common word, P(the) would be very high. Eventually, the combination of the two factors decides the closest spelling correction.
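To make the intuition concrete, here is a tiny scoring sketch for the two candidates. The probability values below are made-up placeholders for illustration only, not estimates from a real corpus or error model.

```python
# Score a candidate correction c for a misspelling w: P(c|w) is proportional to P(c) * P(w|c).
# The numbers below are illustrative placeholders, not real estimates.
def score(lang_model_prob, error_model_prob):
    return lang_model_prob * error_model_prob

# "the" is far more frequent than "thaw", even though "thaw" needs a smaller edit.
print(score(lang_model_prob=0.07,    error_model_prob=0.0001))  # P(the) * P(thew|the)   = 7e-06
print(score(lang_model_prob=0.00001, error_model_prob=0.001))   # P(thaw) * P(thew|thaw) = 1e-08
```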
Next, we’ll see how to estimate each of these values.
Estimating likelihood of a word P(c)
P(c) can be estimated by simply counting the number of times each word appears in a text dataset. For example, occurrences of “the” make up about 7% of English text, so we should have P(the) = 0.07.
We only need to store words that appear at least somewhat commonly, say at least 5 times in the dataset.
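As a minimal sketch, assuming the corpus is a plain-text file (the file name big.txt is just a placeholder), P(c) can be computed like this:

```python
import re
from collections import Counter

# Read a text corpus and count word occurrences ("big.txt" is a placeholder name).
with open('big.txt', encoding='utf-8') as f:
    words = re.findall(r'[a-z]+', f.read().lower())

total = len(words)
counts = Counter(words)

# Keep only words that appear at least 5 times, as suggested above.
counts = Counter({w: n for w, n in counts.items() if n >= 5})

def P(word):
    """P(c): relative frequency of the word in the corpus (0 if unseen or too rare)."""
    return counts[word] / total

print(P('the'))  # roughly 0.07 for typical English text
```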
Estimating likelihood of a typo P(w|c)
Approach 1: Edit Distance
To understand this part, it is important to first understand what the edit distance between two strings is. The edit distance between any two strings is the minimum number of edit operations required to transform one string into the other. The edit operations can be the insertion, deletion or replacement of a character in the string. For example, the edit distance between boy and bat is 2, and between cats and cat it is 1.
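For reference, here is a standard dynamic-programming sketch of this edit distance, using only the insertions, deletions and replacements defined above:

```python
def edit_distance(s1, s2):
    """Minimum number of insertions, deletions and replacements to turn s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = edit distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # replacement (or match)
    return dp[m][n]

print(edit_distance('boy', 'bat'))   # 2
print(edit_distance('cats', 'cat'))  # 1
```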
The best correction candidates are those with the minimal edit distance from the misspelled word. However, there can be several corrections with the same edit distance; for example, the correction for “lates” could be “late”, “later” or “latest”, each at an edit distance of 1. There is no way to know the actual intended word for sure, so the probabilities P(c) help us pick the candidates with the highest probability.
Mathematically, we say $P(w|c) = k^{d(w, c)}$, where k is a constant (say 0.001) and d(w, c) is the edit distance between w and c. This means that if there are two corrections c1 and c2 at edit distances of 1 and 2 from w, then P(w|c1) is 1000 times larger than P(w|c2).
To find candidates efficiently, we generate all possible terms within an edit distance of 2 from each query term and look them up in a hash map of all words in the dictionary. With this approach, it is not necessary to compute the edit distance against every word in the dictionary.
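A minimal sketch of this candidate generation, assuming `dictionary` is a hash set (for example, the keys of the word-count map from earlier):

```python
import string

def edits1(word):
    """All strings at edit distance 1 from `word` via deletion, replacement or insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = {L + R[1:]     for L, R in splits if R}
    replaces = {L + c + R[1:] for L, R in splits if R for c in string.ascii_lowercase}
    inserts  = {L + c + R     for L, R in splits      for c in string.ascii_lowercase}
    return deletes | replaces | inserts

def known(words, dictionary):
    """Keep only the generated terms that exist in the dictionary (hash-set lookup)."""
    return {w for w in words if w in dictionary}

def candidates(word, dictionary):
    """Dictionary words within an edit distance of 2 from `word`."""
    e1 = edits1(word)
    e2 = {w2 for w1 in e1 for w2 in edits1(w1)}
    return known(e1, dictionary) | known(e2, dictionary)

# Tiny illustrative dictionary; in practice this is the full word list.
print(candidates('lates', {'late', 'later', 'latest', 'bat'}))
# -> {'late', 'later', 'latest'} (set order may vary)
```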
Approach 2: N-Gram Analysis
Another way to find the closest correction for a spelling error is n-gram analysis. In this method, instead of comparing the entire word against a dictionary, character n-grams are used to measure the similarity between strings. An n-gram is a sequence of n consecutive characters taken from a string. Let’s understand how to find the similarity coefficient between 2 strings using n-grams with a simple example.
Example: Consider 2 strings “statistics” and “statistical”. If n is set to 2 (i.e., bi-grams are being extracted), then the similarity of the two strings is calculated as follows:
Initially, the two strings are split into bi-grams:
statistics - st ta at ti is st ti ic cs (9 bigrams)
statistical - st ta at ti is st ti ic ca al (10 bigrams)
Then, find the unique bi-grams in each string:
statistics - st ta at is ti ic cs (7 unique bigrams)
statistical - st ta at ti is ic ca al (8 unique bigrams)
Next, find the unique bi-grams that are shared by both terms.
There are 6 such bi-grams: st ta at ic is ti.
The similarity measure is calculated using the following similarity coefficient:
Similarity coefficient = 2*C / (A + B)
- A - unique n-grams in term 1.
- B - unique n-grams in term 2.
- C - unique n-grams appearing in both term 1 and term 2.
The above example produces (2*6) / (7+8) = 0.80. The higher the similarity measure, the more relevant the word is as a correction.
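A short sketch of this similarity coefficient, following the formula above:

```python
def ngrams(word, n=2):
    """Unique character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def ngram_similarity(w1, w2, n=2):
    """Similarity coefficient 2*C / (A + B) over unique n-grams."""
    a, b = ngrams(w1, n), ngrams(w2, n)
    if not a or not b:
        return 0.0
    shared = a & b
    return 2 * len(shared) / (len(a) + len(b))

print(ngram_similarity('statistics', 'statistical'))  # 0.8
```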
Again, we can take $P(w|c) = k^{s(w, c)}$, where k is a (larger) constant (say 100,000) and s(w, c) is the n-gram similarity between w and c, so that candidates with higher similarity receive a higher probability.
A naive n-gram implementation would require scanning a large dictionary word list for every query. To implement this efficiently, we can maintain a hash map from n-grams to the dictionary words that contain them, so only words sharing an n-gram with the query need to be scored.
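A minimal sketch of such an inverted index, using a small made-up dictionary:

```python
from collections import defaultdict

def ngrams(word, n=2):
    """Unique character n-grams of a word (same helper as above)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def build_ngram_index(dictionary, n=2):
    """Inverted index: each n-gram maps to the set of dictionary words containing it."""
    index = defaultdict(set)
    for word in dictionary:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index

def lookup_candidates(word, index, n=2):
    """Collect dictionary words that share at least one n-gram with `word`."""
    found = set()
    for gram in ngrams(word, n):
        found |= index.get(gram, set())
    return found

index = build_ngram_index({'statistics', 'statistical', 'status', 'late'})
# Returns every word sharing a bigram with the misspelling; rank them by n-gram similarity.
print(lookup_candidates('statistcs', index))
```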
Rule-Based
In addition to the approaches we’ve seen so far, we can also use common rules, such as grammatical rules and homophones, to detect spelling errors and adjust the probability of typing one word for another. Examples of such rules are (a) the wrong use of the indefinite article “a” instead of “an” before a word beginning with a vowel sound, or (b) “write” and “right” used in the wrong context.
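As a toy illustration of rule (a), here is a rough check that only looks at the first letter of the following word; a real system would have to consider pronunciation (e.g. “an hour”, “a university”):

```python
def article_looks_wrong(article, next_word):
    """Rough heuristic for the a/an rule, ignoring pronunciation exceptions."""
    starts_with_vowel_letter = next_word[0].lower() in 'aeiou'
    if article.lower() == 'a' and starts_with_vowel_letter:
        return True   # e.g. "a apple" -> suggest "an apple"
    if article.lower() == 'an' and not starts_with_vowel_letter:
        return True   # e.g. "an ball" -> suggest "a ball"
    return False

print(article_looks_wrong('a', 'apple'))   # True
print(article_looks_wrong('an', 'hour'))   # True (a known limitation of this toy rule)
```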
Conclusion
In this tutorial, we saw some approaches for spell checking and correction using concepts of edit distance and n-gram similarity score. As a bonus, you can also check out a more detailed explanation and implementation of the spell check algorithm at Peter Norvig’s blog.