Edit Distance
Edit Distance (a.k.a. Levenshtein Distance) is a measure of similarity between two strings, referred to as the source string and the target string.
The distance between the source string and the target string is the minimum number of edit operations (deletions, insertions, or substitutions) required to transform the source into the target. The lower the distance, the more similar the two strings.
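To make the definition concrete, here is a minimal dynamic-programming sketch of the idea (an illustrative Wagner-Fischer implementation, not NLTK's actual code):

```python
def edit_distance(source, target):
    """Minimum number of deletions, insertions, and substitutions
    needed to turn `source` into `target` (Wagner-Fischer DP)."""
    m, n = len(source), len(target)
    # prev[j] = distance between the first i-1 chars of source
    # and the first j chars of target
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

print(edit_distance('mapping', 'mappings'))  # 1: one insertion
```

The table is filled row by row, so only two rows need to be kept in memory at a time.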
Among the common applications of the Edit Distance algorithm are: spell checking, plagiarism detection, and translation memory systems.
Edit Distance Python NLTK
The NLTK library has the Edit Distance algorithm ready to use. Let's take some examples.
Example #1
```python
import nltk

w1 = 'mapping'
w2 = 'mappings'
nltk.edit_distance(w1, w2)
```
The output is 1 because the difference between “mapping” and “mappings” is only one character, “s”.
Example #2
Basic Spelling Checker: Let's assume you have a misspelled word and a list of candidate words, and you want to find the nearest suggestion.
```python
import nltk

mistake = "ligting"
words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living',
         'lighting', 'orange', 'walking', 'zoo']

for word in words:
    ed = nltk.edit_distance(mistake, word)
    print(word, ed)
You will get the output:

```
apple 7
bag 6
drawing 4
listing 1
linking 2
living 2
lighting 1
orange 6
walking 4
zoo 7
```

The smallest distances, 1 for "listing" and "lighting", mark the closest suggestions.
Example #3
Sentence or paragraph comparison is useful in applications like plagiarism detection (to know if one article is a stolen version of another article), and translation memory systems (that save previously translated sentences and when there is a new untranslated sentence, the system retrieves a similar one that can be slightly edited by a human translator instead of translating the new sentence from scratch).
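As a toy illustration of the translation-memory idea, the sketch below (hypothetical helper names, pure Python so it is self-contained) retrieves the stored sentence closest to a new one by edit distance; a human translator would then lightly post-edit the translation attached to it:

```python
def edit_distance(a, b):
    # Classic Wagner-Fischer dynamic program.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def closest_match(new_sentence, memory):
    """Return the stored sentence with the smallest edit distance."""
    return min(memory, key=lambda s: edit_distance(new_sentence, s))

memory = [
    "It might help to re-install Python if possible.",
    "I love Python programming.",
]
print(closest_match("It might help to reinstall Python.", memory))
```

A real translation memory would index millions of sentences and use faster filtering before computing exact distances, but the core matching step is this comparison.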
```python
import nltk

sent1 = "It might help to re-install Python if possible."
sent2 = "It can help to install Python again if possible."
sent3 = "It can be so helpful to reinstall C++ if possible."
sent4 = "help It possible Python to re-install if might."  # This has the same words as sent1 with a different order.
sent5 = "I love Python programming."

ed_sent_1_2 = nltk.edit_distance(sent1, sent2)
ed_sent_1_3 = nltk.edit_distance(sent1, sent3)
ed_sent_1_4 = nltk.edit_distance(sent1, sent4)
ed_sent_1_5 = nltk.edit_distance(sent1, sent5)

print(ed_sent_1_2, 'Edit Distance between sent1 and sent2')
print(ed_sent_1_3, 'Edit Distance between sent1 and sent3')
print(ed_sent_1_4, 'Edit Distance between sent1 and sent4')
print(ed_sent_1_5, 'Edit Distance between sent1 and sent5')
```
You will get the output:
```
14 Edit Distance between sent1 and sent2
19 Edit Distance between sent1 and sent3
32 Edit Distance between sent1 and sent4
33 Edit Distance between sent1 and sent5
```
Jaccard Distance
Jaccard Distance treats each string as a set of items (characters or tokens) and ignores order. The Jaccard similarity of two sets X and Y is the size of their intersection divided by the size of their union:

```
J(X,Y) = |X ∩ Y| / |X ∪ Y|
```
The Jaccard Distance is then one minus the similarity:

```
D(X,Y) = 1 - J(X,Y)
```
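These two formulas translate directly into Python. The sketch below (assuming non-empty sets) computes the same quantity that NLTK's jaccard_distance should return:

```python
def jaccard_distance(x, y):
    """D(X, Y) = 1 - |X ∩ Y| / |X ∪ Y| for two non-empty sets."""
    intersection = len(x & y)
    union = len(x | y)
    return 1 - intersection / union

w1 = set('mapping')   # {'m', 'a', 'p', 'i', 'n', 'g'} — duplicates collapse
w2 = set('mappings')  # w1 plus {'s'}
print(jaccard_distance(w1, w2))  # 1 - 6/7 ≈ 0.143
```

Note that converting a string to a set collapses repeated characters, so Jaccard Distance is insensitive to both order and counts.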
Jaccard Distance Python NLTK
The good news is that the NLTK library has the Jaccard Distance algorithm ready to use. Let’s take some examples.
Example #1
```python
import nltk

w1 = set('mapping')
w2 = set('mappings')
nltk.jaccard_distance(w1, w2)
```
Unlike Edit Distance, you cannot just run Jaccard Distance on the strings directly; you must first convert them to the set type.
Example #2
Basic Spelling Checker: This is the same example we used with the Edit Distance algorithm; now we test it with Jaccard Distance. Let's assume you have a misspelled word and a list of candidate words, and you want to find the nearest suggestion.
```python
import nltk

mistake = "ligting"
words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living',
         'lighting', 'orange', 'walking', 'zoo']

for word in words:
    jd = nltk.jaccard_distance(set(mistake), set(word))
    print(word, jd)
```
You will get a distance between 0 and 1 for each word; the smallest values, 1/6 ≈ 0.167 for both "listing" and "lighting", again mark the closest suggestions.
Example #3
If you are wondering if there is a difference between the output of Edit Distance and Jaccard Distance, see this example.
```python
import nltk

sent1 = set("It might help to re-install Python if possible.")
sent2 = set("It can help to install Python again if possible.")
sent3 = set("It can be so helpful to reinstall C++ if possible.")
sent4 = set("help It possible Python to re-install if might.")  # This has the same words as sent1 with a different order.
sent5 = set("I love Python programming.")

jd_sent_1_2 = nltk.jaccard_distance(sent1, sent2)
jd_sent_1_3 = nltk.jaccard_distance(sent1, sent3)
jd_sent_1_4 = nltk.jaccard_distance(sent1, sent4)
jd_sent_1_5 = nltk.jaccard_distance(sent1, sent5)

print(jd_sent_1_2, 'Jaccard Distance between sent1 and sent2')
print(jd_sent_1_3, 'Jaccard Distance between sent1 and sent3')
print(jd_sent_1_4, 'Jaccard Distance between sent1 and sent4')
print(jd_sent_1_5, 'Jaccard Distance between sent1 and sent5')
```
You will get the result. Note sent4 in particular: it contains exactly the same characters as sent1 (the same words, spaces, and period, just reordered), so its Jaccard Distance to sent1 is 0.0, even though its Edit Distance was 32. Unlike Edit Distance, Jaccard Distance on character sets ignores order entirely.
Tokenization
The token-level examples below also need tokenization, i.e. splitting a sentence into word and punctuation tokens, which NLTK provides (you may need to run nltk.download('punkt') first):

```python
import nltk

sent = "I love Python programming."
tokens = nltk.word_tokenize(sent)  # ['I', 'love', 'Python', 'programming', '.']
```
n-gram
In general, an n-gram means splitting a string into contiguous sequences of length n. So for the string "abcde", the bigrams (n = 2) are: ab, bc, cd, and de; the trigrams (n = 3) are: abc, bcd, and cde; and the 4-grams are: abcd and bcde.
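A minimal character-level sketch of this splitting (a toy version; nltk.ngrams, used below, yields tuples rather than joined strings):

```python
def char_ngrams(text, n):
    """All contiguous substrings of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams('abcde', 2))  # ['ab', 'bc', 'cd', 'de']
print(char_ngrams('abcde', 3))  # ['abc', 'bcd', 'cde']
print(char_ngrams('abcde', 4))  # ['abcd', 'bcde']
```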
n-grams can be used with Jaccard Distance. n-grams are also useful in their own right in applications such as machine translation, where you want to find out which phrase in one language usually appears as the translation of a given phrase in the target language.
Back to Jaccard Distance, let’s see how to use n-grams on the string directly, i.e. on the character level, or after tokenization, i.e. on the token level.
Example #1: Character Level
```python
import nltk

sent1 = "It might help to re-install Python if possible."
sent2 = "It can help to install Python again if possible."
sent3 = "It can be so helpful to reinstall C++ if possible."
sent4 = "help It possible Python to re-install if might."  # This has the same words as sent1 with a different order.
sent5 = "I love Python programming."

ng1_chars = set(nltk.ngrams(sent1, n=3))
ng2_chars = set(nltk.ngrams(sent2, n=3))
ng3_chars = set(nltk.ngrams(sent3, n=3))
ng4_chars = set(nltk.ngrams(sent4, n=3))
ng5_chars = set(nltk.ngrams(sent5, n=3))

jd_sent_1_2 = nltk.jaccard_distance(ng1_chars, ng2_chars)
jd_sent_1_3 = nltk.jaccard_distance(ng1_chars, ng3_chars)
jd_sent_1_4 = nltk.jaccard_distance(ng1_chars, ng4_chars)
jd_sent_1_5 = nltk.jaccard_distance(ng1_chars, ng5_chars)

print(jd_sent_1_2, "Jaccard Distance between sent1 and sent2 with ngram 3")
print(jd_sent_1_3, "Jaccard Distance between sent1 and sent3 with ngram 3")
print(jd_sent_1_4, "Jaccard Distance between sent1 and sent4 with ngram 3")
print(jd_sent_1_5, "Jaccard Distance between sent1 and sent5 with ngram 3")
```
Example #2: Token Level
```python
import nltk

sent1 = "It might help to re-install Python if possible."
sent2 = "It can help to install Python again if possible."
sent3 = "It can be so helpful to reinstall C++ if possible."
sent4 = "help It possible Python to re-install if might."  # This has the same words as sent1 with a different order.
sent5 = "I love Python programming."

tokens1 = nltk.word_tokenize(sent1)
tokens2 = nltk.word_tokenize(sent2)
tokens3 = nltk.word_tokenize(sent3)
tokens4 = nltk.word_tokenize(sent4)
tokens5 = nltk.word_tokenize(sent5)

ng1_tokens = set(nltk.ngrams(tokens1, n=3))
ng2_tokens = set(nltk.ngrams(tokens2, n=3))
ng3_tokens = set(nltk.ngrams(tokens3, n=3))
ng4_tokens = set(nltk.ngrams(tokens4, n=3))
ng5_tokens = set(nltk.ngrams(tokens5, n=3))

jd_sent_1_2 = nltk.jaccard_distance(ng1_tokens, ng2_tokens)
jd_sent_1_3 = nltk.jaccard_distance(ng1_tokens, ng3_tokens)
jd_sent_1_4 = nltk.jaccard_distance(ng1_tokens, ng4_tokens)
jd_sent_1_5 = nltk.jaccard_distance(ng1_tokens, ng5_tokens)

print(jd_sent_1_2, "Jaccard Distance between tokens1 and tokens2 with ngram 3")
print(jd_sent_1_3, "Jaccard Distance between tokens1 and tokens3 with ngram 3")
print(jd_sent_1_4, "Jaccard Distance between tokens1 and tokens4 with ngram 3")
print(jd_sent_1_5, "Jaccard Distance between tokens1 and tokens5 with ngram 3")
```
You can run the two snippets and compare the results. Again, which algorithm to choose depends entirely on what you want to do.
If you have questions, please feel free to write them in a comment below.
Machine Translation Researcher and Translation Technology Consultant