Edit Distance

Edit Distance (a.k.a. Levenshtein Distance) is a measure of similarity between two strings referred to as the source string and the target string.

The distance between the source string and the target string is the minimum number of edit operations (deletions, insertions, or substitutions) required to transform the source into the target. The lower the distance, the more similar the two strings.
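To make the definition concrete, here is a minimal pure-Python sketch of the classic dynamic-programming approach to Levenshtein distance. The function name `levenshtein` is only illustrative, and every operation is assumed to cost 1; this is not NLTK's implementation, just a way to see how the minimum number of operations can be counted.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of insertions, deletions and substitutions (cost 1 each)."""
    # previous[j] is the distance between the source prefix processed so far
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i source characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                       # delete s_char
                current[j - 1] + 1,                    # insert t_char
                previous[j - 1] + (s_char != t_char),  # substitute (free on a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("mapping", "mappings"))  # 1
print(levenshtein("kitten", "sitting"))    # 3
```

Only two rows of the table are kept at a time, so memory grows with the length of the target string rather than with the product of both lengths.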

Common applications of the Edit Distance algorithm include spell checking, plagiarism detection, and translation memory systems.

Edit Distance Python NLTK

The NLTK library has the Edit Distance algorithm ready to use. Let’s look at some examples.

Example #1

import nltk

w1 = 'mapping'
w2 = 'mappings'

print(nltk.edit_distance(w1, w2))

The output is 1 because the difference between “mapping” and “mappings” is only one character, “s”.

Example #2

Basic Spelling Checker: Suppose you have a misspelled word and a list of candidate words, and you want to find the closest suggestion.

import nltk

mistake = "ligting"

words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living', 'lighting', 'orange', 'walking', 'zoo']

for word in words:
    ed = nltk.edit_distance(mistake, word)
    print(word, ed)

You will get the output:

apple 7
bag 6
drawing 4
listing 1
linking 2
living 2
lighting 1
orange 6
walking 4
zoo 7

As you can see, when the misspelled word “ligting” is compared with each word in our list, the smallest Edit Distance is 1, for the words “listing” and “lighting”, which makes them the best spelling suggestions for “ligting”. A smaller Edit Distance between two strings means they are more similar.
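Rather than reading the best candidates off the printed list, you can let the code pick them. The sketch below reuses the same `mistake` and `words`; the names `distances`, `best`, and `suggestions` are only illustrative.

```python
import nltk

mistake = "ligting"
words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living',
         'lighting', 'orange', 'walking', 'zoo']

# Compute each distance once, then keep every word tied for the minimum.
distances = {word: nltk.edit_distance(mistake, word) for word in words}
best = min(distances.values())
suggestions = [word for word, ed in distances.items() if ed == best]

print(best, suggestions)  # 1 ['listing', 'lighting']
```

Keeping all ties matters here: Edit Distance alone cannot tell whether “listing” or “lighting” was intended.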

Bonus: You can get a list of words from several sources, such as:

• NLTK: words = nltk.corpus.words.words()

• Check answers of this question.

• Check lists at Kaggle.

• Google: Search for “list of English words”.

Example #3

Sentence or paragraph comparison is useful in applications such as plagiarism detection (to determine whether one article is a copied version of another) and translation memory systems (which store previously translated sentences; when a new untranslated sentence arrives, the system retrieves a similar one that a human translator can lightly edit instead of translating the new sentence from scratch).

import nltk

sent1 = "It might help to re-install Python if possible."
sent2 = "It can help to install Python again if possible."
sent3 = "It can be so helpful to reinstall C++ if possible."
sent4 = "help It possible Python to re-install if might." # This has the same words as sent1 with a different order.
sent5 = "I love Python programming."

ed_sent_1_2 = nltk.edit_distance(sent1, sent2)
ed_sent_1_3 = nltk.edit_distance(sent1, sent3)
ed_sent_1_4 = nltk.edit_distance(sent1, sent4)
ed_sent_1_5 = nltk.edit_distance(sent1, sent5)


print(ed_sent_1_2, 'Edit Distance between sent1 and sent2')
print(ed_sent_1_3, 'Edit Distance between sent1 and sent3')
print(ed_sent_1_4, 'Edit Distance between sent1 and sent4')
print(ed_sent_1_5, 'Edit Distance between sent1 and sent5')

You will get the output:

14 Edit Distance between sent1 and sent2
19 Edit Distance between sent1 and sent3
32 Edit Distance between sent1 and sent4
33 Edit Distance between sent1 and sent5

So it is clear that sent1 and sent2 are more similar to each other than sent1 is to any of the other sentences.

Jaccard Distance

Jaccard Distance is a measure of how dissimilar two sets are. The lower the distance, the more similar the two sets.

Jaccard Distance depends on another concept called the “Jaccard Similarity Index”, which is (the number of items in both sets) / (the number of items in either set). It is a ratio between 0 and 1 and can be multiplied by 100 to express it as a percentage.

J(X,Y) = |X∩Y| / |X∪Y|

Then we can calculate the Jaccard Distance as follows:

D(X,Y) = 1 – J(X,Y)

For example, take the two strings “mapping” and “mappings”. The word “mapping” has 7 characters, but “p” is repeated and a set keeps only unique characters, so its set has 6 elements: m, a, p, i, n, g. The set for “mappings” has those same 6 plus “s”. The intersection of the two sets therefore has 6 elements and the union has 7, so the Jaccard Similarity Index is 6/7 ≈ 0.857 and the Jaccard Distance is 1 – 0.857 = 0.143.
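The arithmetic above can be checked with plain Python sets. This is a minimal sketch of the formula itself, not NLTK's code; the function names `jaccard_similarity` and `jaccard_distance` are chosen here for illustration.

```python
def jaccard_similarity(x: set, y: set) -> float:
    """|X ∩ Y| / |X ∪ Y| for two non-empty sets."""
    return len(x & y) / len(x | y)


def jaccard_distance(x: set, y: set) -> float:
    """D(X, Y) = 1 - J(X, Y)."""
    return 1 - jaccard_similarity(x, y)


x = set("mapping")   # the repeated 'p' collapses: 6 unique characters
y = set("mappings")  # the same 6 characters plus 's'

print(len(x & y), len(x | y))               # 6 7
print(round(jaccard_similarity(x, y), 3))   # 0.857
print(round(jaccard_distance(x, y), 3))     # 0.143
```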

Jaccard Distance Python NLTK

The good news is that the NLTK library has the Jaccard Distance algorithm ready to use. Let’s take some examples.

Example #1

import nltk

w1 = set('mapping')
w2 = set('mappings')

print(nltk.jaccard_distance(w1, w2))

Unlike Edit Distance, you cannot just run Jaccard Distance on the strings directly; you must first convert them to the set type.

Example #2

Basic Spelling Checker: This is the same example we used with the Edit Distance algorithm; now we are testing it with the Jaccard Distance algorithm. Suppose you have a misspelled word and a list of candidate words, and you want to find the closest suggestion.

import nltk

mistake = "ligting"

words = ['apple', 'bag', 'drawing', 'listing', 'linking', 'living', 'lighting', 'orange', 'walking', 'zoo']

for word in words:
    jd = nltk.jaccard_distance(set(mistake), set(word))
    print(word, jd)

You will get the output:

apple 0.875
bag 0.8571428571428571
drawing 0.6666666666666666
listing 0.16666666666666666
linking 0.3333333333333333
living 0.3333333333333333
lighting 0.16666666666666666
orange 0.7777777777777778
walking 0.5
zoo 1.0

Again, when the misspelled word “ligting” is compared with each word in our list, the smallest Jaccard Distance is about 0.167, for the words “listing” and “lighting”, which makes them the best spelling suggestions for “ligting”.

Example #3

If you are wondering if there is a difference between the output of Edit Distance and Jaccard Distance, see this example.

import nltk

sent1 = set("It might help to re-install Python if possible.")
sent2 = set("It can help to install Python again if possible.")
sent3 = set("It can be so helpful to reinstall C++ if possible.")
sent4 = set("help It possible Python to re-install if might.") # This has the same words as sent1 with a different order.
sent5 = set("I love Python programming.")

jd_sent_1_2 = nltk.jaccard_distance(sent1, sent2)
jd_sent_1_3 = nltk.jaccard_distance(sent1, sent3)
jd_sent_1_4 = nltk.jaccard_distance(sent1, sent4)
jd_sent_1_5 = nltk.jaccard_distance(sent1, sent5)


print(jd_sent_1_2, 'Jaccard Distance between sent1 and sent2')
print(jd_sent_1_3, 'Jaccard Distance between sent1 and sent3')
print(jd_sent_1_4, 'Jaccard Distance between sent1 and sent4')
print(jd_sent_1_5, 'Jaccard Distance between sent1 and sent5')

You will get the result:

0.18181818181818182 Jaccard Distance between sent1 and sent2
0.36 Jaccard Distance between sent1 and sent3
0.0 Jaccard Distance between sent1 and sent4
0.22727272727272727 Jaccard Distance between sent1 and sent5

As with Edit Distance, sent2 is closer to sent1 than sent3 or sent5 is. However, look at the other results; they are quite different. For example, sent5 now ranks closer to sent1 than sent3 does, the reverse of the Edit Distance ranking. The most obvious difference is sent4: its Edit Distance from sent1 is 32, yet its Jaccard Distance is zero, which means the Jaccard Distance algorithm sees the two sentences as identical. The reason is that Edit Distance counts edit operations from the start to the end of the string, so character order matters, while Jaccard Distance only compares the sets of unique characters, ignoring order and repetition, and then applies the calculation described above. There is no “right” or “wrong” answer; it all depends on what you need to do.
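A quick way to see why the Jaccard Distance between sent1 and sent4 is zero is to compare their characters directly. The short check below uses only the standard library; `Counter` is included to show that the two sentences even contain the same characters the same number of times, just in a different order.

```python
from collections import Counter

sent1 = "It might help to re-install Python if possible."
sent4 = "help It possible Python to re-install if might."

print(sent1 == sent4)                    # False: the strings differ
print(set(sent1) == set(sent4))          # True: same unique characters
print(Counter(sent1) == Counter(sent4))  # True: same characters, same counts
```

Because the two sets are equal, their intersection and union have the same size, so the similarity index is 1 and the distance is 0.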

Tokenization

If you want to work at the word level instead of the character level, you might want to apply tokenization before calculating Edit Distance and Jaccard Distance. This can be useful if you want to exclude specific types of tokens or if you want to run preprocessing steps such as lemmatization or stemming.

tokens = nltk.word_tokenize(sent)
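To illustrate what working at the word level changes, here is a small sketch that computes Jaccard Distance on sets of words instead of sets of characters. To stay self-contained it does not call nltk.word_tokenize; the helper `simple_tokens` is a hypothetical regex-based stand-in that drops punctuation, so its tokens will differ from NLTK's (which keeps the final “.” as a separate token).

```python
import re


def simple_tokens(sentence: str) -> list:
    # Hypothetical stand-in for a real tokenizer: keeps runs of letters,
    # digits, '+' and '-', and drops all other punctuation.
    return re.findall(r"[A-Za-z0-9+-]+", sentence)


def jaccard_distance(x: set, y: set) -> float:
    return 1 - len(x & y) / len(x | y)


sent1 = "It might help to re-install Python if possible."
sent2 = "It can help to install Python again if possible."
sent4 = "help It possible Python to re-install if might."

words1 = set(simple_tokens(sent1))
words2 = set(simple_tokens(sent2))
words4 = set(simple_tokens(sent4))

print(round(jaccard_distance(words1, words2), 3))  # 0.455 (6 shared words out of 11)
print(jaccard_distance(words1, words4))            # 0.0 (same words, different order)
```

At the word level sent1 and sent2 are noticeably farther apart than at the character level, while sent4 still has distance zero because a set of words ignores order just as a set of characters does.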

n-gram

In general, an n-gram is a contiguous sequence of n items, and extracting n-grams means splitting a string into overlapping sequences of length n. For the string “abcde”, the bigrams are ab, bc, cd, and de; the trigrams are abc, bcd, and cde; and the 4-grams are abcd and bcde.

n-grams can be used with Jaccard Distance. n-grams are also useful in other applications such as machine translation, for example when you want to find out which phrase in the target language usually appears as the translation of a given phrase in the source language.
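The “abcde” example can be reproduced with a few lines of plain Python. The helper name `char_ngrams` is only illustrative; it returns the n-grams as strings, whereas nltk.ngrams yields each n-gram as a tuple of items.

```python
def char_ngrams(text: str, n: int) -> list:
    """All overlapping substrings of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("abcde", 2))  # ['ab', 'bc', 'cd', 'de']
print(char_ngrams("abcde", 3))  # ['abc', 'bcd', 'cde']
print(char_ngrams("abcde", 4))  # ['abcd', 'bcde']
```

A string of length L has L - n + 1 n-grams, so a string shorter than n produces an empty list.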

Back to Jaccard Distance, let’s see how to use n-grams on the string directly, i.e. on the character level, or after tokenization, i.e. on the token level.

Example #1: Character Level

import nltk

sent1 = "It might help to re-install Python if possible."
sent2 = "It can help to install Python again if possible."
sent3 = "It can be so helpful to reinstall C++ if possible."
sent4 = "help It possible Python to re-install if might." # This has the same words as sent1 with a different order.
sent5 = "I love Python programming."


ng1_chars = set(nltk.ngrams(sent1, n=3))
ng2_chars = set(nltk.ngrams(sent2, n=3))
ng3_chars = set(nltk.ngrams(sent3, n=3))
ng4_chars = set(nltk.ngrams(sent4, n=3))
ng5_chars = set(nltk.ngrams(sent5, n=3))

jd_sent_1_2 = nltk.jaccard_distance(ng1_chars, ng2_chars)
jd_sent_1_3 = nltk.jaccard_distance(ng1_chars, ng3_chars)
jd_sent_1_4 = nltk.jaccard_distance(ng1_chars, ng4_chars)
jd_sent_1_5 = nltk.jaccard_distance(ng1_chars, ng5_chars)

print(jd_sent_1_2, "Jaccard Distance between sent1 and sent2 with ngram 3")
print(jd_sent_1_3, "Jaccard Distance between sent1 and sent3 with ngram 3")
print(jd_sent_1_4, "Jaccard Distance between sent1 and sent4 with ngram 3")
print(jd_sent_1_5, "Jaccard Distance between sent1 and sent5 with ngram 3")

Example #2: Token Level

import nltk

sent1 = "It might help to re-install Python if possible."
sent2 = "It can help to install Python again if possible."
sent3 = "It can be so helpful to reinstall C++ if possible."
sent4 = "help It possible Python to re-install if might." # This has the same words as sent1 with a different order.
sent5 = "I love Python programming."

tokens1 = nltk.word_tokenize(sent1)
tokens2 = nltk.word_tokenize(sent2)
tokens3 = nltk.word_tokenize(sent3)
tokens4 = nltk.word_tokenize(sent4)
tokens5 = nltk.word_tokenize(sent5)

ng1_tokens = set(nltk.ngrams(tokens1, n=3))
ng2_tokens = set(nltk.ngrams(tokens2, n=3))
ng3_tokens = set(nltk.ngrams(tokens3, n=3))
ng4_tokens = set(nltk.ngrams(tokens4, n=3))
ng5_tokens = set(nltk.ngrams(tokens5, n=3))

jd_sent_1_2 = nltk.jaccard_distance(ng1_tokens, ng2_tokens)
jd_sent_1_3 = nltk.jaccard_distance(ng1_tokens, ng3_tokens)
jd_sent_1_4 = nltk.jaccard_distance(ng1_tokens, ng4_tokens)
jd_sent_1_5 = nltk.jaccard_distance(ng1_tokens, ng5_tokens)

print(jd_sent_1_2, "Jaccard Distance between tokens1 and tokens2 with ngram 3")
print(jd_sent_1_3, "Jaccard Distance between tokens1 and tokens3 with ngram 3")
print(jd_sent_1_4, "Jaccard Distance between tokens1 and tokens4 with ngram 3")
print(jd_sent_1_5, "Jaccard Distance between tokens1 and tokens5 with ngram 3")

You can run both code samples and compare the results. Again, which algorithm to choose depends on what you want to do.
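If you want to compare the two levels side by side, the repeated lines can be folded into one small helper. This is only a sketch: `ngram_jaccard` and the `others` dictionary are names introduced here, and the token level uses str.split() as a simplification so that no tokenizer data is needed (nltk.word_tokenize would also split off punctuation, so its numbers will differ).

```python
import nltk

sent1 = "It might help to re-install Python if possible."
others = {
    "sent2": "It can help to install Python again if possible.",
    "sent3": "It can be so helpful to reinstall C++ if possible.",
    "sent4": "help It possible Python to re-install if might.",
    "sent5": "I love Python programming.",
}


def ngram_jaccard(a, b, n=3):
    """Jaccard Distance between the sets of n-grams of two sequences."""
    return nltk.jaccard_distance(set(nltk.ngrams(a, n)), set(nltk.ngrams(b, n)))


for name, sent in others.items():
    char_level = ngram_jaccard(sent1, sent)                   # trigrams of characters
    token_level = ngram_jaccard(sent1.split(), sent.split())  # trigrams of words
    print(name, round(char_level, 3), round(token_level, 3))
```

Unlike plain character sets, trigrams preserve some local order, so sent4 no longer comes out as identical to sent1.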

If you have questions, please feel free to write them in a comment below.

Yasmin Moslem

Machine Translation Researcher and Translation Technology Consultant

machinetranslation.io/

Edit Distance and Jaccard Distance Calculation with NLTK
