Lexical Resources and NLP Pipeline

After learning the basics of nltk and how to manipulate corpora, you will learn important concepts in NLP that you will use throughout the following tutorials.

Lexical Resources Terms

A lexical resource is a database containing several dictionaries or corpora. Dictionaries are lists of stop words, homonyms, common words, and so on. Before starting this tutorial, there are some definitions you need to know.

Tokenization is the act of breaking up a sequence of text into pieces called tokens, such as words or keywords. You will see more about this later in this tutorial.

Homonyms are two distinct words that share the same spelling. Homonyms can create a problem when dealing with language because it is difficult to process and distinguish them. An example is “pike” (the fish) and “pike” (the weapon).

Stop words are commonly used words that a search engine or text processor filters out before processing. Examples of these words are “the”, “a” and “is”.

Understanding Lexical Resources Using NLTK

To better understand these concepts and get started with lexical resources, let’s create a file named lexical-resources-vocabulary.py. The function you will write in this file is the unusual_words function, which takes a list of words and returns the words that are not commonly used.

Let’s start with the imports:
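For this sketch, a single import is enough, since everything else is accessed through nltk.corpus:

    import nltk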

Following with the function itself:
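Based on the three variables described below, the function can look something like this (the isalpha() filter is just one way to drop punctuation and numbers, not part of the original description):

    def unusual_words(words):
        # Normalize the input: keep alphabetic tokens only and lowercase them
        text_vocab = set(w.lower() for w in words if w.isalpha())
        # The usual English words shipped with nltk (requires nltk.download('words'))
        english_vocab = set(w.lower() for w in nltk.corpus.words.words())
        # Everything in the text that is not in the English vocabulary
        unusual_list = text_vocab - english_vocab
        return sorted(unusual_list)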

In the function, you can see that there are three variables: text_vocab, english_vocab and unusual_list.

The first variable is a normalized version of the input list: the words are lowercased and stored in a set. If you remember from the previous tutorials, if you search for the word “the” in a text that is not normalized, you may not get a result even if the word appears in the text.
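A quick illustration of the problem (a standalone example, not part of the file):

    sample = ['The', 'cat', 'sat']
    print('the' in sample)                       # False: the capitalized "The" does not match
    print('the' in [w.lower() for w in sample])  # True once the words are lowercased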

The english_vocab variable is a set containing all the usual English words. As you can see, the nltk module has a function that returns a list of those words, nltk.corpus.words.words().

The last variable, unusual_list , contains the difference between text_vocab  and english_vocab , in other words, all the words in text_vocab  that are not in english_vocab .

You can test your code by adding the following lines:
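The exact test text is not specified here; as one possibility, you can run the function on a Gutenberg text bundled with nltk (the choice of austen-emma.txt is just an example):

    # Requires nltk.download('gutenberg')
    sample_words = nltk.corpus.gutenberg.words('austen-emma.txt')
    print(unusual_words(sample_words))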

If you run the code, you will get a list of the unusual words present in the text.

[Image: list of unusual words]

The nltk module also has a corpus of stop words. If you want to see the stop words in English, for example, you can use the following code:
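    from nltk.corpus import stopwords

    # The stop words corpus may need to be downloaded first: nltk.download('stopwords')
    print(stopwords.words('english'))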

Now that you are familiar with the concept of a lexical resource, you can continue with the NLP pipeline.

NLP Pipeline

Before you start processing raw text, you first need to be familiar with the architecture of an NLP pipeline. Let’s say you want to process the text on this site (https://www.nrdc.org/stories/global-warming-101). You cannot simply analyze the HTML of the site, because you do not actually want everything present on the page, like images or icons.

So what do you have to do here? First, you download the HTML, then you tokenize the text and, finally, you normalize the words. This is what we call the NLP pipeline. As web scraping is not the focus of this tutorial, let’s start the NLP pipeline with tokenization.

Note: You will be using the play that you downloaded in the previous tutorial, The Taming of the Shrew, so if you have not downloaded it yet, you can download it from this link.

Tokenization

You will start tokenization with two functions from the nltk.tokenize module: word_tokenize and sent_tokenize. As the names suggest, the first function divides your text into words and the second divides it into sentences. Both functions take a string as their parameter and return a list of tokens.

You will start the example by creating a file named tokenization.py and adding the following lines:
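A sketch of those lines (the helper name read_text and the encoding are placeholders, not fixed by the original):

    from nltk.tokenize import word_tokenize, sent_tokenize

    def read_text(path):
        # read_text is just an example name; it reads the whole file into a single string
        with open(path, 'r', encoding='utf-8') as f:
            return f.read()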

With these lines, you are importing the functions word_tokenize and sent_tokenize from the nltk.tokenize module and creating a function that reads your file and returns it as a string.

Now, you can use the function with a single line of code:
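For example, assuming the play was saved as taming_of_the_shrew.txt (adjust the name to match your file):

    # The file name below is a placeholder for the play you downloaded earlier
    words = word_tokenize(read_text('taming_of_the_shrew.txt'))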

This will divide your text into word tokens. You can print the variable to see the results.

[Image: words present in the text]

If you want all the words to be unique, you can transform the list into a set (the variable name below is just an example):
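    # unique_words is just an example name
    unique_words = set(words)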

And you can see how many unique words the text has as well:
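    # Number of distinct tokens in the text
    print(len(unique_words))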

It’s a good habit to transform the list of words into a set, because repeated words can create overhead when you are analyzing your text.

You can use sent_tokenize in the same way as word_tokenize, but it will return a list of sentences. Notice in the example below that you can use the strip() function to remove the “\n”, “\t” and blank spaces surrounding each sentence.
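A sketch of this, reusing the read_text helper and the placeholder file name from above:

    sentences = sent_tokenize(read_text('taming_of_the_shrew.txt'))
    # strip() removes the surrounding "\n", "\t" and spaces from each sentence
    clean_sentences = [sentence.strip() for sentence in sentences]
    print(clean_sentences[:5])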

This is all for this tutorial and if you have any questions, feel free to leave a comment below.

NLTK Course

Join our comprehensive NLTK course and learn how to create sophisticated applications using NLTK, including a Gender Predictor, a Document Classifier, a Spelling Checker, a Plagiarism Detector, and a Translation Memory system.

https://www.udemy.com/natural-language-processing-python-nltk/?couponCode=NLTK-BLOGS

 
