Lexical Resources and NLP Pipeline

Now that you have learned the basics of NLTK and how to manipulate corpora, you will learn important NLP concepts that you will use throughout the following tutorials.

Lexical Resources Terms

A lexical resource is a database containing several dictionaries or corpora. These dictionaries can be lists of stop words, homonyms, common words, and so on. Before starting this tutorial, there are some definitions you should know.

Tokenization is the act of breaking up a sequence of strings into pieces called tokens. These tokens can be something like words or keywords. You will see more about this later in this tutorial.

Homonyms are two distinct words that share the same spelling. Homonyms can create a problem when dealing with language because it is difficult to process and distinguish them automatically. An example is “pike” (the fish) and “pike” (the weapon).

Stop words are commonly used words that a search engine filters out before processing. Examples of these words are “the”, “a” and “is”.

Understanding Lexical Resources Using NLTK

To understand these concepts better and get started with lexical resources, let’s create a file named lexical-resources-vocabulary.py. The function you will write in this file is the unusual_words function, which takes a list of words and returns the ones that are not commonly used.

Let’s start with the imports:

Now the function itself:

In the function, you can see that there are three variables: text_vocab, english_vocab and unusual_list.

The first variable is a normalized version of the input list. As you may remember from the previous tutorials, if you search for the word “the” in a text that is not normalized, you may not get a result even if the word appears in the text.

The english_vocab variable is a set containing all the common words in English. As you can see, the nltk module has a function that returns a list of those words, nltk.corpus.words.words().

The last variable, unusual_list , contains the difference between text_vocab  and english_vocab , in other words, all the words in text_vocab  that are not in english_vocab .

You can test your code by adding the following lines:

If you run the code, you will get a list of the unusual words present in the text.

List of unusual words

The nltk module also includes a corpus of stop words. If you want to see the stop words in English, for example, you can use the following code:

Now that you are familiar with the concept of a lexical resource, you can move on to the NLP pipeline.

NLP Pipeline

Before you start processing raw text, you first need to be familiar with the NLP architecture. Let’s say you want to process the text on this site (https://www.nrdc.org/stories/global-warming-101). You cannot simply analyze the raw HTML of the page, because it contains things you do not actually want, like images or icons.

So what do you have to do here? First, you download the HTML, then you tokenize the text and finally, you normalize the words. This is what we call the NLP pipeline. Since web scraping is not the focus of this tutorial, let’s start the NLP pipeline with tokenization.

Note: You will be using the play you downloaded in the previous tutorial, Taming of the Shrew, so if you have not downloaded it yet, you can get it from this link.


You will start tokenization with two functions from the nltk.tokenize module: word_tokenize and sent_tokenize. As their names suggest, the first function divides your text into words and the second divides it into sentences. Both functions take a string as a parameter and return a list of tokens.

You will start the example by creating a file named tokenization.py and adding the following lines:
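A sketch of those lines; the helper name read_file and the UTF-8 encoding are assumptions:

```python
from nltk.tokenize import word_tokenize, sent_tokenize

def read_file(path):
    # Read the whole file and return its contents as a single string
    with open(path, encoding='utf-8') as f:
        return f.read()
```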

With these lines, you import the word_tokenize and sent_tokenize functions from the nltk.tokenize module and create a function that reads your file and returns it as a string.

Now, you can use the function with a single line of code:

And this will divide your text into word tokens. You can print the variable to see the results.

Words present in the text

If you want all the words to be unique, you can transform the list into a set with:
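For example (the token list is a small made-up stand-in for the output of word_tokenize):

```python
# `words` stands in for the token list produced by word_tokenize
words = ['Kate', 'the', 'curst', 'the', 'shrew']
unique_words = set(words)
print(unique_words)
```

Note that a set drops duplicates but does not merge different casings such as “The” and “the”; that is what the normalization step is for.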

And you can see how many unique words the text has as well:
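For example, with the same stand-in token list:

```python
words = ['Kate', 'the', 'curst', 'the', 'shrew']
# len() of the set gives the number of unique tokens
print(len(set(words)))  # prints 4
```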

It’s a good habit to transform the list of words into a set, because repeated words add overhead when you are analyzing your text.

You can use sent_tokenize in the same way as the word_tokenize function, but it returns a list of sentences. Note that you can use the strip() function to remove the “\n”, “\t” and blank spaces from the sentences.

This is all for this tutorial and if you have any questions, feel free to leave a comment below.

NLTK Course

Join our comprehensive NLTK course and learn how to create sophisticated applications using NLTK, including a Gender Predictor, a Document Classifier, a Spelling Checker, a Plagiarism Detector, and a Translation Memory system.
