NLTK Regular Expressions

You know how to tokenize text, but now what can you do with it? In this tutorial, you will learn how to use regular expressions along with NLTK.

Regular Expressions with NLTK

Assuming you have some background in regular expressions, this section focuses on using the search function from the re module.

To start this tutorial, create a new Python file and import the modules used throughout, re and nltk.

You will use the same text as in the previous tutorial, “Taming of the Shrew”, and the same read_file function, so add them to your code:
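
The read_file function comes from the previous tutorial and is not reproduced here. A minimal sketch of what such a helper might look like, assuming it takes a path and returns the file contents as one string (the signature and the UTF-8 encoding are assumptions, not the original code):

```python
def read_file(filename):
    """Return the whole contents of a text file as a single string."""
    # Assumption: the text is stored as UTF-8; adjust the encoding if needed.
    with open(filename, "r", encoding="utf-8") as file:
        return file.read()
```

You would call it with the path to your own copy of the play.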

As you saw in the previous tutorial, you have to normalize the text and tokenize it to get its words, so add the following code:
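
The exact normalization and tokenization code from the previous tutorial is not shown here. As a rough stand-in, the sketch below lowercases the text and extracts words with a plain regular expression instead of an NLTK tokenizer. The function name normalize_and_tokenize and the sample sentence are hypothetical.

```python
import re

def normalize_and_tokenize(text):
    """Lowercase the text and split it into word tokens."""
    lowered = text.lower()
    # Stand-in tokenizer: runs of letters, optionally joined by inner apostrophes.
    return re.findall(r"[a-z]+(?:'[a-z]+)*", lowered)

# Hypothetical sample sentence, used only for illustration.
words = normalize_and_tokenize("Kiss me, Kate! We'll be married o' Sunday.")
print(words)
```

A real tokenizer handles many more cases (numbers, hyphens, abbreviations), so treat this only as a way to get a list named words for the examples that follow.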

Now, let’s talk about the search function. It lives in the re module and takes two parameters: the first is a RegEX pattern and the second is the string to which you want to apply the pattern. It returns a match object for the first match it finds, or None if there is no match. For example, let’s say you want to check whether the string "abc def" starts with “a”. The code you will write is:
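
A minimal sketch of that call, assuming the pattern ^a (the anchor ^ ties the match to the start of the string):

```python
import re

# "^a" matches an "a" only at the very start of the string.
match = re.search(r"^a", "abc def")
print(match)          # a match object, because the string starts with "a"
print(match.group())  # the text that was matched

# search returns None when the pattern is not found.
no_match = re.search(r"^a", "def abc")
print(no_match)
```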

A useful thing to note is that, because search returns None when nothing matches, you can use it directly in an if statement. The following code prints a message if the pattern is found:
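
One way this might look, wrapped in a small function so it is easy to try on different strings (the function name starts_with_a and the messages are made up for this sketch):

```python
import re

def starts_with_a(text):
    """Report whether the text starts with the letter "a"."""
    # A match object is truthy; None (no match) is falsy.
    if re.search(r"^a", text):
        return "Pattern found in: " + text
    return "Pattern not found in: " + text

print(starts_with_a("abc def"))
print(starts_with_a("def abc"))
```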

You can also use it in list comprehensions to find, for example, words that end with “ed”:
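
A short sketch of such a list comprehension. The words list here is a hypothetical sample standing in for the tokenized play:

```python
import re

# Hypothetical sample tokens; in the tutorial, words comes from the tokenized text.
words = ["tamed", "shrew", "married", "bed", "wedding", "kissed"]

# "ed$" matches "ed" only at the end of the word.
words_ending_with_ed = [w for w in words if re.search(r"ed$", w)]
print(words_ending_with_ed)
```

Note that the pattern also accepts short words such as "bed", since it only looks at the last two letters.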

Speaking of RegEX, you cannot forget about ranges and closures. Let’s say that you want to find all words that end with one or more “e”. You can use the + operator:
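
For instance, with a small hypothetical word list:

```python
import re

# Hypothetical sample tokens, used only for illustration.
words = ["the", "free", "see", "shrew", "eel", "whee"]

# "e+$" requires at least one "e" immediately before the end of the word.
words_ending_with_e = [w for w in words if re.search(r"e+$", w)]
print(words_ending_with_e)

# The match covers the whole run of trailing "e" characters.
print(re.search(r"e+$", "free").group())
```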

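Similarly, you can use the * operator to search for zero or more occurrences of a certain pattern. So if you want to see all words that end with zero or more “e”, you can use the code below. Keep in mind that, because zero occurrences are allowed, this pattern matches every word: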

words_that_may_end_with_e = [w for w in words if re.search("e*$", w)]
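
To see the effect of allowing zero occurrences, here is a small self-contained sketch with a hypothetical word list:

```python
import re

# Hypothetical sample tokens, used only for illustration.
words = ["the", "shrew", "free", "kate"]

words_that_may_end_with_e = [w for w in words if re.search(r"e*$", w)]
print(words_that_may_end_with_e)

# For a word without a trailing "e", the match is the empty string at the end.
print(repr(re.search(r"e*$", "shrew").group()))
print(repr(re.search(r"e*$", "free").group()))
```

Every word passes the filter, which is why * is more useful in the middle of a pattern than as its only requirement.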

Applications of RegEX

Now that you are familiar with the search function, you are going to search through tokenized text using the findall method of the nltk.text.Text class. As you already saw, this class expects a list of words in its constructor:

This example uses a text from the Gutenberg corpus.

The findall method expects a regular expression as its parameter, but this expression differs slightly from a normal one. The Text class holds a tokenized list of words, so when you call findall you write the pattern token by token rather than character by character.

Let’s say you want to find three-token sequences that start with “a”, continue with any word, and end with “man”. You need one pattern per token, and the second pattern must accept any word, so the pattern to pass to findall is <a> <.*> <man>.

As you can see, each token pattern is enclosed in <> and you have to specify a RegEX for each token. To make this clearer, let’s look at another example using the nps_chat corpus.
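
To build some intuition for these token patterns, the sketch below imitates the idea in plain Python using only the re module. It is a simplified illustration, not NLTK's own code. The helper name find_token_sequences and the sample tokens are hypothetical.

```python
import re

def find_token_sequences(tokens, pattern):
    """Find token sequences matching a pattern such as "<a> <.*> <man>"."""
    # Wrap every token in angle brackets so token boundaries are visible.
    raw = "".join("<" + token + ">" for token in tokens)
    # Spaces in the pattern are only there for readability.
    regex = re.sub(r"\s", "", pattern)
    # Turn each <...> into a non-capturing group around one bracketed token.
    regex = regex.replace("<", "(?:<(?:").replace(">", ")>)")
    # Stop "." from matching across a token boundary.
    regex = re.sub(r"(?<!\\)\.", "[^>]", regex)
    # Strip the outer brackets and split each hit back into tokens.
    return [hit[1:-1].split("><") for hit in re.findall(regex, raw)]

# Hypothetical sample tokens, used only for illustration.
tokens = ["a", "wise", "man", "met", "a", "young", "man", "and", "a", "dog"]
print(find_token_sequences(tokens, "<a> <.*> <man>"))
```

The point to notice is that .* inside <> can only grow within a single token, which is why <.*> means "any one word".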

chat_obj = nltk.Text(nps_chat.words())

Let’s say you want to find three-word sequences that end with “bro”. Given that only the last word matters, you can use <.*> for the first two words, to accept anything, and <bro> for the last one. So the code you will use is:

chat_obj.findall(r"<.*> <.*> <bro>")

Now, let’s create our own nltk.text.Text object. To create a Text object, you need a list of words, so first create a string:

and tokenize it:

And now you can use the findall method:
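
The three steps above (create a string, tokenize it, call findall) might be put together as follows. The sample sentence is hypothetical, and str.split is used as a simple stand-in tokenizer so the sketch needs no extra NLTK data files.

```python
import nltk

# Hypothetical sample sentence, used only for illustration.
sentence = "a wise man and a young man met a tall woman"

# Stand-in tokenizer: split on whitespace.
tokens = sentence.split()

# Text accepts any list of word tokens.
my_text = nltk.Text(tokens)

# Prints every three-token sequence of the form "a <any word> man".
my_text.findall(r"<a> <.*> <man>")

# Other Text methods work too, for example counting a token.
print(my_text.count("man"))
```

Note that findall prints its results instead of returning them, so there is nothing to assign from that call.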

Note that since this is an nltk.text.Text object, you can use all the functions mentioned in the previous tutorials, such as concordance, similar and count.

That is all for this tutorial. If you have any questions, feel free to leave them in the comments below.


