Chatbot Development with Python NLTK

Chatbots are intelligent agents that engage in conversation with humans in order to answer user queries on a certain topic. Amazon’s Alexa, Apple’s Siri, and Microsoft’s Cortana are some well-known examples of such conversational agents.

Depending upon their functionality, chatbots can be divided into three categories: general-purpose chatbots, task-oriented chatbots, and hybrid chatbots. General-purpose chatbots conduct a general discussion with the user (not on any specific topic). Task-oriented chatbots, on the other hand, are designed to perform specialized tasks, for example, to serve as an online ticket reservation or pizza delivery system. Finally, hybrid chatbots are designed for both general and task-oriented discussions.

Chatbot Development Approaches

There are two major approaches for developing chatbots: Rule-based approaches and learning-based approaches.

Rule-based Approaches

In rule-based approaches, there is a fixed set of responses available, and a response is selected based on a certain rule. For instance, if a user says “hello”, the chatbot might contain an if statement implementing the logic: whenever a user says “hello”, generate the response “hi, how are you?”. One advantage of the rule-based approach is that the responses are often highly accurate, since there is a hand-crafted response for each expected query. However, rule-based chatbots do not scale well and cannot reply to user inputs for which no rule is defined. To answer a large number of user queries, a large number of rules has to be implemented.

Learning-based Approaches

Learning-based approaches use statistical algorithms, such as machine learning algorithms, to learn from data and generate responses based on that learning. One advantage of learning-based approaches is that they scale well. However, they require a huge amount of training data and may not be very accurate.

Learning-based approaches are further divided into two categories: generative approaches and retrieval-based approaches. Generative chatbots learn to generate the words of a response to a user query. Retrieval-based chatbots, on the other hand, learn to select a complete, predefined response based on the user input. Generative approaches are more flexible than retrieval-based approaches, since the response is generated on the fly depending upon the user input.

 

Developing a Chatbot in Python

In this tutorial, we will develop a very simple task-oriented chatbot capable of answering questions related to global warming. The chatbot will be fairly simple and will generate answers based on cosine similarity.

Downloading Required Libraries

Before we can proceed with the code, we need to download the following libraries:

  1. Chatbot development falls into the broader category of Natural Language Processing. We will be using the natural language processing library NLTK to create our chatbot. The installation instructions for NLTK can be found at this official link.
  2. The dataset used for creating our chatbot will be the Wikipedia article on global warming. To scrape the article, we will use the BeautifulSoup library for Python. The download instructions for the library are available here.
  3. The BeautifulSoup library scrapes the data from a website in HTML format. To parse the HTML, we will use the lxml library. The download instructions for the library are available at the official link.

We will be using the Anaconda distribution of Python; the rest of the libraries are built into the Anaconda distribution, so you do not have to download them.

Let’s now start our chatbot development. We need to perform the following steps:

Importing the Required Libraries

The first step is to import the required libraries. Look at the following script:
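The original listing is not preserved here, so the following is a minimal sketch of the imports described below (the bs and np aliases are common conventions, not requirements):

```python
import re              # regex operations for cleaning the text
import random          # picking a random greeting response later on
import string          # punctuation constants used during preprocessing
import urllib.request  # opening a connection to the remote webpage

import bs4 as bs       # beautifulsoup4, for parsing the scraped page
import nltk            # natural language processing
import numpy as np     # basic array operations
```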

In the script above, we import the beautifulsoup4 library for parsing the webpage, the urllib library to make a connection to a remote webpage, the re library for performing regex operations, the nltk library for natural language processing, and the numpy library for basic array operations. The random library is used for random number generation; we will see how we use it later in the article. Finally, the string library is used for string manipulation.

Scraping and Preprocessing the Wikipedia Article

Once we have imported the required libraries, we can scrape the article from Wikipedia. As mentioned earlier, our chatbot will be able to answer questions related to global warming. To develop such a chatbot, we need a dataset that contains information about global warming; one source of such information is the Wikipedia article on the topic. The following script scrapes the article and extracts its paragraphs.
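A sketch of that script is shown below. The scrape_paragraphs helper is my own name for the parsing step (the lxml parser must be installed); the live fetch of the Wikipedia page is left commented out so the parsing can be demonstrated on a small inline snippet:

```python
import urllib.request

import bs4 as bs

def scrape_paragraphs(html):
    # Parse the HTML with the lxml parser and keep only the <p> tags.
    soup = bs.BeautifulSoup(html, "lxml")
    paragraphs = soup.find_all("p")
    # Join the paragraph texts and lowercase the result.
    return "".join(p.text for p in paragraphs).lower()

# Real dataset (requires network access):
# raw_html = urllib.request.urlopen(
#     "https://en.wikipedia.org/wiki/Global_warming").read()
# article_text = scrape_paragraphs(raw_html)

# Inline snippet demonstrating the parsing step:
sample_html = "<html><body><p>Global warming is REAL.</p><p>Second paragraph.</p></body></html>"
print(scrape_paragraphs(sample_html))
```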

In the script above, we first use the urlopen function from the urllib.request module to fetch the raw page from Wikipedia, and then call the read method to read the data. The data retrieved using the urlopen function is in binary format; to parse it as HTML, we can use the BeautifulSoup class and pass it our raw data along with the string "lxml", which selects the lxml parser. From the parsed HTML, we need to extract the paragraphs, which we can do using the find_all method; we pass it "p", the tag name for paragraphs. Finally, we join all the paragraphs and convert the final text to lowercase.

We will remove numbers from our dataset and replace multiple spaces with a single space. This step is optional; you can skip it.
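This cleanup can be done with two regex substitutions; a sketch on a small sample string:

```python
import re

article_text = "in 2019 the  average   temperature rose"
article_text = re.sub(r"[0-9]+", "", article_text)  # strip all digits
article_text = re.sub(r"\s+", " ", article_text)    # collapse runs of whitespace
print(article_text)  # "in the average temperature rose"
```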

The next step is to tokenize the article into sentences and words. Tokenization simply refers to splitting the article into its constituent sentences and words.

The following script tokenizes the article into sentences:

And the following script tokenizes the article into words:

Lemmatization and Punctuation Removal

Lemmatization refers to reducing a word to its root form as found in the dictionary. For instance, the lemmatized version of the word eating is eat, better becomes good, media becomes medium, and so on.

Lemmatization helps find similarity between words, since similar words can appear in different tenses and different degrees; lemmatizing makes them uniform.

Similarly, we will remove punctuation marks from our text, because punctuation does not convey any meaning, and if we do not remove it, the punctuation marks will be treated as tokens too.

We will download NLTK’s punkt and wordnet resources, which are required for tokenization and lemmatization. We can then use the WordNetLemmatizer class from the nltk.stem module for lemmatizing the words.

Look at the following script:

In the script above, two helper functions, LemmatizeWords and RemovePunctuations, are defined. The RemovePunctuations function accepts a text string, removes the punctuation from it, and passes the result to the LemmatizeWords function, which lemmatizes the words.

Handling Greetings

The Wikipedia article doesn’t contain any text for handling greetings; however, we want our chatbot to reply to greetings, so we will create a function that handles them. Basically, we will create two lists with different types of greeting messages. The user input is checked against the words in the first greeting list; if it contains a word from that list, a response is randomly chosen from the second greeting list. The following script does that:
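A sketch of such a function; the exact greeting words and responses below, and the reply_to_greeting name, are illustrative assumptions:

```python
import random

# Illustrative greeting lists -- the exact wording is up to you.
greeting_inputs = ("hey", "hello", "hi", "greetings", "good morning")
greeting_responses = ["hey", "hello", "hi there", "*nods*", "hello, how are you?"]

def reply_to_greeting(user_input):
    # If any word of the input appears in the first list,
    # answer with a random entry from the second list.
    for word in user_input.lower().split():
        if word in greeting_inputs:
            return random.choice(greeting_responses)
    return None  # the input was not a greeting

print(reply_to_greeting("Hello there"))
```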

Response Generation

Next, we need to create a method for general response generation. To do so, we need to convert our words to vectors (numbers) and then apply cosine similarity to find the most similar vectors. The intuition behind this approach is that the response should have the highest cosine similarity with the user input. To convert words to vectors, we will use the TF-IDF approach. We can use the TfidfVectorizer class from the sklearn.feature_extraction.text module to convert words to their TF-IDF counterparts. Similarly, to find the cosine similarity, the cosine_similarity method from the sklearn.metrics.pairwise module can be used. The following script imports these modules:
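The two imports described above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
```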

The following function is used for response generation:
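A reconstruction sketch of that function. The give_reply name matches the function referenced later in the article, but the body is my own; the two-sentence corpus stands in for the tokenized article sentences, and the full chatbot would also run the input through the lemmatization and punctuation-removal helpers:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Stand-in corpus; in the chatbot this is the list of article sentences.
article_sentences = [
    "global warming is the long-term rise in the average temperature of the earth.",
    "sea levels are rising because of melting ice sheets.",
]

def give_reply(user_input):
    # Vectorize the article sentences plus the user input with TF-IDF;
    # the user input becomes the last row of the matrix.
    corpus = article_sentences + [user_input]
    tfidf = TfidfVectorizer().fit_transform(corpus)
    # Cosine similarity of the user input against every article sentence.
    similarities = cosine_similarity(tfidf[-1], tfidf[:-1]).flatten()
    best_match = similarities.argmax()
    if similarities[best_match] == 0:
        return "I am sorry, I could not understand you."
    return article_sentences[best_match]

print(give_reply("what is global warming?"))
```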

The above function simply takes the user input as a parameter, lemmatizes it, removes punctuation, and then creates TF-IDF vectors from the words in the sentence. TF-IDF vectors for the existing sentences in the article are created as well. Next, the cosine similarity between the vector for the sentence entered by the user and the vectors for the existing sentences is computed, and the sentence with the highest cosine similarity is returned as the response. If no cosine similarity is found between the user input and any sentence in the article, a response is generated saying that the input was not understood.

Interacting with User

Now that we have created a method for response generation, we need to create the logic to interact with the user. Look at the following method:
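A sketch of that interaction loop. The stand-in reply_to_greeting and give_reply helpers below are simplified placeholders for the functions developed earlier; the get_input/show parameters are my own addition so the loop can run non-interactively:

```python
import random

# Simplified stand-ins for the helpers built earlier in the article.
greeting_responses = ["hi", "hello, how are you?"]

def reply_to_greeting(user_input):
    if any(word in ("hello", "hi", "hey") for word in user_input.split()):
        return random.choice(greeting_responses)
    return None

def give_reply(user_input):
    return "global warming is the long-term rise in the earth's temperature."

def chat(get_input=input, show=print):
    continue_discussion = True
    show("Hello, ask me any question regarding global warming:")
    while continue_discussion:
        user_input = get_input().lower().strip()
        if user_input == "bye":
            continue_discussion = False
            show("Chatbot: Take care, bye ...")
        elif user_input in ("thanks", "thank you very much", "thank you"):
            show("Chatbot: Most welcome")
        else:
            greeting = reply_to_greeting(user_input)
            if greeting is not None:
                show("Chatbot: " + greeting)
            else:
                show("Chatbot: " + give_reply(user_input))

# Scripted demo so the loop runs without blocking on real input:
scripted = iter(["hello", "bye"])
chat(get_input=lambda: next(scripted))
```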

In the script above, we set a flag continue_discussion to True. Next, we execute a while loop inside which we ask the user to input questions regarding global warming. The loop executes as long as the continue_discussion flag is True. If the user input is equal to the string ‘bye’, the loop terminates by setting the continue_discussion flag to False. Else, if the user input contains words like ‘thanks’, ‘thank you very much’ or ‘thank you’, the generated response will be ‘Chatbot: Most welcome’. If the user input contains a greeting, the response will contain a greeting as well. Finally, if the user input doesn’t contain ‘bye’, thank-you words, or greetings, the input is sent to the give_reply function that we created in the last section, which returns an appropriate response based on cosine similarity.

If you run the above script, you should see a text box asking you for any question regarding global warming; based on the question, a response will be generated. A screenshot of the output can be seen below:

From the output, you can see that I entered the question “What is global warming” and received a very good response from the chatbot. You can ignore the warning; it appears because NLTK’s stop word list doesn’t contain words like “ha”, “le”, “u”, etc.

Complete Code for the Application

 

Conclusion

Chatbots are conversational agents that can talk with the user on general topics as well as provide specialized services. In this article, we created a very simple chatbot that generates responses based on a fixed set of rules and the cosine similarity between sentences. The chatbot answers questions related to global warming. To practice chatbot development further, I would suggest that you create a similar chatbot that answers questions related to some other topic.

 

 
