Chatbots are intelligent agents that engage in conversations with humans to answer their queries on a certain topic. Amazon’s Alexa, Apple’s Siri, and Microsoft’s Cortana are some examples of chatbots.
Depending upon their functionality, chatbots can be divided into three categories: general purpose chatbots, task-oriented chatbots, and hybrid chatbots. General purpose chatbots conduct a general discussion with the user, not on any specific topic. Task-oriented chatbots, on the other hand, are designed to perform specialized tasks, for example, to serve as an online ticket reservation system or a pizza delivery system. Finally, hybrid chatbots are designed for both general and task-oriented discussions.
Chatbot Development Approaches
There are two major approaches for developing chatbots: Rule-based approaches and learning-based approaches.
Rule-based Approaches
In rule-based approaches, there is a fixed set of responses available, and a response is selected based on a matching rule. For instance, if a user says “hello”, the chatbot might contain an if statement implementing the logic: whenever a user says “hello”, reply with “hi, how are you?”. One advantage of the rule-based approach is accuracy, since each anticipated query has a hand-crafted response. However, rule-based chatbots do not scale well and cannot reply to user inputs for which no rule is defined: to answer a large number of user queries, a correspondingly large number of rules has to be implemented.
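To make the idea concrete, here is a minimal, hypothetical sketch of a rule-based reply function; the rules and responses are purely illustrative, not from any particular system:

```python
# A tiny rule-based chatbot: a fixed mapping from user inputs to canned
# responses, with a default reply when no rule matches.
rules = {
    "hello": "hi, how are you?",
    "bye": "goodbye!",
}

def rule_based_reply(user_input):
    # Normalize the input, then look up the matching rule
    return rules.get(user_input.lower().strip(), "Sorry, I don't have a rule for that.")

print(rule_based_reply("Hello"))   # hi, how are you?
print(rule_based_reply("what?"))   # Sorry, I don't have a rule for that.
```

Answering every new query means adding another entry to the rules, which is exactly why this approach does not scale.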
Learning-based Approaches
Learning-based approaches use statistical algorithms such as Machine Learning algorithms to learn from the data and generate responses based on that learning. One of the advantages of learning based approaches is that they scale well. However, learning based approaches require a huge amount of data to train and may not be very accurate.
Learning-based approaches are further divided into two categories: generative approaches and retrieval approaches. Generative chatbots learn to generate the words of a response to a user query. Retrieval chatbots, on the other hand, select a complete, pre-existing response based on the user input. Generative approaches are more flexible than retrieval approaches, since the response is generated on the fly depending upon the user input.
Developing a Chatbot in Python
In this tutorial, we will develop a very simple task-oriented chatbot capable of answering questions related to global warming. The chatbot will be fairly simple and will generate answers based on cosine similarity.
Downloading Required Libraries
Before we can proceed with the code, we need to download the following libraries:
- Chatbot development falls into the broader category of natural language processing. We will be using the natural language processing library NLTK to create our chatbot. The installation instructions for NLTK can be found at this official link.
- The dataset used for creating our chatbot will be the Wikipedia article on global warming. To scrape the article, we will use the BeautifulSoup library for Python. The download instructions for the library are available here.
- The BeautifulSoup library retrieves the data from a website in HTML format. To parse the HTML, we will use the LXML library. The download instructions for the library are available at the official link.
We will be using the Anaconda distribution of Python; the rest of the libraries are built into the Anaconda distribution, so you do not have to download them.
Let’s now start our chatbot development. We need to perform the following steps:
Importing the Required Libraries
The first step is to import the required libraries. Look at the following script:
import bs4 as bs
import urllib.request
import re
import nltk
import numpy as np
import random
import string
In the script above, we import the bs4 (BeautifulSoup) library for parsing the webpage, the urllib.request module to connect to a remote webpage, the re library for regular expression operations, the nltk library for natural language processing, and the numpy library for basic array operations. The random library is used for random number generation; we will see how we use it later in the article. Finally, the string library is used for string manipulation.
Scraping and Preprocessing the Wikipedia Article
Once we have imported the required libraries, we are ready to scrape the article from Wikipedia. As mentioned earlier, our chatbot will answer questions related to global warming, so we need a dataset that contains information about the topic. One such source is the Wikipedia article on global warming. The following script scrapes the article and extracts its paragraphs.
raw_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Global_warming')
raw_data = raw_data.read()
html_data = bs.BeautifulSoup(raw_data, 'lxml')
all_paragraphs = html_data.find_all('p')

article_content = ""
for p in all_paragraphs:
    article_content += p.text
article_content = article_content.lower()
In the script above, we first use the urlopen function from the urllib.request module to open a connection to the Wikipedia article, then call the read method to fetch its contents. The data retrieved this way is raw bytes; to parse it, we pass it to the BeautifulSoup class along with the parser name 'lxml'. From the parsed HTML we extract the paragraphs using the find_all method, passing 'p' (the HTML paragraph tag) as the parameter. Finally, we join the text of all paragraphs and convert the result to lowercase.
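To see the extraction mechanics in isolation, here is a small self-contained sketch on an inline HTML string (using Python’s built-in html.parser instead of lxml, so nothing extra is needed for this toy example):

```python
import bs4 as bs

html = "<html><body><p>First paragraph.</p><p>Second paragraph.</p></body></html>"
soup = bs.BeautifulSoup(html, 'html.parser')

# find_all('p') returns every <p> element; .text strips the tags
article_text = ""
for p in soup.find_all('p'):
    article_text += p.text

print(article_text)  # First paragraph.Second paragraph.
```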
Next, we remove the citation markers (numbers in square brackets) from our dataset and replace runs of whitespace with a single space. This step is optional and you can skip it.
article_content = re.sub(r'\[[0-9]*\]', ' ', article_content)
article_content = re.sub(r'\s+', ' ', article_content)
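To illustrate what these two substitutions do, consider a made-up snippet containing Wikipedia-style citation markers and irregular spacing:

```python
import re

text = "global warming[12] refers  to   long-term warming[3]"
text = re.sub(r'\[[0-9]*\]', ' ', text)  # drop citation markers like [12]
text = re.sub(r'\s+', ' ', text)         # collapse runs of whitespace
# prints "global warming refers to long-term warming " (note the trailing
# space left where the final marker was removed)
print(text)
```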
The next step is to tokenize the article into sentences and words. Tokenization simply refers to splitting the article into its constituent sentences or words.
The following script tokenizes the article into sentences:
sentence_list = nltk.sent_tokenize(article_content)
And the following script tokenizes the article into words:
article_words = nltk.word_tokenize(article_content)
Lemmatization and Punctuation Removal
Lemmatization refers to reducing a word to its root form as found in the dictionary. For instance, the lemmatized version of the word eating is eat, and the lemmatized version of media is medium. Lemmatization helps find similarity between words, since the same word can occur in different tenses and degrees; lemmatizing makes these forms uniform.
Similarly, we will remove punctuations from our text because punctuations do not convey any meaning and if we do not remove them, they will also be treated as tokens.
We first need to download NLTK’s punkt (tokenizer models) and wordnet (the lemmatizer’s dictionary) resources using nltk.download(). We can then use the WordNetLemmatizer class from the nltk.stem module to lemmatize words, and a translation table built from string.punctuation to remove punctuation.
Look at the following script:
nltk.download('punkt')
nltk.download('wordnet')

lemmatizer = nltk.stem.WordNetLemmatizer()

def LemmatizeWords(words):
    return [lemmatizer.lemmatize(word) for word in words]

remove_punctuation = dict((ord(punctuation), None) for punctuation in string.punctuation)

def RemovePunctuations(text):
    return LemmatizeWords(nltk.word_tokenize(text.lower().translate(remove_punctuation)))
In the script above, two helper functions, LemmatizeWords and RemovePunctuations, are defined. The RemovePunctuations function accepts a text string, lowercases it, strips punctuation using a translation table, tokenizes it, and passes the resulting words to the LemmatizeWords function, which lemmatizes them.
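The punctuation removal relies on str.translate with a table that maps each punctuation character to None. Here is a minimal standalone illustration of that mechanism (lemmatization is omitted, since it requires the NLTK wordnet download):

```python
import string

# Map the Unicode code point of every punctuation character to None,
# which deletes it during translation
remove_punctuation = dict((ord(c), None) for c in string.punctuation)

cleaned = "Hello, world!".lower().translate(remove_punctuation)
print(cleaned)  # hello world
```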
Handling Greetings
The Wikipedia article doesn’t contain any text for handling greetings; however, we want our chatbot to reply to them, so we will create a function for that. We define two lists: one of greeting words to look for, and one of greeting replies. The user input is checked against the first list; if it contains a word from that list, a response is chosen at random from the second list. The following script does that:
greeting_input_texts = ("hey", "heys", "hello", "morning", "evening", "greetings")
greeting_replie_texts = ["hey", "hey hows you?", "*nods*", "hello there", "ello", "Welcome, how are you"]

def reply_greeting(text):
    for word in text.split():
        if word.lower() in greeting_input_texts:
            return random.choice(greeting_replie_texts)
Response Generation
Next, we need to create a method for general response generation. To do so, we need to convert our sentences to numeric vectors and then apply cosine similarity to find the most similar vectors. The intuition behind this approach is that the best response should have the highest cosine similarity with the user input. To convert words to vectors, we will use the TF-IDF approach. We can use the TfidfVectorizer class from the sklearn.feature_extraction.text module to convert sentences to their TF-IDF representations. Similarly, to find the cosine similarity, the cosine_similarity function from the sklearn.metrics.pairwise module can be used. The following script imports these modules:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
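Before applying it to TF-IDF vectors, it may help to see what cosine similarity computes on two toy vectors. This small numpy sketch (not part of the chatbot itself) implements the definition directly:

```python
import numpy as np

def cosine_sim(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 1.0])  # e.g. term counts for one sentence
b = np.array([1.0, 1.0, 0.0])  # term counts for another sentence

print(cosine_sim(a, b))  # ≈ 0.5: the vectors share one of their two terms
```

Identical vectors score 1.0 and vectors with no terms in common score 0.0, which is why a zero score is treated as “not understood” later on.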
The following function is used for response generation:
def give_reply(user_input):
    chatbot_response = ''
    sentence_list.append(user_input)
    word_vectors = TfidfVectorizer(tokenizer=RemovePunctuations, stop_words='english')
    vectorized_words = word_vectors.fit_transform(sentence_list)
    similarity_values = cosine_similarity(vectorized_words[-1], vectorized_words)
    similar_sentence_number = similarity_values.argsort()[0][-2]
    similar_vectors = similarity_values.flatten()
    similar_vectors.sort()
    matched_vector = similar_vectors[-2]
    if matched_vector == 0:
        chatbot_response = chatbot_response + "I am sorry! I don't understand you"
    else:
        chatbot_response = chatbot_response + sentence_list[similar_sentence_number]
    return chatbot_response
The above function takes the user input as a parameter, appends it to the sentence list, and creates TF-IDF vectors for all sentences, including the user input (each sentence is tokenized, punctuation-stripped, and lemmatized by RemovePunctuations along the way). Next, the cosine similarity between the user input’s vector and every sentence vector is computed, and the sentence with the highest cosine similarity, other than the user input itself, is returned as the response. If no sentence has any cosine similarity with the user input, the chatbot responds that it does not understand.
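The index arithmetic deserves a note: because the user input is appended to sentence_list, its similarity to itself is 1.0 and always sorts last, so the second-to-last index from argsort is the best genuine match. A small numpy illustration with made-up similarity scores:

```python
import numpy as np

# Similarities of the user input to each sentence; the last entry (1.0)
# is the input compared against itself.
similarity_values = np.array([[0.1, 0.7, 0.3, 1.0]])

# argsort orders indices from least to most similar; [-1] would be the
# input itself, so [-2] is the most similar real sentence.
best_match = similarity_values.argsort()[0][-2]
print(best_match)  # 1
```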
Interacting with User
Now that we have created methods for response generation, we need logic to interact with the user. Look at the following script:
continue_discussion = True
print("Hello, I am a chatbot, I will answer your queries regarding global warming:")
while continue_discussion:
    user_input = input().lower()
    if user_input != 'bye':
        if user_input in ('thanks', 'thank you very much', 'thank you'):
            continue_discussion = False
            print("Chatbot: Most welcome")
        elif reply_greeting(user_input) is not None:
            print("Chatbot: " + reply_greeting(user_input))
        else:
            print("Chatbot: ", end="")
            print(give_reply(user_input))
            sentence_list.remove(user_input)
    else:
        continue_discussion = False
        print("Chatbot: Take care, bye ..")
In the script above, we set a flag continue_discussion to True and then execute a while loop in which we ask the user to input questions regarding global warming. The loop executes until the continue_discussion flag is set to False. If the user input equals ‘bye’, the loop terminates by setting continue_discussion to False. If the user input is ‘thanks’, ‘thank you very much’, or ‘thank you’, the response is ‘Chatbot: Most welcome’ and the conversation ends. If the user input contains a greeting, a greeting is returned. Finally, any other input is sent to the give_reply function created in the last section, which returns an appropriate response based on cosine similarity; the input is then removed from sentence_list so it does not pollute the corpus.
If you run the above script, you will be prompted to enter a question regarding global warming, and a response will be generated based on it. For example, entering the question “What is global warming” returns a relevant sentence from the article. You may see a warning from the vectorizer; you can ignore it, as it appears because lemmatized tokens such as “ha”, “le”, and “u” are not in the built-in stop word list.
Complete Code for the Application
import bs4 as bs
import urllib.request
import re
import nltk
import numpy as np
import random
import string
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download the tokenizer and lemmatizer resources before using them
nltk.download('punkt')
nltk.download('wordnet')

raw_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Global_warming')
raw_data = raw_data.read()
html_data = bs.BeautifulSoup(raw_data, 'lxml')
all_paragraphs = html_data.find_all('p')

article_content = ""
for p in all_paragraphs:
    article_content += p.text
article_content = article_content.lower()  # convert to lowercase

article_content = re.sub(r'\[[0-9]*\]', ' ', article_content)  # remove citation markers
article_content = re.sub(r'\s+', ' ', article_content)  # collapse whitespace

sentence_list = nltk.sent_tokenize(article_content)
article_words = nltk.word_tokenize(article_content)

lemmatizer = nltk.stem.WordNetLemmatizer()

def LemmatizeWords(words):
    return [lemmatizer.lemmatize(word) for word in words]

remove_punctuation = dict((ord(punctuation), None) for punctuation in string.punctuation)

def RemovePunctuations(text):
    return LemmatizeWords(nltk.word_tokenize(text.lower().translate(remove_punctuation)))

greeting_input_texts = ("hey", "heys", "hello", "morning", "evening", "greetings")
greeting_replie_texts = ["hey", "hey hows you?", "*nods*", "hello there", "ello", "Welcome, how are you"]

def reply_greeting(text):
    for word in text.split():
        if word.lower() in greeting_input_texts:
            return random.choice(greeting_replie_texts)

def give_reply(user_input):
    chatbot_response = ''
    sentence_list.append(user_input)
    word_vectors = TfidfVectorizer(tokenizer=RemovePunctuations, stop_words='english')
    vectorized_words = word_vectors.fit_transform(sentence_list)
    similarity_values = cosine_similarity(vectorized_words[-1], vectorized_words)
    similar_sentence_number = similarity_values.argsort()[0][-2]
    similar_vectors = similarity_values.flatten()
    similar_vectors.sort()
    matched_vector = similar_vectors[-2]
    if matched_vector == 0:
        chatbot_response = chatbot_response + "I am sorry! I don't understand you"
    else:
        chatbot_response = chatbot_response + sentence_list[similar_sentence_number]
    return chatbot_response

continue_discussion = True
print("Hello, I am a chatbot, I will answer your queries regarding global warming:")
while continue_discussion:
    user_input = input().lower()
    if user_input != 'bye':
        if user_input in ('thanks', 'thank you very much', 'thank you'):
            continue_discussion = False
            print("Chatbot: Most welcome")
        elif reply_greeting(user_input) is not None:
            print("Chatbot: " + reply_greeting(user_input))
        else:
            print("Chatbot: ", end="")
            print(give_reply(user_input))
            sentence_list.remove(user_input)
    else:
        continue_discussion = False
        print("Chatbot: Take care, bye ..")
Conclusion
Chatbots are conversational agents that can talk with the user on general topics as well as provide specialized services. In this article, we created a very simple chatbot that generates responses using a small set of greeting rules and the cosine similarity between sentences. The chatbot answers questions related to global warming. To practice chatbot development further, I would suggest that you create a similar chatbot that answers questions on some other topic.
I am a Machine Learning and Data Science expert, currently pursuing my PhD in Computer Science at Normandy University, France.