Twitter Sentiment Analysis Using TF-IDF Approach

Text classification is the process of classifying textual data, such as tweets, reviews, articles, and blogs, into predefined categories. Sentiment analysis is a special case of text classification where users' opinions or sentiments about a product are predicted from textual data.

In this tutorial, you will learn how to develop a sentiment analysis model that uses the TF-IDF feature generation approach and is capable of predicting user sentiment (i.e. the view or opinion that is held or expressed) about six airlines operating in the United States by analyzing user tweets. You will use Python's Scikit-Learn machine learning library to implement the TF-IDF approach and to train your prediction model.

Installing Required Libraries

In this tutorial, you will use multiple libraries that you have to install beforehand. To install them, use pip in your Terminal or CMD as follows:
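Assuming the libraries used throughout this tutorial (NumPy, Pandas, NLTK, Scikit-Learn, Seaborn, and Matplotlib), the command would be along these lines:

```shell
pip install numpy pandas nltk scikit-learn seaborn matplotlib
```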

Note: If you are on Linux or Mac, you might need to use sudo before pip to avoid permissions issues.

Importing Libraries

Since you will be using Python to develop the sentiment analysis model, you need to import the required libraries. The following script does that:
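A minimal version of that import script, covering the libraries this tutorial relies on, might be:

```python
import re        # regular expressions, used later for text cleaning

import numpy as np
import pandas as pd
import nltk      # Natural Language Toolkit
```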

In the script above, we import the NumPy, Pandas, NLTK, and re libraries.

Loading Dataset

Next, load the dataset that you will use to train your model. As mentioned earlier, you will build a sentiment analysis model to predict public sentiment about six major airlines operating in the United States. The dataset is freely available at this GitHub link.

Note: To learn how to create such dataset yourself, you can check my other tutorial Scraping Tweets and Performing Sentiment Analysis.

Use the read_csv method of the Pandas library to load the dataset into a "tweets" dataframe (*). You can either use the online URL or download the file and use the local path of the CSV file on your machine.

 

(*) A DataFrame is a two-dimensional data structure, so data is aligned in a table-like form, i.e. in rows and columns. It is the most commonly used Pandas object.

 

To see what your dataset looks like, use the head() method of the Pandas dataframe, which returns the first 5 rows of the dataset as shown below:

 

Similarly, to find the number of rows and columns in the dataset, you can use the shape attribute as shown below:

In the output, you will see (14640, 15), which means that the dataset consists of 14640 rows and 15 columns. Among the columns, we are only interested in the "airline_sentiment" column, which contains the actual sentiment category, and the "text" column, which contains the text of the tweet.
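Put together, the loading and inspection steps might be sketched as follows; the read_csv path is a placeholder, and a tiny stand-in frame keeps the snippet self-contained:

```python
import pandas as pd

# In the tutorial: tweets = pd.read_csv("<path-or-URL-to-the-airline-CSV>")
# A tiny stand-in frame is used here so the sketch runs on its own.
tweets = pd.DataFrame({
    "airline_sentiment": ["negative", "positive", "neutral"],
    "text": ["@united bags lost again", "@VirginAmerica great crew", "@JetBlue on time"],
})

print(tweets.head())   # first rows of the dataframe
print(tweets.shape)    # (3, 2) here; (14640, 15) for the real dataset
```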

Exploratory Data Analysis

In this section, you will learn how to visualize your dataset with graphs. Note that not all Python IDEs support displaying such graphs, so it is recommended that you use either Jupyter Notebook or Spyder. In Jupyter Notebook only, you need to add one extra line.
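That extra line is, in most Matplotlib-based notebooks, the inline plotting magic:

```
%matplotlib inline
```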

 

Before building the actual model, let's perform some exploratory data analysis on the dataset. To see the number of positive, negative, and neutral reviews as a bar plot, execute the following script, which uses the countplot method of Python's Seaborn library.
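A minimal sketch of such a script, assuming the dataset's airline_sentiment column (a tiny stand-in frame is used so the snippet runs on its own):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")          # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# Tiny stand-in; in the tutorial `tweets` comes from read_csv.
tweets = pd.DataFrame({"airline_sentiment": ["negative"] * 5 + ["positive"] * 2 + ["neutral"] * 3})

sns.countplot(x="airline_sentiment", data=tweets)
plt.savefig("sentiment_counts.png")   # use plt.show() in a notebook
```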

From the output, you can see that the number of negative reviews is much higher than the number of positive and neutral reviews.

 

Similarly, to see which airline got the highest number of reviews, execute the following script.
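A sketch along the same lines, assuming an airline column in the dataset:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")          # non-interactive backend
import matplotlib.pyplot as plt
import seaborn as sns

# Tiny stand-in; in the tutorial `tweets` comes from read_csv.
tweets = pd.DataFrame({"airline": ["United"] * 4 + ["Virgin America"] * 1 + ["Delta"] * 2})

sns.countplot(x="airline", data=tweets)
plt.xticks(rotation=45)        # airline names overlap without rotation
plt.savefig("airline_counts.png")
```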

From the output, you can see that the “United” Airline got the highest number of reviews whereas “Virgin America” got the lowest number of reviews.

 

Finally, let's see the number of reviews of each type that each airline received. To do so, you can again use the countplot method from the Seaborn library.
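A sketch of that grouped plot, using the hue parameter of countplot and a small stand-in frame:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")          # non-interactive backend
import matplotlib.pyplot as plt
import seaborn as sns

# Tiny stand-in; in the tutorial `tweets` comes from read_csv.
tweets = pd.DataFrame({
    "airline": ["United", "United", "United", "Delta", "Delta", "Delta"],
    "airline_sentiment": ["negative", "negative", "positive", "negative", "neutral", "negative"],
})

# One bar group per airline, split by sentiment
sns.countplot(x="airline", hue="airline_sentiment", data=tweets)
plt.savefig("airline_sentiment_counts.png")
```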

You can see that for almost all the airlines, the number of negative reviews is larger than positive and neutral reviews.

 

That is enough exploratory data analysis; let's move on to the data preprocessing section.

Data Preprocessing

First, let's divide our dataset into a feature set and a label set. In our feature set, we will only use the text of the tweets. The corresponding label will be the sentiment of the tweet. The "text" column sits at index 10 (column indices start from 0 in Pandas) and contains the text of the tweet. Similarly, the "airline_sentiment" column sits at index 1 and contains the sentiment. Use the "iloc" method of the Pandas dataframe to create the feature set X and the label set y as shown below.
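A sketch of that selection; selecting by column name is shown alongside the positional iloc version, since names are more robust if the column order ever changes (a tiny stand-in frame is used here):

```python
import pandas as pd

# Tiny stand-in; in the tutorial `tweets` comes from read_csv.
tweets = pd.DataFrame({
    "airline_sentiment": ["negative", "positive"],
    "text": ["@united lost my bag", "@JetBlue great flight"],
})

# Tutorial's positional version (for the full 15-column dataset):
#   X = tweets.iloc[:, 10].values   # "text" column
#   y = tweets.iloc[:, 1].values    # "airline_sentiment" column
# Name-based selection works regardless of column positions:
X = tweets["text"].values
y = tweets["airline_sentiment"].values
```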

Our dataset contains many special characters and empty spaces. You need to remove them in order to have a clean dataset. The following script does that:

Let's see what is happening in the script above. We use several regular expressions to preprocess the text. The regular expression re.sub(r'\W', ' ', str(X[tweet])) replaces all the special characters in the tweet with spaces.

When you remove special characters, you are left with single characters that do not have any meaning. For instance, when you remove the special character from the word "Julia's", you are left with "Julia" and "s". Here "s" has no meaning. The regular expression re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_tweet) removes all the single characters except the ones at the start. To remove a single character from the beginning of a sentence, the regex re.sub(r'^[a-zA-Z]\s+', ' ', processed_tweet) is used (note that the caret must not be escaped; an escaped \^ would match a literal ^ character instead of the start of the string).

Next, as a result of removing special characters and single characters, multiple spaces appear in the text. To replace these runs of spaces with single spaces, use the re.sub(r'\s+', ' ', processed_tweet, flags=re.I) regex.

In some cases, the dataset is in byte format; the character "b" then appears at the beginning of the string. Remove that leading "b" using the re.sub(r'^b\s+', '', processed_tweet) regular expression. As the last step, convert your text to lowercase to maintain uniformity.
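The preprocessing loop described above can be sketched as follows, with the unescaped caret in the start-of-string regex and a couple of toy tweets standing in for the real data:

```python
import re

# Stand-in tweets; in the tutorial X comes from the "text" column.
X = ["Julia's phone", "b'Great flight!!'"]

processed_tweets = []
for tweet in X:
    # Replace special characters with spaces
    processed = re.sub(r'\W', ' ', str(tweet))
    # Drop single characters surrounded by spaces (the orphaned "s" in "Julia s")
    processed = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed)
    # Drop a single character at the start of the string
    processed = re.sub(r'^[a-zA-Z]\s+', ' ', processed)
    # Collapse runs of whitespace into single spaces
    processed = re.sub(r'\s+', ' ', processed, flags=re.I)
    # Remove a leading "b" left over from byte strings
    processed = re.sub(r'^b\s+', '', processed)
    processed_tweets.append(processed.lower())

print(processed_tweets)   # ['julia phone', ' great flight ']
```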

TF-IDF Scheme for Text to Numeric Feature Generation

Statistical approaches such as machine learning and deep learning work well with numerical data. However, natural language consists of words and sentences. Therefore, before you can build a sentiment analysis model, you need to convert text to numbers. Several approaches have been developed for converting text to numbers. Bag of Words, N-grams, and Word2Vec model are some of them.

In this article, we will use the Bag of Words approach with TF-IDF scheme, in order to convert text to numbers. Python’s Sklearn library comes with built-in functionalities to implement TF-IDF approach which you will see later. Here we will provide a brief insight into the TF-IDF approach.

Bag of Words

In the bag of words approach, a vocabulary of all the unique words in all the documents is formed. This vocabulary serves as the feature vector. Suppose you have three documents in your corpus, S1, S2, and S3:

  • S1 = “It is cold outside”
  • S2= “The weather is cold”
  • S3 = “I am outside”

The vocabulary formed using the above three sentences will be:

[it, is, cold, outside, the, weather, I, am]

This vocabulary of words will be used to create feature vectors from the sentence. Let’s see how it is done. The feature vector for S1 will be:

S1= [1, 1, 1, 1, 0, 0, 0, 0]

Basically, the feature vector is created by checking whether each word in the vocabulary is also found in the sentence. If a vocabulary word occurs in the sentence, a one is entered at its position, otherwise a zero. For S1, the first four words of the vocabulary are present in the sentence, so the vector starts with four ones followed by four zeros.

Similarly, the feature vectors for S2 and S3 will be:

S2 = [0, 1, 1, 0, 1, 1, 0, 0]

S3 = [0, 0, 0, 1, 0, 0, 1, 1]
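A minimal sketch of this construction in Python:

```python
docs = ["It is cold outside", "The weather is cold", "I am outside"]
# Vocabulary in first-seen order across the three documents
vocab = ["it", "is", "cold", "outside", "the", "weather", "i", "am"]

def bow_vector(doc):
    """Binary bag-of-words vector: 1 if the vocabulary word occurs in the document."""
    words = doc.lower().split()
    return [1 if word in words else 0 for word in vocab]

for doc in docs:
    print(bow_vector(doc))
# [1, 1, 1, 1, 0, 0, 0, 0]
# [0, 1, 1, 0, 1, 1, 0, 0]
# [0, 0, 0, 1, 0, 0, 1, 1]
```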

TF-IDF

Now that you know how the bag of words approach works, let's see how TF-IDF is related to it.

In a simple bag of words, every word is given equal importance. The idea behind TF-IDF is that the words that occur more frequently in one document and less frequently in other documents should be given more importance as they are more useful for classification.

TF-IDF is a product of two terms: TF and IDF.

Term Frequency measures how often a word occurs in a specific document, normalized by the length of the document. It is calculated as:

TF  = (Frequency of a word in the document)/(Total words in the document)

Inverse Document Frequency for a specific word is based on the total number of documents divided by the number of documents that contain that word. The log of the ratio is taken to dampen the effect of the division. It is calculated as:

IDF = Log((Total number of docs)/(Number of docs containing the word))

For instance, in S1, the TF for the word "outside" will be 1/4 = 0.25. The IDF for the word "outside" will be Log(3/2) = 0.176 (base-10 logarithm), since "outside" occurs in two of the three documents. The TF-IDF value is 0.25 x 0.176 = 0.044.
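The arithmetic can be checked with a few lines of Python (base-10 logarithm, matching the numbers above):

```python
import math

docs = [
    ["it", "is", "cold", "outside"],
    ["the", "weather", "is", "cold"],
    ["i", "am", "outside"],
]

word = "outside"
tf = docs[0].count(word) / len(docs[0])               # 1/4 = 0.25
n_docs_with_word = sum(1 for d in docs if word in d)  # 2 (S1 and S3)
idf = math.log10(len(docs) / n_docs_with_word)        # log10(3/2) ≈ 0.176

print(round(tf * idf, 3))                             # 0.044
```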

These are tedious calculations. Fortunately, you do not have to do them by hand. The TfidfVectorizer class from the sklearn.feature_extraction.text module can be used to create feature vectors containing TF-IDF values. Look at the following script:

The max_features attribute specifies the number of most frequently occurring words for which you want to create feature vectors. Less frequent words do not play a major role in classification, so we only retain the 2000 most frequently occurring words in the dataset. The min_df value of 5 specifies that a word must occur in at least 5 documents. Similarly, the max_df value of 0.7 specifies that a word must not occur in more than 70 percent of the documents. The rationale behind choosing 70 percent as the threshold is that words occurring in more than 70 percent of the documents are too common to play any useful role in classifying sentiment.

Finally, to convert your dataset into the corresponding TF-IDF feature vectors, call the fit_transform method of the TfidfVectorizer class and pass it the preprocessed dataset.

Dividing Data to Training and Test Sets

Before building the actual sentiment analysis model, divide your dataset into training and test sets. The model will be trained on the training set and evaluated on the test set. The following script divides the data into training and test sets.
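A sketch of that split, using Scikit-Learn's train_test_split on stand-in arrays (the 80/20 split and random_state are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix and labels; in the tutorial these are the
# TF-IDF vectors and the airline_sentiment labels.
X = np.arange(20).reshape(10, 2)
y = np.array(["pos", "neg"] * 5)

# 80 percent training, 20 percent testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
```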

Training and Evaluating the Text Classification Model

Now that the data is divided into training and test sets, the next step is to train the model on the training set and evaluate its performance on the test set. Use the RandomForestClassifier from the sklearn.ensemble module to train your model; you can use any other classifier of your choice. To train the model, call the "fit" method on the classifier object and pass it the training feature set and the training label set as shown below:

 

To make predictions on the test set, pass the test set to the "predict" method as shown below:

 

Finally, to evaluate the classification model that you developed, you can use the confusion matrix, the classification report, and accuracy as performance metrics. These metrics can be calculated using classes from the sklearn.metrics module as shown below:
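The training, prediction, and evaluation steps can be sketched together as follows; random stand-in data replaces the TF-IDF features here, and n_estimators=200 is an illustrative choice:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data; in the tutorial X holds TF-IDF vectors and y the sentiments.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = np.where(X[:, 0] > 0.5, "positive", "negative")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the random forest on the training set
classifier = RandomForestClassifier(n_estimators=200, random_state=0)
classifier.fit(X_train, y_train)

# Predict labels for the test set
predictions = classifier.predict(X_test)

# Evaluate with the three metrics named in the text
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
```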

Results

Our classifier achieved an accuracy of 75.47 percent.

Complete Code

Here is the complete code for this tutorial:
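A self-contained sketch that stitches the steps together; it substitutes a tiny inline sample for the real CSV (and relaxes min_df accordingly), so treat it as a template rather than the exact original:

```python
import re
from io import StringIO

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# In the tutorial the data comes from the airline-tweets CSV:
#   tweets = pd.read_csv("<path-or-URL-to-the-airline-CSV>")
# A tiny inline sample keeps this sketch self-contained.
csv_data = StringIO(
    "airline_sentiment,text\n"
    + "negative,@united you lost my bag again\n" * 10
    + "positive,@VirginAmerica the crew was great\n" * 10
)
tweets = pd.read_csv(csv_data)

X_raw = tweets["text"].values
y = tweets["airline_sentiment"].values

# Text preprocessing with the regexes described earlier
processed_tweets = []
for tweet in X_raw:
    t = re.sub(r'\W', ' ', str(tweet))
    t = re.sub(r'\s+[a-zA-Z]\s+', ' ', t)
    t = re.sub(r'^[a-zA-Z]\s+', ' ', t)
    t = re.sub(r'\s+', ' ', t, flags=re.I)
    t = re.sub(r'^b\s+', '', t)
    processed_tweets.append(t.lower())

# min_df relaxed for the tiny sample; the tutorial uses min_df=5.
tfidf = TfidfVectorizer(max_features=2000, min_df=1, max_df=0.7)
X = tfidf.fit_transform(processed_tweets).toarray()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifier = RandomForestClassifier(n_estimators=200, random_state=0)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
```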

 

Conclusion

Sentiment analysis is one of the major tasks in natural language processing. To apply statistical techniques for sentiment analysis, you need to convert text to numbers. In this article, you saw how the TF-IDF approach can be used to create numeric feature vectors from text. Our sentiment analysis model achieves an accuracy of around 75% for sentiment prediction. I suggest that you try support vector machines and neural network classifiers to train your models and see how much accuracy you achieve.

 

 
