Twitter Sentiment Analysis Using TF-IDF Approach

Text classification is the process of classifying textual data, such as tweets, reviews, articles, and blogs, into predefined categories. Sentiment analysis is a special case of text classification where users' opinions or sentiments about a product are predicted from textual data.

In this tutorial, you will learn how to develop a sentiment analysis model that uses the TF-IDF feature generation approach and is capable of predicting user sentiment (i.e. the view or opinion that is held or expressed) about six airlines operating in the United States by analyzing user tweets. You will use Python's Scikit-Learn machine learning library to implement the TF-IDF approach and to train your prediction model.

Installing Required Libraries

In this tutorial, you will use multiple libraries that you have to install beforehand. To install them, use pip in your Terminal or CMD as follows:
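Assuming the libraries used throughout this tutorial (NumPy, Pandas, NLTK, Scikit-Learn, Seaborn, and Matplotlib), the command would be along these lines:

```shell
pip install numpy pandas nltk scikit-learn seaborn matplotlib
```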

Note: If you are on Linux or Mac, you might need to use sudo before pip to avoid permissions issues.

Importing Libraries

Since you will be using Python to develop the sentiment analysis model, you need to import the required libraries. The following script does that:
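A minimal version of that import script, covering the libraries this tutorial relies on, might be:

```python
import re        # regular expressions, used later for text cleaning

import numpy as np
import pandas as pd
import nltk      # Natural Language Toolkit
```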

In the script above, we import the NumPy, Pandas, NLTK, and re libraries.

Loading Dataset

Next, load the dataset that you will use to train your model. As mentioned earlier, you will build a sentiment analysis model to predict public sentiment about six major airlines operating in the United States. The dataset is freely available at this GitHub link.

Note: To learn how to create such dataset yourself, you can check my other tutorial Scraping Tweets and Performing Sentiment Analysis.

Use the read_csv method of the Pandas library to load the dataset into a "tweets" dataframe (*). You can either use the online URL or download the file and use the local path of the CSV file on your machine.

 

(*) A DataFrame is a two-dimensional data structure, so data is aligned in a table-like form, i.e. in rows and columns. It is the most commonly used Pandas object.

 

To see what your dataset looks like, use the head() method of the Pandas dataframe, which returns the first 5 rows of the dataset as shown below:

 

Similarly, to find the number of rows and columns in the dataset, you can use the shape attribute as shown below:

In the output, you will see (14640, 15), which means that the dataset consists of 14640 rows and 15 columns. Among the columns, we are only interested in the "airline_sentiment" column, which contains the actual sentiment category, and the "text" column, which contains the text of the tweet.
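Put together, the loading and inspection steps might be sketched as follows; the read_csv path is a placeholder, and a tiny stand-in frame keeps the snippet self-contained:

```python
import pandas as pd

# In the tutorial: tweets = pd.read_csv("<path-or-URL-to-the-airline-CSV>")
# A tiny stand-in frame is used here so the sketch runs on its own.
tweets = pd.DataFrame({
    "airline_sentiment": ["negative", "positive", "neutral"],
    "text": ["@united bags lost again", "@VirginAmerica great crew", "@JetBlue on time"],
})

print(tweets.head())   # first rows of the dataframe
print(tweets.shape)    # (3, 2) here; (14640, 15) for the real dataset
```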

Exploratory Data Analysis

In this section, you will learn how to visualize your dataset with graphs. Note that not all Python IDEs support displaying such graphs, so it is recommended that you use either Jupyter Notebook or Spyder. In Jupyter Notebook only, you need to add one extra line.
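That extra line is, in most Matplotlib-based notebooks, the inline plotting magic:

```
%matplotlib inline
```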

 

Before building the actual model, let's perform some exploratory data analysis on the dataset. To see the number of positive, negative, and neutral reviews as a bar plot, execute the following script, which uses the countplot method of Python's Seaborn library.
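A minimal sketch of such a script, assuming the dataset's airline_sentiment column (a tiny stand-in frame is used so the snippet runs on its own):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")          # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import seaborn as sns

# Tiny stand-in; in the tutorial `tweets` comes from read_csv.
tweets = pd.DataFrame({"airline_sentiment": ["negative"] * 5 + ["positive"] * 2 + ["neutral"] * 3})

sns.countplot(x="airline_sentiment", data=tweets)
plt.savefig("sentiment_counts.png")   # use plt.show() in a notebook
```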

From the output, you can see that the number of negative reviews is much higher than the number of positive and neutral reviews.

 

Similarly, to see which airline got the highest number of reviews, execute the following script.
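A sketch along the same lines, assuming an airline column in the dataset:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")          # non-interactive backend
import matplotlib.pyplot as plt
import seaborn as sns

# Tiny stand-in; in the tutorial `tweets` comes from read_csv.
tweets = pd.DataFrame({"airline": ["United"] * 4 + ["Virgin America"] * 1 + ["Delta"] * 2})

sns.countplot(x="airline", data=tweets)
plt.xticks(rotation=45)        # airline names overlap without rotation
plt.savefig("airline_counts.png")
```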

From the output, you can see that the “United” Airline got the highest number of reviews whereas “Virgin America” got the lowest number of reviews.

 

Finally, let's see the number of reviews of each type that each airline received. To do so, you can again use the countplot method from the Seaborn library.
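A sketch of that grouped plot, using the hue parameter of countplot and a small stand-in frame:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")          # non-interactive backend
import matplotlib.pyplot as plt
import seaborn as sns

# Tiny stand-in; in the tutorial `tweets` comes from read_csv.
tweets = pd.DataFrame({
    "airline": ["United", "United", "United", "Delta", "Delta", "Delta"],
    "airline_sentiment": ["negative", "negative", "positive", "negative", "neutral", "negative"],
})

# One bar group per airline, split by sentiment
sns.countplot(x="airline", hue="airline_sentiment", data=tweets)
plt.savefig("airline_sentiment_counts.png")
```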

You can see that for almost all the airlines, the number of negative reviews is larger than positive and neutral reviews.

 

That is enough exploratory data analysis; let's move on to the data preprocessing section.

Data Preprocessing

First, let's divide our dataset into a feature set and a label set. In our feature set, we will only use the text of the tweets. The corresponding label will be the sentiment of the tweet. The "text" column sits at index 10 (column indices start from 0 in Pandas) and contains the text of the tweet. Similarly, the "airline_sentiment" column sits at index 1 and contains the sentiment. Use the "iloc" method of the Pandas dataframe to create the feature set X and the label set y as shown below.
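A sketch of that selection; selecting by column name is shown alongside the positional iloc version, since names are more robust if the column order ever changes (a tiny stand-in frame is used here):

```python
import pandas as pd

# Tiny stand-in; in the tutorial `tweets` comes from read_csv.
tweets = pd.DataFrame({
    "airline_sentiment": ["negative", "positive"],
    "text": ["@united lost my bag", "@JetBlue great flight"],
})

# Tutorial's positional version (for the full 15-column dataset):
#   X = tweets.iloc[:, 10].values   # "text" column
#   y = tweets.iloc[:, 1].values    # "airline_sentiment" column
# Name-based selection works regardless of column positions:
X = tweets["text"].values
y = tweets["airline_sentiment"].values
```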

Our dataset contains many special characters and empty spaces. You need to remove them in order to have a clean dataset. The following script does that:

Let's see what is happening in the script above. We use several regular expressions to preprocess the text. The regular expression re.sub(r'\W', ' ', str(X[tweet])) replaces all the special characters in the tweet with spaces.

When you remove special characters, you are left with single characters that do not have any meaning. For instance, when you remove the special character from the word "Julia's", you are left with "Julia" and "s". Here "s" has no meaning. The regular expression re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_tweet) removes all the single characters except the ones at the start. To remove a single character from the beginning of a sentence, the regex re.sub(r'^[a-zA-Z]\s+', ' ', processed_tweet) is used (note that the caret must not be escaped; an escaped \^ would match a literal ^ character instead of the start of the string).

Next, as a result of removing special characters and single characters, multiple spaces appear in the text. To replace these runs of spaces with single spaces, use the re.sub(r'\s+', ' ', processed_tweet, flags=re.I) regex.

In some cases, the dataset is in byte format; the character "b" then appears at the beginning of the string. Remove that leading "b" using the re.sub(r'^b\s+', '', processed_tweet) regular expression. As the last step, convert your text to lowercase to maintain uniformity.
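The preprocessing loop described above can be sketched as follows, with the unescaped caret in the start-of-string regex and a couple of toy tweets standing in for the real data:

```python
import re

# Stand-in tweets; in the tutorial X comes from the "text" column.
X = ["Julia's phone", "b'Great flight!!'"]

processed_tweets = []
for tweet in X:
    # Replace special characters with spaces
    processed = re.sub(r'\W', ' ', str(tweet))
    # Drop single characters surrounded by spaces (the orphaned "s" in "Julia s")
    processed = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed)
    # Drop a single character at the start of the string
    processed = re.sub(r'^[a-zA-Z]\s+', ' ', processed)
    # Collapse runs of whitespace into single spaces
    processed = re.sub(r'\s+', ' ', processed, flags=re.I)
    # Remove a leading "b" left over from byte strings
    processed = re.sub(r'^b\s+', '', processed)
    processed_tweets.append(processed.lower())

print(processed_tweets)   # ['julia phone', ' great flight ']
```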

TF-IDF Scheme for Text to Numeric Feature Generation

Statistical approaches such as machine learning and deep learning work well with numerical data. However, natural language consists of words and sentences. Therefore, before you can build a sentiment analysis model, you need to convert text to numbers. Several approaches have been developed for converting text to numbers. Bag of Words, N-grams, and Word2Vec model are some of them.

In this article, we will use the Bag of Words approach with TF-IDF scheme, in order to convert text to numbers. Python’s Sklearn library comes with built-in functionalities to implement TF-IDF approach which you will see later. Here we will provide a brief insight into the TF-IDF approach.

Bag of Words

In the bag of words approach, a vocabulary of all the unique words in all the documents is formed. This vocabulary serves as the feature vector. Suppose you have three documents in your corpus, S1, S2, and S3:

  • S1 = “It is cold outside”
  • S2= “The weather is cold”
  • S3 = “I am outside”

The vocabulary formed using the above three sentences will be:

[it, is, cold, outside, the, weather, I, am]

This vocabulary of words will be used to create feature vectors from the sentence. Let’s see how it is done. The feature vector for S1 will be:

S1= [1, 1, 1, 1, 0, 0, 0, 0]

Basically, the feature vector is created by checking whether each word in the vocabulary is also found in the sentence. If a vocabulary word occurs in the sentence, a one is entered at its position, otherwise a zero. For S1, the first four words of the vocabulary are present in the sentence, so the vector starts with four ones followed by four zeros.

Similarly, the feature vectors for S2 and S3 will be:

S2 = [0, 1, 1, 0, 1, 1, 0, 0]

S3 = [0, 0, 0, 1, 0, 0, 1, 1]
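A minimal sketch of this construction in Python:

```python
docs = ["It is cold outside", "The weather is cold", "I am outside"]
# Vocabulary in first-seen order across the three documents
vocab = ["it", "is", "cold", "outside", "the", "weather", "i", "am"]

def bow_vector(doc):
    """Binary bag-of-words vector: 1 if the vocabulary word occurs in the document."""
    words = doc.lower().split()
    return [1 if word in words else 0 for word in vocab]

for doc in docs:
    print(bow_vector(doc))
# [1, 1, 1, 1, 0, 0, 0, 0]
# [0, 1, 1, 0, 1, 1, 0, 0]
# [0, 0, 0, 1, 0, 0, 1, 1]
```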

TF-IDF

Now that you know how the bag of words approach works, let's see how TF-IDF is related to it.

In a simple bag of words, every word is given equal importance. The idea behind TF-IDF is that the words that occur more frequently in one document and less frequently in other documents should be given more importance as they are more useful for classification.

TF-IDF is a product of two terms: TF and IDF.

Term Frequency measures how often a word occurs in a specific document, normalized by the length of the document. It is calculated as:

TF  = (Frequency of a word in the document)/(Total words in the document)

Inverse Document Frequency for a specific word is based on the total number of documents divided by the number of documents that contain that word. The log of the ratio is taken to dampen the effect of the division. It is calculated as:

IDF = Log((Total number of docs)/(Number of docs containing the word))

For instance, in S1, the TF for the word "outside" will be 1/4 = 0.25. The IDF for the word "outside" will be Log(3/2) = 0.176 (base-10 logarithm), since "outside" occurs in two of the three documents. The TF-IDF value is 0.25 x 0.176 = 0.044.
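The arithmetic can be checked with a few lines of Python (base-10 logarithm, matching the numbers above):

```python
import math

docs = [
    ["it", "is", "cold", "outside"],
    ["the", "weather", "is", "cold"],
    ["i", "am", "outside"],
]

word = "outside"
tf = docs[0].count(word) / len(docs[0])               # 1/4 = 0.25
n_docs_with_word = sum(1 for d in docs if word in d)  # 2 (S1 and S3)
idf = math.log10(len(docs) / n_docs_with_word)        # log10(3/2) ≈ 0.176

print(round(tf * idf, 3))                             # 0.044
```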

These are tedious calculations. Fortunately, you do not have to do them by hand. The TfidfVectorizer class from the sklearn.feature_extraction.text module can be used to create feature vectors containing TF-IDF values. Look at the following script:

The max_features attribute specifies the number of most frequently occurring words for which you want to create feature vectors. Less frequent words do not play a major role in classification, so we only retain the 2000 most frequently occurring words in the dataset. The min_df value of 5 specifies that a word must occur in at least 5 documents. Similarly, the max_df value of 0.7 specifies that a word must not occur in more than 70 percent of the documents. The rationale behind choosing 70 percent as the threshold is that words occurring in more than 70 percent of the documents are too common to play any useful role in classifying sentiment.

Finally, to convert your dataset into the corresponding TF-IDF feature vectors, call the fit_transform method of the TfidfVectorizer class and pass it the preprocessed dataset.

Dividing Data to Training and Test Sets

Before building the actual sentiment analysis model, divide your dataset into training and test sets. The model will be trained on the training set and evaluated on the test set. The following script divides the data into training and test sets.
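A sketch of that split, using Scikit-Learn's train_test_split on stand-in arrays (the 80/20 split and random_state are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in feature matrix and labels; in the tutorial these are the
# TF-IDF vectors and the airline_sentiment labels.
X = np.arange(20).reshape(10, 2)
y = np.array(["pos", "neg"] * 5)

# 80 percent training, 20 percent testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

print(X_train.shape, X_test.shape)   # (8, 2) (2, 2)
```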

Training and Evaluating the Text Classification Model

Now that the data is divided into training and test sets, the next step is to train the model on the training set and evaluate its performance on the test set. Use the RandomForestClassifier from the sklearn.ensemble module to train your model; you can use any other classifier of your choice. To train the model, call the "fit" method on the classifier object and pass it the training feature set and the training label set as shown below:

 

To make predictions on the test set, pass the test set to the "predict" method as shown below:

 

Finally, to evaluate the classification model that you developed, you can use the confusion matrix, the classification report, and accuracy as performance metrics. These metrics can be calculated using classes from the sklearn.metrics module as shown below:
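The training, prediction, and evaluation steps can be sketched together as follows; random stand-in data replaces the TF-IDF features here, and n_estimators=200 is an illustrative choice:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Stand-in data; in the tutorial X holds TF-IDF vectors and y the sentiments.
rng = np.random.RandomState(0)
X = rng.rand(100, 5)
y = np.where(X[:, 0] > 0.5, "positive", "negative")

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Train the random forest on the training set
classifier = RandomForestClassifier(n_estimators=200, random_state=0)
classifier.fit(X_train, y_train)

# Predict labels for the test set
predictions = classifier.predict(X_test)

# Evaluate with the three metrics named in the text
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
```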

Results

Our classifier achieved an accuracy of 75.47 percent.

Complete Code

Here is the complete code for this tutorial:
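A self-contained sketch that stitches the steps together; it substitutes a tiny inline sample for the real CSV (and relaxes min_df accordingly), so treat it as a template rather than the exact original:

```python
import re
from io import StringIO

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# In the tutorial the data comes from the airline-tweets CSV:
#   tweets = pd.read_csv("<path-or-URL-to-the-airline-CSV>")
# A tiny inline sample keeps this sketch self-contained.
csv_data = StringIO(
    "airline_sentiment,text\n"
    + "negative,@united you lost my bag again\n" * 10
    + "positive,@VirginAmerica the crew was great\n" * 10
)
tweets = pd.read_csv(csv_data)

X_raw = tweets["text"].values
y = tweets["airline_sentiment"].values

# Text preprocessing with the regexes described earlier
processed_tweets = []
for tweet in X_raw:
    t = re.sub(r'\W', ' ', str(tweet))
    t = re.sub(r'\s+[a-zA-Z]\s+', ' ', t)
    t = re.sub(r'^[a-zA-Z]\s+', ' ', t)
    t = re.sub(r'\s+', ' ', t, flags=re.I)
    t = re.sub(r'^b\s+', '', t)
    processed_tweets.append(t.lower())

# min_df relaxed for the tiny sample; the tutorial uses min_df=5.
tfidf = TfidfVectorizer(max_features=2000, min_df=1, max_df=0.7)
X = tfidf.fit_transform(processed_tweets).toarray()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifier = RandomForestClassifier(n_estimators=200, random_state=0)
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)

print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))
```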

 

Conclusion

Sentiment analysis is one of the major tasks in natural language processing. To apply statistical techniques for sentiment analysis, you need to convert text to numbers. In this article, you saw how the TF-IDF approach can be used to create numeric feature vectors from text. Our sentiment analysis model achieves an accuracy of around 75% for sentiment prediction. I suggest that you try support vector machines and neural network classifiers to train your models and see how much accuracy you achieve.

 

 
