Text Classification with Pandas & Scikit

In this tutorial, we introduce one of the most common NLP and text mining tasks: document classification. Although it is a common task, it is far from useless, since classifying content is a constant hurdle we humans face every day. It is important to know the basic elements of this problem, since many of them carry over to other tasks such as sentiment analysis.

Author: Ivan Pereira

 

What is NLP and Document Classification?

The field of Natural Language Processing (NLP) is one of the booming areas of Artificial Intelligence. It deals with the understanding of language, something very intimate to humans, and promises to deliver some of the features most needed for an AGI (Artificial General Intelligence) to be possible. Nowadays, though, NLP is not restricted to academic research; many IT giants such as Amazon and Netflix use techniques from this field extensively. As the field evolves, NLP is becoming more accessible, so enterprises and even small businesses can benefit from the exciting advances of recent years.

There are many interesting tasks that NLP addresses, the best known being Document Classification, Sentiment Analysis, and Natural Language Generation, among others. These tasks show up in a multitude of problems, such as spam mail filtering, user review filtering, sentiment analysis of user reviews, chatbots for answering questions and interacting with a user, and many others.

In this tutorial, we will be playing with a real-world problem: document classification on the Amazon reviews dataset. This dataset has reviews for many categories from the Amazon website, along with information about these reviews, such as their usefulness and the score of the product being reviewed. We will train a model and try to classify reviews from different categories.

The only prerequisite for this tutorial is some basic knowledge of Python programming. By the end of it, we will be able to perform basic text classification.

Data loading and visualization

The Amazon review dataset is a large corpus of reviews, with files ranging from 10 MB to 10 GB, covering diverse categories from automotive to musical instruments. To load these datasets we will install and introduce the Pandas library.

Pandas Library

The Pandas library is a standard Python tool for working with data. By data we mean any information around us that we consider of interest; it can be numeric or textual. For our task of classifying text, what matters most is data in text form. Pandas gives us tools to handle small to large bodies of text, the main one being the dataframe.

Dataframes are object-based structures for data storage and manipulation. Through their methods, we can perform many operations on the data. Common ones are filtering the data into smaller sets, adding new data or dataframes to it, and exchanging data with other dataframes. We will explore some of these operations soon.

Lastly, Pandas has good, up-to-date documentation, so we recommend you check it out. Running pip install pandas is enough to install it and its dependencies.

Now we just need to import the library with the command:
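By convention (an assumption here, since any alias works), pandas is imported under the name pd. The short sketch below also builds a tiny dataframe from made-up reviews and filters it, just to show what working with one looks like; the column names reviewText and overall are illustrative.

```python
import pandas as pd

# A tiny dataframe with made-up reviews, just to confirm the import works.
df = pd.DataFrame({
    "reviewText": ["Great strings, warm sound.", "Fits my car perfectly."],
    "overall": [5.0, 4.0],
})

# Filter rows: keep only the reviews with the top score.
top = df[df["overall"] == 5.0]
print(top["reviewText"].tolist())
```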

To download the datasets we will be working on, check this link: http://jmcauley.ucsd.edu/data/amazon/ . For this tutorial, we will explore subsets of the following datasets:

– The Amazon Instant Video Dataset
– The Automotive Dataset
– The Musical Instruments Dataset
– The Office Product Dataset
– The Patio Lawn and Garden Dataset

To load them into dataframes, the website provides us with the following functions:
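A minimal sketch of such loader functions, assuming each line of a downloaded .json.gz file holds one review as strict JSON. The names parse and get_df are illustrative and not necessarily those used on the dataset page.

```python
import gzip
import json

import pandas as pd


def parse(path):
    """Yield one review (a dict) per line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield json.loads(line)


def get_df(path):
    """Read every review in the file into a pandas dataframe."""
    return pd.DataFrame(list(parse(path)))
```

If a file turns out not to be strict JSON, json.loads will raise an error and a more lenient parser would be needed.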

These functions read the downloaded datasets and return them as pandas dataframes. We use them to load the 5 datasets (the files need to be in the same folder as this script) into dataframes:
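A sketch of this loading step. The file names below are assumptions about how the downloaded files are named, the category keys are illustrative, and load_all is a hypothetical helper; the loader argument stands for whatever function reads one file into a dataframe.

```python
# Assumed file names; adjust them to match the files you downloaded.
DATASET_FILES = {
    "video": "reviews_Amazon_Instant_Video_5.json.gz",
    "automotive": "reviews_Automotive_5.json.gz",
    "musical": "reviews_Musical_Instruments_5.json.gz",
    "office": "reviews_Office_Products_5.json.gz",
    "patio": "reviews_Patio_Lawn_and_Garden_5.json.gz",
}


def load_all(files, loader):
    """Apply `loader` to every path and return {category: loaded data}."""
    return {name: loader(path) for name, path in files.items()}
```

Calling load_all(DATASET_FILES, loader) then returns one dataframe per category, keyed by the category name.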

For this tutorial, though, we will work only with the first thousand reviews from each of the categories. We consider this number sufficient for an introduction, and it does not need too much RAM for processing. The following code shows this process:
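A sketch of this step on stand-in data. The two toy dataframes and the keys 'video' and 'automotive' are assumptions for illustration; with the real data there would be five dataframes and five keys.

```python
import pandas as pd

# Stand-ins for two loaded category dataframes (the real ones are far larger).
video_df = pd.DataFrame({"reviewText": [f"video review {i}" for i in range(1500)]})
auto_df = pd.DataFrame({"reviewText": [f"auto review {i}" for i in range(1200)]})

# Keep only the review text of the first 1000 reviews of each category.
frames = [df["reviewText"].head(1000) for df in (video_df, auto_df)]

# Concatenate into one object; each key labels the rows of one dataset.
data = pd.concat(frames, keys=["video", "automotive"])
```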

The first 1000 review texts from each dataframe are stored in the frames list. In the last line, they are concatenated into a single object, with a key set to identify each of the datasets.

Lastly, we create labels for each review in the dataframe corresponding to its category (0,1,2,3,4).
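One way to build such labels, assuming the concatenated data holds 1000 reviews per category in a fixed category order (the category names are illustrative):

```python
categories = ["video", "automotive", "musical", "office", "patio"]
reviews_per_category = 1000

# Label i is repeated once for every review of the i-th category.
labels = []
for label in range(len(categories)):
    labels.extend([label] * reviews_per_category)
```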

We finish by cleaning up the loaded dataframes, since we have already extracted the first thousand reviews of each that we will work on.

You can check the datasets through the keys defined in the concatenation process. For that, we use the label-based indexer of pandas, loc. By giving it a key, we get access to only the reviews of the corresponding category. You can print and check its output.

Output:
0 I had big expectations because I love English …
1 I highly recommend this series. It is a must f…
2 This one is a real snoozer. Don’t believe anyt…
3 Mysteries are interesting. The tension betwee…
4 This show always is excellent, as far as briti…
5 I discovered this series quite by accident. Ha…
6 It beats watching a blank screen. However, I j…
7 There are many episodes in this series, so I p…
8 This is the best of the best comedy Stand-up. …
9 Not bad. Didn’t know any of the comedians but…
10 Funny, interesting, a great way to pass time. …
11 I love the variety of comics. Great for dinne…
12 comedy is a matter of taste. this guy was a li…
13 if this had to do with Dat Phan, he was hilari…
14 Watched it for Kevin Hart and only Kevin Hart!…
15 he’s OK. His humor consists mainly of varying …
16 some comedians are very good, some not so good…
17 I only watched the Wanda Sykes portion of this…
18 Enjoyed some of the comedians, it was a joy to…
19 All the comedians are hilarious. I have seen t…
20 There were some good entertainers, and some ar…
21 Very funny. Some of the comedians were funnier…
22 Great variety of good comics. Each show is jus…
23 I loved the humor of the stand-up comics featu…
24 It was fine – not my favorite but a comic on h…
25 It is nice to see some of the more popular com…
26 This is a cute series, and I did watch two epi…
27 Season 2 of It’s Always Sunny In Philadelphia …
28 Each episode gives me more entertainment than …
29 Got these for my son’s birthday. He says they …

970 I reviewed Season 5 in detail but I will tell …
971 Vic Mackey is one of the most real characters …
972 The hardest part about watching these episodes…
973 The Shield is one of the best tv show ever! It…
974 Like so many series do after a few years, Seas…
975 At the time this show took you by the throat a…
976 A really great show. One of the best cop drama…
977 In this season Vic is trying to be as good as …
978 Great job by Michael Chiklis and Walton Goggin…
979 With Season 6 we find Shane in a conundrum ove…
980 The Shield: Season Sixwas Shawn Ryan’s real fa…
981 I own all seven seasons of the Shield. Season …
982 Keeps you wanting to see more and more I just …
983 This season is good, but not great. That can b…
984 I gave it this rating because of how this TV s…
985 Caution: Spoilers ahead (although I personally…
986 Vic is back, baby! I’m certainly an unapologet…
987 Season 6 is very intense and full of suspense….
988 I HAVE TO ASK MYSELF CONTINUALLY IS CHIKLIS JU…
989 wow i loved this series it was jammed pack wit…
990 I have been watching this series for a while n…
991 The Shield is one show with great stories, ver…
992 I gave this show 5 starts. It keeps you wantin…
993 Have enjoyed the unexpected turns that this se…
994 Alex character in this show is good. I didnt …
995 I have always marvelled at the writing staff o…
996 The Shield – wherein it is proven that nothing…
997 I have submitted comments on previous seasons …
998 It was a shame to see this series come to an end.
999 Season 6 is a slap in the face for those who f…
Name: reviewText, Length: 1000, dtype: object

If you want to define and see a range of reviews, the Python slice operator (:) works just fine:

Output:
0 I had big expectations because I love English …
1 I highly recommend this series. It is a must f…
2 This one is a real snoozer. Don’t believe anyt…
3 Mysteries are interesting. The tension betwee…
4 This show always is excellent, as far as briti…
Name: reviewText, dtype: object

For individual evaluation of reviews, you just need to specify the index. For example, we can see the first review in the video category:

Output:
“I had big expectations because I love English TV, in particular Investigative and detective stuff but this guy is really boring. It didn’t appeal to me at all.”
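The three lookups above (a whole category, a range, and a single review) can be sketched on a toy concatenated Series. The keys and review texts below are made up, so the printed value differs from the outputs shown above.

```python
import pandas as pd

video = pd.Series(["boring show", "must-see series", "real snoozer"], name="reviewText")
auto = pd.Series(["fits my car", "cheap wipers"], name="reviewText")
data = pd.concat([video, auto], keys=["video", "automotive"])

all_video = data.loc["video"]      # every review in the category
first_two = data.loc["video"][:2]  # a range of reviews
first_one = data.loc["video"][0]   # a single review, by index
print(first_one)
```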

As we have seen, individual evaluation is handy for learning individual opinions in the reviews. But to learn overall opinions, we would need to check every one of the thousand reviews of each category, which is a little hard to do. Luckily, there are handy tools that can process large bodies of text, and give us some insight into the general opinion of the customers. We will explore here one library, Wordcloud.

WordCloud

When dealing with datasets containing many words, it is important to do some statistical analysis, so that we better understand the problem we have in our hands. Tagcloud or WordCloud is a technique that has been very popular on blogs and websites to show the most popular tags that viewers access. Nowadays it has found new life in text analysis, where we can gain insight by checking the most frequent words in a text.

For our datasets, we will use the word_cloud library to visualize the most common words in each category. We convert each dataset to a single string containing all reviews from its category and feed it to word_cloud, which calculates the frequency of the words.

To plot the graphics, word_cloud needs matplotlib and Pillow installed. The libraries can be installed with:
pip install wordcloud
pip install matplotlib
pip install Pillow

These commands install the libraries and their dependencies. In the following code we loop through the datasets, outputting a word cloud of the word frequencies:
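A sketch of this loop, assuming the wordcloud and matplotlib packages are installed. The helper names word_frequencies and show_cloud and the toy reviews are illustrative. The frequencies are counted by hand here only to make the computation visible, since the library can also count words directly from raw text.

```python
from collections import Counter


def word_frequencies(reviews):
    """Count how often each lowercase word occurs across all reviews."""
    text = " ".join(reviews).lower()
    return Counter(text.split())


def show_cloud(frequencies, title):
    """Draw a word cloud; imports are local so the counting works without them."""
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(background_color="white").generate_from_frequencies(frequencies)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.title(title)
    plt.show()


# Toy stand-in for the per-category review texts.
datasets = {
    "video": ["great show great season", "the show was boring"],
    "musical": ["guitar strings sound great"],
}
frequencies = {name: word_frequencies(reviews) for name, reviews in datasets.items()}
```

Calling show_cloud(freqs, name) for each entry of frequencies then draws one cloud per category.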

 

We observe that every category has some frequent words that represent it. For example, we see many related words such as ‘season’, ‘show’, and ‘character’ for the video category. For the automobile category there is ‘car’, and for the musical category there are ‘guitar’, ‘sound’, ‘string’, etc.

We also see that words such as ‘use’, ‘used’, ‘one’, ‘will’ and others are frequent without being very representative of any dataset. Some or all of these words can be removed if deemed necessary through the use of stop words, explained in the next section.

We recommend that you guys try other methods from the word_cloud library, to gain additional insights about the datasets. Check the website http://amueller.github.io/word_cloud/ .

Text Pre-processing

Before using the dataset with learning methods, we need to preprocess it by removing words that do not help the classification process. These words are called **stopwords**, and they are mostly common function words such as ‘is’, ‘the’, ‘that’, as well as punctuation. We usually use a list of **stopwords** already collected by someone else; in our case, we use the NLTK stopwords list with 153 items. There is no perfect stopword list, and in many cases one has to create one manually. The list we chose is small compared to many others out there, but it contains the most common words that might hurt the classification later. We can also extend the stopword list with words from the word_cloud that we do not consider characteristic of a category, though one needs to be cautious about it.

We added the words ‘uses’, ‘use’, ‘using’, ‘used’, ‘one’, ‘also’ to the list, as they do not seem to be of much help in the document classification context, where we need words more unique to each category.

The preprocess function is defined next, and its objective is to prepare the data for the classification task. We convert all words to lowercase, remove the ones containing punctuation, and filter out the **stopwords**. We also perform *tokenization*, where the text is divided into parts (normally words). We return the tokens of each dataset bundled together in a list.
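A minimal sketch of such a preprocess function. The small STOPWORDS set is a hand-written stand-in for the NLTK list (which would normally come from nltk.corpus.stopwords.words('english') after downloading the corpus), extended with the extra words listed above. Splitting on whitespace is a simplification of tokenization.

```python
import string

# Stand-in for the NLTK English stopword list (assumption: abbreviated here).
STOPWORDS = {"i", "is", "the", "that", "a", "it", "this", "and", "to", "for"}
# Extra words added because they are frequent in every category.
STOPWORDS |= {"uses", "use", "using", "used", "one", "also"}


def preprocess(text):
    """Lowercase, tokenize on whitespace, drop punctuated words and stopwords."""
    tokens = text.lower().split()
    return [
        token
        for token in tokens
        if token not in STOPWORDS
        and not any(char in string.punctuation for char in token)
    ]


print(preprocess("I used this guitar for one show and it sounds great!"))
# prints ['guitar', 'show', 'sounds']
```

Note that 'great!' is dropped because it contains punctuation, which follows the rule stated above; stripping the punctuation instead would be an alternative design.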

We can now check how the datasets look in the word_cloud again:

Classification

For the classification step, it is hard and inappropriate to just feed a list of tokens with thousands of words to the classification model. So, before the classification, we need to transform the token lists into more compact information that the model can understand. This process is called featurization or feature extraction. We choose the BoW (Bag of Words) method for this purpose.

BoW is a simple but effective method for feature extraction. To understand it, we first explain two essential notions: the set of reviews/texts from one user is known as a ‘*document*’, and we define as the ‘*vocabulary*’ the set of all distinct words from all review texts of all categories. The BoW model then computes, for each document, a feature vector with the size of the vocabulary, containing the frequency of each word in that document. A simple illustration of documents, vocabulary, and feature vectors is as follows:

– document d1 = [‘dog’, ‘eats’, ‘meat’]
– document d2 = [‘cat’, ‘eats’, ‘fish’]

The vocabulary will be:
– vocabulary = {‘dog’,’cat’,’eats’,’meat’,’fish’}
and the features for each document are:
– features f1 = [1, 0, 1, 1, 0]
– features f2 = [0, 1, 1, 0, 1]

So in f1, the first, third and fourth elements of the vocabulary are active (dog, eats and meat respectively). To extract features from the documents, the BoW model just has to be a dictionary that stores the index of each vocabulary word, so we know which position in the feature vector to increment. The BoW for the last example would be:

– {‘dog’:0 , ‘cat’:1, ‘eats’:2, ‘meat’:3, ‘fish’:4}

The code for building a basic BoW model is shown next:
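A minimal sketch of building such a dictionary, assuming each document is already a list of tokens (build_bow is an illustrative name). Indices are assigned in order of first appearance, so they can differ from the hand-ordered example above.

```python
def build_bow(documents):
    """Map every distinct token to the next free index in the vocabulary."""
    bow = {}
    for tokens in documents:
        for token in tokens:
            if token not in bow:
                bow[token] = len(bow)
    return bow


d1 = ["dog", "eats", "meat"]
d2 = ["cat", "eats", "fish"]
bow = build_bow([d1, d2])
print(bow)
```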

So, to extract a feature vector from one of the token lists we got in the preprocessing step, we need to allocate a list with the size of the vocabulary (that is, the size of the BoW dictionary) and increment the positions our BoW model gives us.
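A sketch of this step using the example dictionary from above. The name featurize is illustrative, and ignoring words that are not in the vocabulary is an assumption about how unseen words should be handled.

```python
def featurize(tokens, bow):
    """Return the word-count vector of one tokenized document."""
    features = [0] * len(bow)
    for token in tokens:
        if token in bow:  # words outside the vocabulary are ignored
            features[bow[token]] += 1
    return features


bow = {"dog": 0, "cat": 1, "eats": 2, "meat": 3, "fish": 4}
f1 = featurize(["dog", "eats", "meat"], bow)
f2 = featurize(["cat", "eats", "fish"], bow)
```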

The complete algorithm for extracting features from all our data is described next. The preprocessing is applied to all data, and each tokenized review is featurized and appended to the batch. This process repeats until there are no more tokenized reviews.
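Putting the steps together, a compact sketch of the whole extraction loop. The tokenizer here is a bare lowercase split standing in for the preprocess function, and all names are illustrative.

```python
def tokenize(text):
    """Simplified stand-in for the preprocessing step."""
    return text.lower().split()


def extract_features(reviews):
    """Return (batch, bow): one count vector per review plus the vocabulary map."""
    tokenized = [tokenize(review) for review in reviews]

    # First pass: build the vocabulary index over all reviews.
    bow = {}
    for tokens in tokenized:
        for token in tokens:
            bow.setdefault(token, len(bow))

    # Second pass: featurize each tokenized review and append it to the batch.
    batch = []
    for tokens in tokenized:
        features = [0] * len(bow)
        for token in tokens:
            features[bow[token]] += 1
        batch.append(features)
    return batch, bow


batch, bow = extract_features(["Dog eats meat", "Cat eats fish"])
```

The vocabulary is built over all reviews before any vector is allocated, so every feature vector in the batch has the same length.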

In the end, we will work only with the batch of features for the classification step. For that, we make use of a handy library for text processing and NLP: scikit-learn.

Scikit-learn

Scikit-learn is one of the standard tools for text processing, NLP, and machine learning. It eases the development of NLP applications and has a plethora of machine learning models and tools. For this tutorial, we will train some basic models with the *fit* method and test them with the *predict* method.

To install it, just run pip install scikit-learn. Its dependencies, the numpy and scipy libraries, should be installed automatically along with it.

We divide the batch of features, along with the created labels, using train_test_split, which gives us a train set and a test set along with the corresponding labels.
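A sketch of the split on made-up features and labels. The test_size and random_state values are assumptions, since the tutorial does not state which ones were used.

```python
from sklearn.model_selection import train_test_split

# Toy batch: 20 feature vectors of length 3 with alternating labels.
batch = [[i, i % 2, 1] for i in range(20)]
labels = [i % 2 for i in range(20)]

X_train, X_test, y_train, y_test = train_test_split(
    batch, labels, test_size=0.25, random_state=0
)
print(len(X_train), len(X_test))
```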

So we will use the train set to adjust our model through the fit method, and check the prediction accuracy soon after.

We see that the *Perceptron* classification model gives us an accuracy of 91%. Not bad at all, since we have barely fiddled with the parameters of the model.
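A sketch of the fit and predict steps on a tiny, made-up set of count vectors. It will not reproduce the 91% figure, which comes from the real review data, and the random_state value is an assumption.

```python
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score

# Toy count vectors over the vocabulary [guitar, sound, car, engine].
X_train = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
y_train = [0, 0, 1, 1]  # 0 = musical, 1 = automotive
X_test = [[3, 1, 0, 0], [0, 0, 1, 3]]
y_test = [0, 1]

model = Perceptron(random_state=0)
model.fit(X_train, y_train)          # adjust the model on the train set
predictions = model.predict(X_test)  # classify unseen vectors
print("accuracy:", accuracy_score(y_test, predictions))
```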

Lastly, we show the usefulness of the scikit-learn library by presenting the complete document classification code, which makes heavy use of this library.

Output:
n_samples: 3350, n_features: 16587
accuracy: 0.946

The best result among the models we tried comes from the Multinomial Naive Bayes method, with 94.6% accuracy. You guys can try other models to see if they can get a better result.
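A compact sketch of what a scikit-learn based script of this kind can look like, using CountVectorizer for the bag of words and MultinomialNB as the classifier. This is not the original script: the toy reviews, labels, and parameter choices below are assumptions, so the figures above will not be reproduced.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "great show, loved every season",
    "boring episode and weak characters",
    "these strings give my guitar a warm sound",
    "solid guitar stand, holds the instrument well",
]
train_labels = [0, 0, 1, 1]  # 0 = video, 1 = musical instruments

# CountVectorizer lowercases, tokenizes, drops English stopwords and counts words.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_texts, train_labels)

predicted = model.predict(["the guitar has a great sound", "a boring season"])
print(predicted)
```

Here the vectorizer takes over the tokenization, stopword filtering, and featurization that were written by hand in the earlier sections.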

And that’s all for now! In the future, we will look in more depth at how to deal with bigger datasets for many NLP problems.

Python Programmer, Computer Scientist, Researcher at Federal University of Maranhão, Brazil, and member of the Intelligent Distributed Systems Laboratory, who loves sharing what he knows. Among the topics Ivan is interested in are: Machine Learning (ML), Reinforcement Learning (RL), Game Theory (GT), Natural Language Processing (NLP), Computer Vision (CV), Time Series (TS), and other Artificial Intelligence (AI) related topics.
