Extracting Facebook Posts & Comments with BeautifulSoup & Requests

Facebook is the biggest social network of our time and holds a great deal of data that can be useful in many projects. Imagine being able to extract this data and use it as your project’s dataset.

In this tutorial, you are going to use Python to extract data from any Facebook profile or page. From a predefined number of posts, you will extract:

  • Post URL
  • Post text
  • Post media URL

You will also extract the comments on each post and, from each comment:

  • Profile name
  • Profile URL
  • Comment text

Of course, there is plenty more data that can be extracted from Facebook, but for this tutorial that will be enough.

Python Packages

For this tutorial, you will need the following Python packages and modules:

  • requests
  • re
  • json
  • time
  • logging
  • collections
  • bs4 (BeautifulSoup)

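Only requests and bs4 (installed from PyPI as beautifulsoup4) are third-party packages; the rest ship with Python’s standard library. It is good practice to install the third-party packages in a virtual environment created for this project alone.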

Scraping Facebook with Requests

As you may know, Facebook relies heavily on JavaScript, but the requests package does not render JavaScript; it only lets you make plain HTTP requests such as GET and POST.

Important: In this tutorial, you will be scraping and crawling the mobile version of Facebook since it will allow you to extract the needed data with simple requests.

How will the script crawl and scrape Facebook mobile?

First of all, consider exactly what the script will do. The script will:

  1. Receive a list of Facebook profile URLs from a file.
  2. Receive the credentials of a Facebook account from another file.
  3. Log in using a Session object from the requests package.
  4. For each profile URL, extract data from a predefined number of posts.

The main function of the script will look like this:
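A minimal sketch of such a main routine follows. The name run_scraper, the default post limit, and the way the login and crawl helpers are passed in as the `login` and `crawl` parameters are assumptions made to keep the example self-contained; in the full script they would be make_login and crawl_profile.

```python
import logging


def run_scraper(session, profile_urls, credentials, login, crawl, post_limit=25):
    """Log in once, then crawl every profile URL and collect the results.

    `login` and `crawl` stand in for make_login and crawl_profile; the
    signatures used here are assumptions, not the tutorial's own.
    """
    logging.basicConfig(level=logging.INFO)
    base_url = "https://mobile.facebook.com"  # Facebook mobile URL

    login(session, base_url, credentials)
    logging.info("Logged in; crawling %d profile(s)", len(profile_urls))

    posts_data = []
    for profile_url in profile_urls:
        # Scrape up to post_limit posts from this profile or page.
        posts_data.append(crawl(session, base_url, profile_url, post_limit))

    logging.info("Finished crawling")
    return posts_data
```

In a real run, `session` would be a requests.Session() and the two lists of input data would come from the JSON files described in the next section.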

You use the logging package to write log messages during execution, so you know what the script is actually doing.

Then you define a base_url that holds the Facebook mobile URL.

After reading the input data from the files, you log in by calling the function make_login, which you will define shortly.

Then, for each profile URL in the input data, you scrape the data from a specific number of posts using the crawl_profile function.

Receiving the Input Data

As stated previously, the script needs to receive data from two different sources: a file containing profile URLs and another one containing the credentials of a Facebook account used to log in. Let’s define a function that will allow you to extract this data from JSON files:

This function reads JSON-formatted data and converts it into a Python object.
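A minimal sketch of such a function, using only the standard json module. The name json_to_obj is an assumption.

```python
import json


def json_to_obj(filename):
    """Read a JSON file and return the equivalent Python object."""
    with open(filename, encoding="utf-8") as json_file:
        return json.load(json_file)
```

A JSON array comes back as a Python list and a JSON object as a dict, so the same helper can serve both input files.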

The files profiles_urls.json and credentials.json contain the input data that the script needs.

profiles_urls.json:

credentials.json:

You will need to fill in the URLs of the profiles that you want to extract data from and the Facebook account’s credentials for the login.
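As an illustration, the snippet below writes both input files with placeholder values. The layouts are assumptions: a plain list of URLs for profiles_urls.json, and an object with "email" and "pass" keys for credentials.json (chosen to mirror the field names of the login form, which you should verify when inspecting the page).

```python
import json
from pathlib import Path

# Hypothetical file layouts; replace the placeholders with real values.
PROFILES_URLS = ["https://mobile.facebook.com/<profile-or-page>"]
CREDENTIALS = {"email": "<account-email>", "pass": "<account-password>"}


def write_input_files(directory):
    """Write placeholder input files into `directory` and return their paths."""
    directory = Path(directory)
    profiles_path = directory / "profiles_urls.json"
    credentials_path = directory / "credentials.json"
    profiles_path.write_text(json.dumps(PROFILES_URLS, indent=2), encoding="utf-8")
    credentials_path.write_text(json.dumps(CREDENTIALS, indent=2), encoding="utf-8")
    return profiles_path, credentials_path
```

Keep credentials.json out of version control, since it holds a real password once filled in.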

Logging into Facebook

To log in, you first need to inspect the main page of the mobile version of Facebook (mobile.facebook.com) to find the URL that the login form submits to.

If you right-click the “Log In” button and inspect it, you can find the form to which the credentials have to be sent:

The action URL of the form element with the id="login_form" is the one you need to log in. Let’s define the function that will help you with this task:

Using the action URL from the form element, you can make a POST request with Python’s requests package. If the response is OK, you have logged in successfully; otherwise, you wait a little and try again.
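A hedged sketch of the login step follows. The helper find_login_action, the retry count, and the wait time are assumptions; only the id="login_form" lookup and the POST-then-retry behaviour come from the text. Like the text, it treats any OK response as a successful login, which is a simplification.

```python
import logging
import time
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def find_login_action(login_page_html):
    """Return the action URL of the form with id="login_form", or None."""
    soup = BeautifulSoup(login_page_html, "html.parser")
    form = soup.find("form", id="login_form")
    return form.get("action") if form is not None else None


def make_login(session, base_url, credentials, retries=3, wait_seconds=5):
    """POST the credentials to the login form, retrying on a non-OK response."""
    login_page = session.get(base_url)
    action_url = find_login_action(login_page.text)
    if action_url is None:
        raise RuntimeError("Login form not found")

    login_url = urljoin(base_url, action_url)  # the action may be relative
    for attempt in range(1, retries + 1):
        response = session.post(login_url, data=credentials)
        if response.ok:
            logging.info("Logged in on attempt %d", attempt)
            return True
        logging.warning("Login attempt %d failed; retrying", attempt)
        time.sleep(wait_seconds)
    return False
```

Here `session` is expected to behave like a requests.Session, and `credentials` is the dictionary loaded from credentials.json.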

Crawling a Facebook Profile/Page

Once you are logged in, you need to crawl the Facebook profile or page URL in order to extract its public posts.

First you save the result of the get_bs function in the profile_bs variable. The get_bs function receives a Session object and a url variable:

The get_bs function makes a GET request using the Session object. If the response status is OK, it returns a BeautifulSoup object created from the response.
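A minimal sketch of get_bs as described. Returning None on a failed request and using the built-in "html.parser" are assumptions.

```python
from bs4 import BeautifulSoup


def get_bs(session, url):
    """GET `url` with the session; return a BeautifulSoup object, or None."""
    response = session.get(url)
    if not response.ok:
        return None
    return BeautifulSoup(response.text, "html.parser")
```

Callers should check for None before searching the result, since a failed request yields no document to parse.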

Let’s break down this crawl_profile function:

  1. Once you have the profile_bs variable, you define variables for the number of posts scraped, the scraped posts and the post IDs.
  2. Then you open a while loop that keeps iterating as long as the n_scraped_posts variable is less than the post_limit variable.
  3. Inside this while loop you try to find the HTML element that holds all of the posts. If the Facebook URL is a Facebook page, the posts will be in the element with the id='recent', but if the Facebook URL is a person’s profile, the posts will be in the element with the id='structured_composer_async_container'.
  4. Once you know the elements in which the posts are, you can extract their URLs.
  5. Then, for each post URL that you have discovered, you call the scrape_post function and append the result to the scraped_posts list.
  6. If you have reached the number of posts that you predefined, you break out of the while loop.
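The steps above can be sketched as follows for a single page of results. Only the two element ids come from the text; the helper names, the assumption that post links contain "/story.php", and the `scrape_post` callable taking just a URL are illustrative. Following further pages of posts is omitted.

```python
import re

POST_LINK = re.compile(r"/story\.php")  # assumption: post links contain this path


def find_posts_container(profile_bs):
    """Return the element holding the posts, or None if neither id is present."""
    container = profile_bs.find(id="recent")  # Facebook pages
    if container is None:
        # Personal profiles
        container = profile_bs.find(id="structured_composer_async_container")
    return container


def crawl_posts(profile_bs, scrape_post, post_limit):
    """Scrape up to post_limit distinct posts linked from one profile page."""
    scraped_posts = []
    seen_urls = set()
    container = find_posts_container(profile_bs)
    if container is None:
        return scraped_posts

    for link in container.find_all("a", href=POST_LINK):
        if len(scraped_posts) >= post_limit:
            break  # reached the predefined number of posts
        post_url = link["href"]
        if post_url in seen_urls:
            continue  # the same post can be linked more than once
        seen_urls.add(post_url)
        scraped_posts.append(scrape_post(post_url))
    return scraped_posts
```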

Scraping Data from Facebook Posts

Now let’s take a look at the function that does the actual scraping:

This function starts by creating an OrderedDict object that will hold the post data:

  • Post URL
  • Post text
  • Post media URL
  • Comments

First you need the post’s HTML code in a BeautifulSoup object, so you use the get_bs function for that.

Since you already know the post URL at this point, you just need to add it to the post_data object.
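A small sketch of how the post_data object might be set up. The key names and the helper name new_post_data are assumptions; an OrderedDict keeps the fields in the order listed above.

```python
from collections import OrderedDict


def new_post_data(post_url):
    """Create the container for one post, with the URL already filled in."""
    post_data = OrderedDict()
    post_data["url"] = post_url   # known before any scraping happens
    post_data["text"] = ""        # filled in from the post's <p> tags
    post_data["media_url"] = None # image or video URL, if the post has one
    post_data["comments"] = []    # filled in by extract_comments
    return post_data
```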

To extract the post text, you need to find the post’s main element, as follows:

You look for the div containing all the text, but this element can contain several <p> tags with text, so you iterate over all of them and extract their text.
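A minimal sketch of the paragraph loop. How the containing div is located is not shown here, so the function takes that element as its argument; the function name and joining paragraphs with newlines are assumptions.

```python
def extract_paragraph_text(container):
    """Join the text of every <p> tag found inside `container`."""
    paragraphs = container.find_all("p")
    return "\n".join(p.get_text(strip=True) for p in paragraphs)
```

If the container holds no <p> tags, the result is an empty string, which suits posts that consist of media only.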

After that, you extract the post media URL. A Facebook post can contain an image, a video, or only text:
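One way to handle the three cases is sketched below. The selectors are assumptions, not the tutorial’s own: a video is taken to be a link whose href contains "video", and an image is the first <img> with a src. The video check comes first because a video link may wrap a thumbnail image.

```python
import re

VIDEO_LINK = re.compile(r"video")  # assumption: video links contain "video"


def extract_media_url(post_el):
    """Return a video or image URL from a post element, or None if text only."""
    video_link = post_el.find("a", href=VIDEO_LINK)
    if video_link is not None:
        return video_link["href"]
    image = post_el.find("img", src=True)
    if image is not None:
        return image["src"]
    return None  # text-only post
```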

Finally you call the function extract_comments  to extract the remaining data:

Extracting Facebook Comments

This function is the largest in this tutorial. Here you iterate in a while loop until there are no more comments to be extracted:

You need to know whether you are extracting the first page of comments or one of the following pages, so you define a first_comment_page variable set to True.

You check whether there is a “View More Comments” link; this tells you whether to keep iterating over the loop or not:

In the main loop of the function, you first check the value of first_comment_page. If it is True, you extract the comments from the current page; otherwise you make a request to the “View More Comments” URL:
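The paging logic can be sketched as a generator. Matching the link by its visible text, the helper names, and the `fetch_page` callable (for example get_bs bound to a session) are assumptions; the set of seen URLs is an added guard against a link that points back to a page already visited.

```python
import re

MORE_COMMENTS = re.compile(r"view more comments", re.IGNORECASE)


def find_more_comments_url(page_bs):
    """Return the href of the "View More Comments" link, or None."""
    link = page_bs.find("a", string=MORE_COMMENTS)
    return link.get("href") if link is not None else None


def iter_comment_pages(post_bs, fetch_page):
    """Yield the post page first, then every following page of comments."""
    first_comment_page = True
    page_bs = post_bs
    more_url = None
    seen_urls = set()
    while True:
        if first_comment_page:
            first_comment_page = False  # comments are on the page we already have
        else:
            page_bs = fetch_page(more_url)
            if page_bs is None:
                break  # request failed
        yield page_bs
        more_url = find_more_comments_url(page_bs)
        if more_url is None or more_url in seen_urls:
            break  # no further pages
        seen_urls.add(more_url)
```

Note that matching on `string` only works when the link contains plain text and no nested tags.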

After this, you select all the HTML elements that contain the comments. If you right-click any comment and inspect it, you will see that each comment is inside a div with a 17-digit ID:

Knowing this, you can select all the elements as follows:
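A short sketch of that selection, passing a compiled regular expression as the id filter. The function name is an assumption.

```python
import re

COMMENT_ID = re.compile(r"^\d{17}$")  # exactly 17 digits


def find_comment_elements(page_bs):
    """Return every div whose id is a 17-digit number."""
    return page_bs.find_all("div", id=COMMENT_ID)
```

Anchoring the pattern with ^ and $ keeps it from matching longer ids that merely contain 17 digits.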

If no elements are found, there are no comments to extract. Otherwise, for each comment you create an OrderedDict object where you will save all the data from that comment:

Inside this loop, you extract the comment text by looking for the HTML element that contains it. As with the text of the post, you need to find all the elements that contain strings and add each string to a list:

Next, you need the media URL:

Once you have this data, you need the profile name and profile URL, which you can find as follows:
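A hedged sketch of collecting one comment’s data. The structure assumed here, with the first link in the comment pointing to the author’s profile and the first nested div holding the text, is a guess about the markup and should be checked against the page; the key names are likewise assumptions.

```python
from collections import OrderedDict


def scrape_comment(comment_el):
    """Collect profile name, profile URL and text from one comment element."""
    comment = OrderedDict()

    # Assumption: the first link inside the comment is the author's profile.
    author_link = comment_el.find("a", href=True)
    if author_link is not None:
        comment["profile_name"] = author_link.get_text(strip=True)
        comment["profile_url"] = author_link["href"]
    else:
        comment["profile_name"] = None
        comment["profile_url"] = None

    # Assumption: the first nested div holds the comment text.
    text_el = comment_el.find("div")
    strings = list(text_el.stripped_strings) if text_el is not None else []
    comment["text"] = " ".join(strings)
    return comment
```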

Once you have all the data you can get from a comment, you add that data to the list of comments. Next you need to check whether there is a “View More Comments” link:

The loop that extracts the comments stops when it cannot find any more comments, and the loop that extracts the post data stops once it reaches the post limit that you have given it.

Complete Code

Running the Script

You can run the script by running the following command in your Terminal or CMD:

After completion you will have a JSON file containing the data extracted:

 

Conclusion

This may seem like a simple script, but it takes some practice to master; you need experience with several subjects, such as regular expressions, requests and BeautifulSoup. We hope you have learned more about scraping in this post. As practice, you can try to extract the same data using different selectors, or even extract the number of reactions that a post has.

 

News API: Extracting News Headlines and Articles

News plays an essential role in our daily life. Whether you want to create your own news website or carry out a data analysis project, there is often a need to fetch different types of news articles or headlines, either to aggregate news from different sources in one place or to analyze it. The applications are many and, fortunately, there is a way to retrieve news articles from the web, from different sources at the same time.

In this tutorial you will learn how to extract news headlines and articles using the News API and save them to a CSV file.

Continue reading “News API: Extracting News Headlines and Articles”

Create a Translator Using Google Sheets API & Python

Spreadsheets are among the most popular office utilities in the world. Almost all professions use spreadsheets for a wide range of reasons, from tallying numbers and displaying them in graphs to doing unit conversions, just to mention a few.

Google Sheets is one of the more popular spreadsheet applications available today. Backed by the Google platform, it has some nifty features that make it stand out from its competitors.

In this tutorial, you will learn how to use the power of Google Sheets API and Python to build a simple language translator.

Continue reading “Create a Translator Using Google Sheets API & Python”

Chatbot Development with Python NLTK

Chatbots are intelligent agents that engage in a conversation with humans in order to answer user queries on a certain topic. Amazon’s Alexa, Apple’s Siri and Microsoft’s Cortana are some examples of chatbots.

Depending on their functionality, chatbots can be divided into three categories: general-purpose chatbots, task-oriented chatbots, and hybrid chatbots. General-purpose chatbots conduct a general discussion with the user (not on any specific topic). Task-oriented chatbots, on the other hand, are designed to perform specialized tasks, for example, to serve as an online ticket reservation system or a pizza delivery system. Finally, hybrid chatbots are designed for both general and task-oriented discussions.

Continue reading “Chatbot Development with Python NLTK”

Scraping Tweets and Performing Sentiment Analysis

Sentiment Analysis is a special case of text classification where users’ opinions or sentiments regarding a product are classified into predefined categories such as positive, negative or neutral. Public sentiment can then inform corporate decision making about a product that the public likes or dislikes.

Both rule-based and statistical techniques have been developed for sentiment analysis. With the advancements in Machine Learning and natural language processing, Sentiment Analysis techniques have improved a lot.

In this tutorial, you will see how Sentiment Analysis can be performed on live Twitter data. The tutorial is divided into two major sections: Scraping Tweets from Twitter and Performing Sentiment Analysis.

Continue reading “Scraping Tweets and Performing Sentiment Analysis”

Twitter Sentiment Analysis Using TF-IDF Approach

Text Classification is the process of classifying data in the form of text, such as tweets, reviews, articles, and blogs, into predefined categories. Sentiment analysis is a special case of Text Classification where users’ opinions or sentiments about a product are predicted from textual data.

In this tutorial, you will learn how to develop a Sentiment Analysis model that uses the TF-IDF feature generation approach and is capable of predicting user sentiment (i.e. the view or opinion that is held or expressed) about six airlines operating in the United States by analysing user tweets. You will use Python’s Scikit-Learn library for machine learning to implement the TF-IDF approach and to train the prediction model.

Continue reading “Twitter Sentiment Analysis Using TF-IDF Approach”

Postman REST API Client: Getting Started

REST technology is generally preferred to the more robust Simple Object Access Protocol (SOAP) technology because REST leverages less bandwidth, making it more suitable for internet usage.

REST APIs are all around us these days. Almost every major service provider on the internet provides some kind of REST API. There are many REST clients that can be used to interact with these APIs and test requests before writing your code. Postman is one of the world’s leading API Development Environments (ADEs), with many features built in.

In this tutorial, you are going to learn how to use Postman to make API calls with and without authorization.

Continue reading “Postman REST API Client: Getting Started”

Twitter API: Extracting Tweets with Specific Phrase

Twitter has been a good source for Data Mining. Many data scientists and analytics companies collect tweets and analyze them to understand people’s opinions on various matters.

In this tutorial, you will learn how to use the Twitter API and the Python Tweepy library to search for a word or phrase, extract the tweets that include it and print the results.

Note: This tutorial is different from our other Twitter API tutorial in that the current one uses Twitter Streaming API which fetches live tweets while the other tutorial uses the cursor method to search existing tweets. You can use the cursor to specify the language and tweet limit and you can also filter retweets using cursor.

Continue reading “Twitter API: Extracting Tweets with Specific Phrase”

Searching GitHub Using Python & GitHub API

GitHub is a web-based hosting service for version control using Git. It is mostly used for storing and sharing computer source code. It offers all of the distributed version control and source code management functionality of Git as well as adding its own features.

GitHub stores more than 3 million repositories with more than 1.7 million developers using it daily. With so much data, it can be quite daunting at first to find the information one needs or to do repetitive tasks, and that is when the GitHub API comes in handy.

In this tutorial, you are going to learn how to use the GitHub API to search for repositories and files that match particular keyword(s) and retrieve their URLs using Python. You will also learn how to download files or a specific folder from a GitHub repository.

Continue reading “Searching GitHub Using Python & GitHub API”

Amazon S3 with Python Boto3 Library

Amazon S3 is the Simple Storage Service provided by Amazon Web Services (AWS) for object-based file storage. With the growth of Big Data applications and cloud computing, it has become increasingly important to store “big data” in the cloud, where cloud applications can process it easily.

In this tutorial, you will learn how to use the Amazon S3 service via the Python library Boto3. You will learn how to create S3 buckets and folders, and how to upload files to and access files from S3 buckets. Eventually, you will have Python code that you can run on an EC2 instance to access your data while it is stored in the cloud.

Continue reading “Amazon S3 with Python Boto3 Library”