Extracting Facebook Posts & Comments with BeautifulSoup & Requests

Facebook is the biggest social network of our times, containing a lot of valuable data that can be useful in so many cases. Imagine being able to extract this data and use it as your project’s dataset.

In this tutorial, you are going to use Python to extract data from any Facebook profile or page. For a predefined number of posts, you will extract:

  • Post URL
  • Post text
  • Post media URL

You will also extract the comments of each post and, from each comment:

  • Profile name
  • Profile URL
  • Comment text

Of course, there is plenty more data that can be extracted from Facebook, but for this tutorial that will be enough.

Python Packages

For this tutorial, you will need the following Python packages:

  • requests
  • re
  • json
  • time
  • logging
  • collections
  • bs4 (BeautifulSoup)

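Only requests and bs4 (installed as beautifulsoup4) are third-party packages; the others ship with Python's standard library. It is good practice to install them in a Python virtual environment dedicated to this project.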

Scraping Facebook with Requests

As you may know, Facebook relies heavily on JavaScript, but the requests package does not render JavaScript; it only allows you to make simple web requests like GET and POST.

Important: In this tutorial, you will be scraping and crawling the mobile version of Facebook since it will allow you to extract the needed data with simple requests.

How will the script crawl and scrape Facebook mobile?

First of all, you need to be clear about what exactly the script will do. The script will:

  1. Receive a list of Facebook profile URLs from a file.
  2. Receive the credentials of a Facebook account from a file.
  3. Log in using a Session object from the requests package.
  4. For each profile URL, extract data from a predefined number of posts.

The main function of the script will look like this:
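A minimal sketch of such a main function is shown here. The argument names, the default post_limit, and the exact signatures of make_login and crawl_profile are assumptions; the two functions are passed in as arguments only so that the sketch runs on its own.

```python
import logging

logging.basicConfig(level=logging.INFO)

# Mobile version of Facebook, as used throughout this tutorial.
BASE_URL = "https://mobile.facebook.com"


def main(session, login_url, profile_urls, credentials,
         make_login, crawl_profile, post_limit=10):
    """Log in once, then crawl every profile URL and collect its posts.

    make_login and crawl_profile are injected callables (hypothetical
    signatures) so this sketch does not depend on code defined elsewhere.
    """
    logging.info("Logging in")
    if not make_login(session, login_url, credentials):
        logging.error("Login failed, nothing to crawl")
        return {}

    posts_by_profile = {}
    for profile_url in profile_urls:
        logging.info("Crawling %s", profile_url)
        posts_by_profile[profile_url] = crawl_profile(
            session, BASE_URL, profile_url, post_limit
        )
    logging.info("Done: %d profile(s) crawled", len(posts_by_profile))
    return posts_by_profile
```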

You are using the logging package to write log messages during execution, so you know what the script is actually doing.

Then you define a base_url that will be the Facebook mobile URL.

After extracting the input data from the files, you log in by calling the function make_login, which you will define shortly.

Then, for each profile URL in the input data, you scrape the data from a specific number of posts using the crawl_profile function.

Receiving the Input Data

As stated previously, the script needs to receive data from two different sources: a file containing profile URLs and another one containing the credentials of a Facebook account used to log in. Let’s define a function that will allow you to extract this data from JSON files:

This function will allow you to read JSON-formatted data and convert it into a Python object.
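A minimal sketch of such a helper could look like this. The function name json_to_obj is an assumption made for illustration.

```python
import json


def json_to_obj(filename):
    """Read a JSON file and return its content as a Python object.

    A JSON array becomes a list and a JSON object becomes a dict.
    """
    with open(filename, encoding="utf-8") as json_file:
        return json.load(json_file)
```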

The files profiles_urls.json and credentials.json are the ones that will contain the input data that the script needs.

profiles_urls.json:

credentials.json:

You will need to fill in the URLs of the profiles that you want to extract data from and the credentials of the Facebook account used for the login.
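One plausible layout for the two files is a JSON list of URLs and a JSON object holding the account's email and password. The sketch below writes placeholder versions of both files; the key names email and pass and the helper name write_example_inputs are assumptions, so adjust them to whatever your login code expects.

```python
import json
import os

# Placeholder values only: replace them with real URLs and credentials.
EXAMPLE_PROFILE_URLS = [
    "https://mobile.facebook.com/<page-or-profile-name>",
]
EXAMPLE_CREDENTIALS = {
    "email": "<your-account-email>",   # assumed form field name
    "pass": "<your-account-password>",  # assumed form field name
}


def write_example_inputs(directory="."):
    """Write both input files into `directory` and return their paths."""
    contents = {
        "profiles_urls.json": EXAMPLE_PROFILE_URLS,
        "credentials.json": EXAMPLE_CREDENTIALS,
    }
    paths = {}
    for name, content in contents.items():
        path = os.path.join(directory, name)
        with open(path, "w", encoding="utf-8") as json_file:
            json.dump(content, json_file, indent=2)
        paths[name] = path
    return paths
```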

Logging into Facebook

To log in, you need to inspect the mobile version of the Facebook main page (mobile.facebook.com) to find the URL that the login form submits to.

If you right-click the “Log In” button and inspect it, you can find the form to which the credentials have to be sent:

The action URL of the form element with id="login_form" is the one you need for the login. Let’s define the function that will help you with this task:

Using the action URL from the form element, you can make a POST request with Python’s requests package. If the response is OK, you have logged in successfully; otherwise you wait a little and try again.
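The retry logic described above can be sketched as follows. The parameters max_attempts and wait_seconds are assumptions, and session is expected to behave like a requests.Session. Following the text, any successful HTTP status is treated as a successful login, which is a simplification.

```python
import logging
import time


def make_login(session, login_url, credentials, max_attempts=3, wait_seconds=5):
    """POST the credentials to the login form, retrying after a pause.

    Returns True as soon as a response reports success (response.ok),
    or False when every attempt has failed.
    """
    for attempt in range(1, max_attempts + 1):
        response = session.post(login_url, data=credentials)
        if response.ok:
            logging.info("Logged in on attempt %d", attempt)
            return True
        logging.warning(
            "Login attempt %d failed (HTTP %s)", attempt, response.status_code
        )
        if attempt < max_attempts:
            time.sleep(wait_seconds)  # wait a little before trying again
    return False
```

With a real session you would call it as make_login(requests.Session(), login_url, credentials), where login_url is the action URL found in the form.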

Crawling a Facebook Profile/Page

Once you are logged in, you need to crawl the Facebook profile or page URL in order to extract its public posts.

First you save the result of the get_bs function into the profile_bs variable. The get_bs function receives a Session object and a url variable:

The get_bs function makes a GET request using the Session object. If the response status is OK, it returns a BeautifulSoup object created from the response.

Let’s break down the crawl_profile function:
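A minimal sketch of get_bs, assuming that it returns None when the request fails and that the built-in html.parser is used (both are assumptions, not details given above):

```python
from bs4 import BeautifulSoup


def get_bs(session, url):
    """GET `url` with the session and parse the response.

    Returns a BeautifulSoup object, or None if the request did not succeed.
    """
    response = session.get(url)
    if not response.ok:
        return None
    # "html.parser" ships with Python, so no extra parser has to be installed.
    return BeautifulSoup(response.text, "html.parser")
```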

  1. Once you have the profile_bs variable, you define variables for the number of posts scraped, the posts and the posts id.
  2. Then you open a while loop that keeps iterating as long as the n_scraped_posts variable is less than the post_limit variable.
  3. Inside this while loop you try to find the HTML element that holds all of the elements where the posts are. If the Facebook URL is a Facebook page, the posts will be in the element with id='recent', but if the Facebook URL is a person’s profile, the posts will be in the element with id='structured_composer_async_container'.
  4. Once you know the elements in which the posts are, you can extract their URLs.
  5. Then, for each post URL that you have discovered, you call the scrape_post function and append its result to the scraped_posts list.
  6. If you have reached the number of posts that you predefined, you break the while loop.
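Steps 3 to 6 can be sketched for a single page of results as shown here. The pattern used to recognise post links (story.php or /posts/ in the href) and the helper names find_posts_container, find_post_urls and collect_posts are assumptions, and scrape_post is passed in as an argument so the sketch runs on its own. A plain for loop stands in for the while loop because only one page is processed.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: links to individual posts contain one of these fragments.
POST_LINK = re.compile(r"story\.php|/posts/")


def find_posts_container(profile_bs):
    """Return the element that wraps the posts, or None if it is missing."""
    container = profile_bs.find(id="recent")  # Facebook page
    if container is None:
        # Personal profile
        container = profile_bs.find(id="structured_composer_async_container")
    return container


def find_post_urls(profile_bs, base_url):
    """Return the absolute post URLs on the page, without duplicates."""
    container = find_posts_container(profile_bs)
    if container is None:
        return []
    urls = []
    for link in container.find_all("a", href=POST_LINK):
        url = urljoin(base_url, link["href"])
        if url not in urls:
            urls.append(url)
    return urls


def collect_posts(profile_bs, base_url, scrape_post, post_limit):
    """Scrape posts from one page until `post_limit` posts are collected."""
    scraped_posts = []
    n_scraped_posts = 0
    for post_url in find_post_urls(profile_bs, base_url):
        if n_scraped_posts >= post_limit:
            break
        scraped_posts.append(scrape_post(post_url))
        n_scraped_posts += 1
    return scraped_posts
```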

Scraping Data from Facebook Posts

Now let’s take a look at the function that will allow you to start the real scraping:

This function starts by creating an OrderedDict object that will hold the post data:

  • Post URL
  • Post text
  • Post media URL
  • Comments

First you need the post’s HTML code in a BeautifulSoup object, so you use the get_bs function for that.

Since you already know the post URL at this point, you just need to add it to the post_data object.

To extract the post text you need to find the post’s main element, as follows:

You look for the div containing all the text. This element can contain several <p> tags with text, so you iterate over all of them and extract the text of each one.
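The iteration over the <p> tags can be sketched like this. How the post's main element is located is not shown, so the sketch simply receives that element as an argument; the function name extract_post_text and the fallback for posts without <p> tags are assumptions.

```python
from bs4 import BeautifulSoup


def extract_post_text(post_element):
    """Join the text of every <p> tag inside the post's main element."""
    paragraphs = post_element.find_all("p")
    if not paragraphs:
        # No <p> tags at all: fall back to the element's own text.
        return post_element.get_text(" ", strip=True)
    return "\n".join(p.get_text(" ", strip=True) for p in paragraphs)


# Usage with a small hand-written fragment instead of a real post page.
sample = BeautifulSoup(
    "<div><p>Hello <b>world</b></p><p>Second paragraph</p></div>",
    "html.parser",
)
print(extract_post_text(sample.div))
```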

After that you extract the post media URL. A Facebook post can contain images or a video, or it can be text only:

Finally, you call the function extract_comments to extract the remaining data:
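One way to cover the three cases is sketched here. It assumes that video links contain video_redirect in their href and that the first <img> inside the post element is the post's image; both are assumptions about the mobile markup and may need adjusting, for example to skip profile pictures. The function name extract_media_url is hypothetical.

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: links to a post's video contain this fragment.
VIDEO_LINK = re.compile(r"video_redirect")


def extract_media_url(post_element, base_url):
    """Return the video or image URL of a post, or None for text-only posts."""
    video_link = post_element.find("a", href=VIDEO_LINK)
    if video_link is not None:
        return urljoin(base_url, video_link["href"])
    image = post_element.find("img", src=True)
    if image is not None:
        return urljoin(base_url, image["src"])
    return None


# Usage with a hand-written fragment that contains a single image.
sample = BeautifulSoup(
    '<div><img src="https://example.com/photo.jpg"></div>', "html.parser"
)
print(extract_media_url(sample.div, "https://mobile.facebook.com"))
```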
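Putting the pieces together, the post data can be assembled as sketched below. The key names and the function name build_post_data are assumptions, and the three extractor functions are passed in as arguments so that the sketch is self-contained.

```python
from collections import OrderedDict


def build_post_data(post_bs, post_url, extract_text, extract_media_url,
                    extract_comments):
    """Collect the four pieces of post data in a fixed key order."""
    post_data = OrderedDict()
    post_data["url"] = post_url  # already known before the page is parsed
    post_data["text"] = extract_text(post_bs)
    post_data["media_url"] = extract_media_url(post_bs)
    post_data["comments"] = extract_comments(post_bs)
    return post_data
```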

Extracting Facebook Comments

This function is the largest in this tutorial. Here you iterate over a while loop until there are no more comments to be extracted:

You need to know whether you are extracting the first page of comments or one of the following pages, so you define a first_comment_page variable as True.

You check whether there is a “View More Comments” link; this tells you whether to keep iterating over the loop or not:

In the main loop of the function, you first check the value of first_comment_page. If it is True, you extract the comments from the current page; otherwise you make a request to the “View More Comments” URL:
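A minimal sketch of that check, assuming the link can be recognised by its visible text (the exact wording and markup on the live site may differ, and the function name find_more_comments_url is hypothetical):

```python
import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: the link is identified by its visible text.
MORE_COMMENTS = re.compile(r"view more comments", re.IGNORECASE)


def find_more_comments_url(page_bs, base_url):
    """Return the absolute URL of the next comments page, or None."""
    for link in page_bs.find_all("a", href=True):
        if MORE_COMMENTS.search(link.get_text(" ", strip=True)):
            return urljoin(base_url, link["href"])
    return None


# Usage with a hand-written fragment.
sample = BeautifulSoup(
    '<div><a href="/story.php?id=1&amp;p=10">View more comments</a></div>',
    "html.parser",
)
print(find_more_comments_url(sample, "https://mobile.facebook.com"))
```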
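The control flow of that loop can be sketched independently of the parsing details. In this sketch, fetch_page and find_next_url are hypothetical callables standing in for "request the next page" and "find the View More Comments URL", and the function name iter_comment_pages is also an assumption.

```python
def iter_comment_pages(post_bs, find_next_url, fetch_page):
    """Yield the first comments page, then every following page.

    find_next_url(page) returns the URL of the next page or None;
    fetch_page(url) returns the parsed page or None if the request failed.
    """
    first_comment_page = True
    page_bs = post_bs
    next_url = None
    while True:
        if first_comment_page:
            # The first page is the post page you already have.
            first_comment_page = False
        else:
            page_bs = fetch_page(next_url)
            if page_bs is None:
                break
        yield page_bs
        next_url = find_next_url(page_bs)
        if next_url is None:
            break  # no more comments to load
```

Each yielded page can then be handed to the code that selects and parses the individual comments.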

After this you select all the HTML elements that contain the comments. If you right-click any comment and inspect it, you will see that each comment is inside a div with a 17-digit ID:

Knowing this, you can select all the elements as follows:

If no such elements are found, there are no comments to extract. Otherwise, for each comment you create an OrderedDict object in which you will save all the data from that comment:

Inside this loop you extract the comment text by looking for the HTML element that contains it. As with the post text, you need to find all the elements that contain strings and add each string to a list:
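One way to express that selection is a regular expression on the id attribute. The pattern below matches exactly 17 digits, as described above; if the IDs on the live site have a different length, the pattern needs adjusting. The function name find_comment_elements is hypothetical.

```python
import re

from bs4 import BeautifulSoup

# Exactly 17 digits, following the description of the comment IDs.
COMMENT_ID = re.compile(r"^\d{17}$")


def find_comment_elements(page_bs):
    """Return every div whose id is a 17-digit number."""
    return page_bs.find_all("div", id=COMMENT_ID)


# Usage with a hand-written fragment: only the first div is a comment.
sample = BeautifulSoup(
    '<div id="12345678901234567">comment</div><div id="header">x</div>',
    "html.parser",
)
print(len(find_comment_elements(sample)))
```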

Next, you need the media URL:

Once you have this data, you need the profile name and profile URL, which you can find as follows:

Once you have all the data you can get from a comment, you add that data to the list of comments. Next you need to check whether there is a “View More Comments” link:

The loop that extracts the comments stops when it cannot find any more comments, and the loop that extracts the post data stops once it reaches the post limit you have given it.
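A sketch of parsing a single comment is shown here. It assumes that the author's link is the first <a> inside an <h3> and that the comment text sits in the first <div> that follows that <h3>; both are assumptions about the mobile markup, and the function name parse_comment and the key names are hypothetical.

```python
from collections import OrderedDict
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def parse_comment(comment_element, base_url):
    """Extract profile name, profile URL and text from one comment element."""
    comment_data = OrderedDict(
        [("profile_name", None), ("profile_url", None), ("text", None)]
    )
    heading = comment_element.find("h3")  # assumed to hold the author link
    if heading is None:
        return comment_data
    author = heading.find("a")
    if author is not None:
        comment_data["profile_name"] = author.get_text(strip=True)
        if author.get("href"):
            comment_data["profile_url"] = urljoin(base_url, author["href"])
    text_div = heading.find_next_sibling("div")  # assumed to hold the text
    if text_div is not None:
        # Collect every string inside the element, then join them.
        comment_data["text"] = "".join(text_div.strings).strip()
    return comment_data


# Usage with a hand-written fragment.
sample = BeautifulSoup(
    '<div><h3><a href="/jane.doe">Jane Doe</a></h3><div>Nice post!</div></div>',
    "html.parser",
)
print(dict(parse_comment(sample.div, "https://mobile.facebook.com")))
```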

Complete Code

Running the Script

You can run the script by running the following command in your Terminal or CMD:

After completion you will have a JSON file containing the extracted data:
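Writing the results to disk can be sketched as follows. The function name save_posts and the output filename are assumptions; the structure of the file simply mirrors whatever the scraping functions returned.

```python
import json


def save_posts(posts, filename):
    """Write the scraped posts to `filename` as indented JSON."""
    with open(filename, "w", encoding="utf-8") as json_file:
        # ensure_ascii=False keeps accented characters and emoji readable.
        json.dump(posts, json_file, ensure_ascii=False, indent=2)
```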

 

Conclusion

This may seem like a simple script, but it has its tricks to master: you need experience with several subjects, such as regular expressions, requests and BeautifulSoup. We hope you have learned more about scraping in this post. As practice, you can try to extract the same data using different selectors, or even extract the number of reactions that a post has.

 
