Facebook is the biggest social network of our time, and it holds a wealth of valuable data that can be useful in many situations. Imagine being able to extract this data and use it as your project’s dataset.
In this tutorial, you are going to use Python to extract data from any Facebook profile or page. The data that you will be extracting from a predefined number of posts is:
- Post URL
- Post text
- Post media URL
You will also be extracting comments from posts, and from each comment:
- Profile name
- Profile URL
- Comment text
Of course, there is plenty more data that can be extracted from Facebook, but for this tutorial that will be enough.
Python Packages
For this tutorial, you will need the following third-party Python packages:
- requests
- bs4 (BeautifulSoup)
- lxml (the parser that BeautifulSoup will use)

The script also uses the re, json, time, logging and collections modules, which ship with Python’s standard library and don’t need to be installed.
Remember to install these packages in a Python virtual environment dedicated to this project alone; it is better practice.
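As a minimal setup sketch on Linux or macOS (on Windows the activation command differs):

```
$ python -m venv venv
$ source venv/bin/activate
$ pip install requests beautifulsoup4 lxml
```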
Scraping Facebook with Requests
As you may know, Facebook relies heavily on JavaScript, but the requests package does not render JavaScript; it only lets you make simple HTTP requests such as GET and POST.
Important: In this tutorial, you will be scraping and crawling the mobile version of Facebook since it will allow you to extract the needed data with simple requests.
How will the script crawl and scrape Facebook mobile?
First of all, you need to take into account exactly what the script will be doing. The script will:
- Receive a list of Facebook profile URLs from a file.
- Receive credentials from a file to log in.
- Log in using a Session object from the requests package.
- For each profile URL, extract data from a predefined number of posts.
The main function of the script will look like this:
```python
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    base_url = 'https://mobile.facebook.com'
    session = requests.session()

    # Extract the credentials for the login and all of the profile URLs to scrape
    credentials = json_to_obj('credentials.json')
    profiles_urls = json_to_obj('profiles_urls.json')

    make_login(session, base_url, credentials)

    # Accumulate the posts scraped from every profile
    posts_data = []
    for profile_url in profiles_urls:
        posts_data.extend(crawl_profile(session, base_url, profile_url, 25))
    logging.info('[!] Scraping finished. Total: {}'.format(len(posts_data)))
    logging.info('[!] Saving.')
    save_data(posts_data)
```
You are using the logging package to emit log messages during the script’s execution so you know what it is actually doing.
Then you define a base_url that will be the Facebook mobile URL.
After extracting the input data from the files, you log in by calling the make_login function, which you will define shortly.
Then, for each profile URL in our input data, you scrape the data from a specific number of posts using the crawl_profile function.
Receiving the Input Data
As stated previously, the script needs to receive data from two different sources: a file containing profile URLs and another containing the credentials of a Facebook account for the login. Let’s define a function that will allow you to extract this data from JSON files:
```python
def json_to_obj(filename):
    """Extracts data from a JSON file and returns it as a Python object."""
    obj = None
    with open(filename) as json_file:
        obj = json.load(json_file)
    return obj
```
This function reads data formatted as JSON and converts it into a Python object.
The files profiles_urls.json and credentials.json are the ones that will contain the input data that the script needs.
profiles_urls.json:

```json
[
    "https://mobile.facebook.com/profileURL1/",
    "https://mobile.facebook.com/profileURL2"
]
```
credentials.json:

```json
{
    "email": "username@mail.com",
    "pass": "password"
}
```
You will need to replace these with the profile URLs you want to extract data from and with the credentials of the Facebook account used for the login.
Logging into Facebook
To log in, you will need to inspect the mobile version of the Facebook main page (mobile.facebook.com) to find the URL of the login form.
If you right-click the “Log In” button and inspect it, you can get to the form to which the credentials have to be sent:
The action URL of the form element with id="login_form" is the one you need for the login. Let’s define the function that will handle this task:
```python
def make_login(session, base_url, credentials):
    """Logs the given Session object in with the given credentials."""
    login_form_url = '/login/device-based/regular/login/?refsrc=https%3A'\
        '%2F%2Fmobile.facebook.com%2Flogin%2Fdevice-based%2Fedit-user%2F&lwv=100'
    params = {'email': credentials['email'], 'pass': credentials['pass']}
    while True:
        time.sleep(3)
        logged_request = session.post(base_url + login_form_url, data=params)
        if logged_request.ok:
            logging.info('[*] Logged in.')
            break
```
Using the action URL from the form element, you can make a POST request with Python’s requests package. If the response is OK, you have logged in successfully; otherwise, you wait a little and try again.
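Note that an OK response only confirms that the HTTP request succeeded, not necessarily that the credentials were accepted. As an optional sketch, you could also check the session cookies after the POST; in practice, Facebook sets a c_user cookie for authenticated sessions (an observed behavior, not a documented guarantee):

```python
def is_logged_in(session):
    """Heuristic login check: assumes Facebook sets a 'c_user' cookie
    for authenticated sessions (observed behavior, not guaranteed).
    """
    return 'c_user' in session.cookies.get_dict()
```

You could call this right after make_login and stop early if it returns False.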
Crawling a Facebook Profile/Page
Once you are logged in, you need to crawl the Facebook profile or page URL in order to extract its public posts.
```python
def crawl_profile(session, base_url, profile_url, post_limit):
    """Goes to the profile URL, crawls it and extracts posts URLs."""
    profile_bs = get_bs(session, profile_url)
    n_scraped_posts = 0
    scraped_posts = list()
    posts_id = None
    while n_scraped_posts < post_limit:
        # Pages keep their posts under id='recent'; personal profiles
        # use id='structured_composer_async_container'
        try:
            posts_id = 'recent'
            posts = profile_bs.find('div', id=posts_id).div.div.contents
        except Exception:
            posts_id = 'structured_composer_async_container'
            posts = profile_bs.find('div', id=posts_id).div.div.contents
        posts_urls = [a['href'] for a in profile_bs.find_all('a', text='Full Story')]
        for post_url in posts_urls:
            try:
                post_data = scrape_post(session, base_url, post_url)
                scraped_posts.append(post_data)
            except Exception as e:
                logging.info('Error: {}'.format(e))
            n_scraped_posts += 1
            if posts_completed(scraped_posts, post_limit):
                break
        if not posts_completed(scraped_posts, post_limit):
            # Follow the "Show more" link to load the next page of posts
            show_more_posts_url = profile_bs.find('div', id=posts_id).next_sibling.a['href']
            profile_bs = get_bs(session, base_url + show_more_posts_url)
            time.sleep(3)
        else:
            break
    return scraped_posts
```
First, you save the result of the get_bs function into the profile_bs variable. The get_bs function receives a Session object and a url variable:
```python
def get_bs(session, url):
    """Makes a GET request using the given Session object
    and returns a BeautifulSoup object.
    """
    r = None
    while True:
        r = session.get(url)
        time.sleep(3)
        if r.ok:
            break
    return BeautifulSoup(r.text, 'lxml')
```
The get_bs function makes a GET request using the Session object; if the response status is OK, it returns a BeautifulSoup object created from the response text.
Let’s break down this crawl_profile function:
- Once you have the profile_bs variable, you define variables for the number of posts scraped, the scraped posts, and the posts’ container id.
- Then you open a while loop that iterates as long as the n_scraped_posts variable is less than the post_limit variable.
- Inside this while loop, you try to find the HTML element that holds all of the posts. If the Facebook URL is a page, the posts will be in the element with id='recent', but if it is a person’s profile, the posts will be in the element with id='structured_composer_async_container'.
- Once you know the element that contains the posts, you can extract their URLs.
- Then, for each post URL you have discovered, you call the scrape_post function and append the result to the scraped_posts list.
- If you have reached the predefined number of posts, you break out of the while loop (see the usage sketch after this list).
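As a quick usage sketch, assuming the session is already logged in and the remaining helper functions are defined, scraping ten posts could look like this (the page URL is made up for the example):

```python
# 'SomePage' is a hypothetical page name used only for illustration
posts = crawl_profile(session, base_url, 'https://mobile.facebook.com/SomePage/', 10)
print('Scraped {} posts'.format(len(posts)))
```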
Scraping Data from Facebook Posts
Now let’s take a look at the function that performs the real scraping:
```python
def scrape_post(session, base_url, post_url):
    """Goes to the post URL and extracts the post data."""
    post_data = OrderedDict()

    post_bs = get_bs(session, base_url + post_url)
    time.sleep(5)

    # Here we populate the OrderedDict object
    post_data['url'] = post_url

    try:
        post_text_element = post_bs.find('div', id='u_0_0').div
        string_groups = [p.strings for p in post_text_element.find_all('p')]
        strings = [repr(string) for group in string_groups for string in group]
        post_data['text'] = strings
    except Exception:
        post_data['text'] = []

    try:
        post_data['media_url'] = post_bs.find('div', id='u_0_0').find('a')['href']
    except Exception:
        post_data['media_url'] = ''

    try:
        post_data['comments'] = extract_comments(session, base_url, post_bs, post_url)
    except Exception:
        post_data['comments'] = []

    return dict(post_data)
```
This function starts by creating an OrderedDict object that will hold the post data:
- Post URL
- Post text
- Post media URL
- Comments
First, you need the post’s HTML in a BeautifulSoup object, so you use the get_bs function for that.
Since you already know the post URL at this point, you just add it to the post_data object.
To extract the post text, you need to find the post’s main element, as follows:
```python
try:
    post_text_element = post_bs.find('div', id='u_0_0').div
    string_groups = [p.strings for p in post_text_element.find_all('p')]
    strings = [repr(string) for group in string_groups for string in group]
    post_data['text'] = strings
except Exception:
    post_data['text'] = []
```
You look for the div that contains all the text; since this element can hold several <p> tags with text, you iterate over all of them and extract their strings.
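To illustrate how .strings flattens the text of nested tags, here is a minimal, self-contained sketch; the HTML is made up for the example:

```python
from bs4 import BeautifulSoup

# Hypothetical post markup, used only to show how .strings behaves
html = '<div><p>Cute moments <a href="#">like these</a></p><p>Follow us</p></div>'
post_text_element = BeautifulSoup(html, 'lxml').div

string_groups = [p.strings for p in post_text_element.find_all('p')]
strings = [repr(string) for group in string_groups for string in group]
print(strings)  # ["'Cute moments '", "'like these'", "'Follow us'"]
```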
After that, you extract the post media URL. Facebook posts can contain images, video, or just text:
```python
try:
    post_data['media_url'] = post_bs.find('div', id='u_0_0').find('a')['href']
except Exception:
    post_data['media_url'] = ''
```
Finally, you call the extract_comments function to extract the remaining data:
```python
try:
    post_data['comments'] = extract_comments(session, base_url, post_bs, post_url)
except Exception:
    post_data['comments'] = []
```
Extracting Facebook Comments
This function is the largest in this tutorial. Here, you iterate in a while loop until there are no more comments to extract:
```python
def extract_comments(session, base_url, post_bs, post_url):
    """Extracts all comments from a post."""
    comments = list()
    show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))['href']
    first_comment_page = True
    logging.info('Scraping comments from {}'.format(post_url))
    while True:
        logging.info('[!] Scraping comments.')
        time.sleep(3)
        if first_comment_page:
            first_comment_page = False
        else:
            post_bs = get_bs(session, base_url + show_more_url)
            time.sleep(3)
        try:
            comments_elements = post_bs.find('div', id=re.compile('composer')).next_sibling\
                .find_all('div', id=re.compile(r'^\d+'))
        except Exception:
            # If the comments container is missing, treat it as no comments
            comments_elements = []
        if len(comments_elements) != 0:
            logging.info('[!] There are comments.')
        else:
            break
        for comment in comments_elements:
            comment_data = OrderedDict()
            comment_data['text'] = list()
            try:
                comment_strings = comment.find('h3').next_sibling.strings
                for string in comment_strings:
                    comment_data['text'].append(string)
            except Exception:
                pass
            try:
                media = comment.find('h3').next_sibling.next_sibling.children
                if media is not None:
                    for element in media:
                        comment_data['media_url'] = element['src']
                else:
                    comment_data['media_url'] = ''
            except Exception:
                pass
            comment_data['profile_name'] = comment.find('h3').a.string
            comment_data['profile_url'] = comment.find('h3').a['href'].split('?')[0]
            comments.append(dict(comment_data))
        show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))
        if 'View more' in show_more_url.text:
            logging.info('[!] More comments.')
            show_more_url = show_more_url['href']
        else:
            break
    return comments
```
You need to know whether you are extracting the first page of comments or one of the following pages, so you define a first_comment_page variable initialized to True.
You then check whether there is a “View more comments” link; this will tell you whether to keep iterating over the loop or not:
```python
show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))['href']
```
In the main loop of the function, you first check the value of first_comment_page: if it is True, you extract the comments from the current page; otherwise, you make a request to the “View more comments” URL:
```python
if first_comment_page:
    first_comment_page = False
else:
    post_bs = get_bs(session, base_url + show_more_url)
    time.sleep(3)
```
After this, you select all the HTML elements that contain the comments. If you right-click any comment and inspect it, you will see that each comment sits inside a div with a 17-digit numeric ID:
Knowing this, you can select all of these elements as follows:
```python
try:
    comments_elements = post_bs.find('div', id=re.compile('composer')).next_sibling\
        .find_all('div', id=re.compile(r'^\d+'))
except Exception:
    # If the comments container is missing, treat it as no comments
    comments_elements = []

if len(comments_elements) != 0:
    logging.info('[!] There are comments.')
else:
    break
```
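To see the id=re.compile(r'^\d+') selector in isolation, here is a minimal sketch; the markup is made up for the example, mimicking the numeric-ID divs that hold the comments:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: a composer div followed by the comments container
html = ('<div id="composer"></div>'
        '<div><div id="12345678901234567">first comment</div>'
        '<div id="98765432109876543">second comment</div>'
        '<div id="footer">not a comment</div></div>')
soup = BeautifulSoup(html, 'lxml')

comments_elements = soup.find('div', id=re.compile('composer')).next_sibling \
    .find_all('div', id=re.compile(r'^\d+'))
print([c.get_text() for c in comments_elements])  # ['first comment', 'second comment']
```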
If no elements are found, there are no more comments to extract. Now, for each comment, you create an OrderedDict object where you will save all of that comment’s data:
```python
for comment in comments_elements:
    comment_data = OrderedDict()
    comment_data['text'] = list()
    try:
        comment_strings = comment.find('h3').next_sibling.strings
        for string in comment_strings:
            comment_data['text'].append(string)
    except Exception:
        pass
    try:
        media = comment.find('h3').next_sibling.next_sibling.children
        if media is not None:
            for element in media:
                comment_data['media_url'] = element['src']
        else:
            comment_data['media_url'] = ''
    except Exception:
        pass
    comment_data['profile_name'] = comment.find('h3').a.string
    comment_data['profile_url'] = comment.find('h3').a['href'].split('?')[0]
    comments.append(dict(comment_data))
```
Inside this loop, you extract the comment text by looking for the HTML element that contains it. As with the post text, you need to find all the elements that contain strings and add each string to a list:
```python
try:
    comment_strings = comment.find('h3').next_sibling.strings
    for string in comment_strings:
        comment_data['text'].append(string)
except Exception:
    pass
```
Next, you need the media URL:
```python
try:
    media = comment.find('h3').next_sibling.next_sibling.children
    if media is not None:
        for element in media:
            comment_data['media_url'] = element['src']
    else:
        comment_data['media_url'] = ''
except Exception:
    pass
```
After getting this data, you need the profile name and profile URL, which you can find as follows:
```python
comment_data['profile_name'] = comment.find('h3').a.string
comment_data['profile_url'] = comment.find('h3').a['href'].split('?')[0]
```
Once you have all the data you can get from a comment, you add it to the list of comments. Next, you check whether there is a “View more comments” link:
```python
show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))
if 'View more' in show_more_url.text:
    logging.info('[!] More comments.')
    show_more_url = show_more_url['href']
else:
    break
```
The loop extracting the comments stops when it cannot find any more comments, and the loop extracting the post data stops once it reaches the post limit you gave it.
Complete Code
```python
import requests
import re
import json
import time
import logging
from collections import OrderedDict
from bs4 import BeautifulSoup


def get_bs(session, url):
    """Makes a GET request using the given Session object
    and returns a BeautifulSoup object.
    """
    r = None
    while True:
        r = session.get(url)
        time.sleep(3)
        if r.ok:
            break
    return BeautifulSoup(r.text, 'lxml')


def make_login(session, base_url, credentials):
    """Logs the given Session object in with the given credentials."""
    login_form_url = '/login/device-based/regular/login/?refsrc=https%3A'\
        '%2F%2Fmobile.facebook.com%2Flogin%2Fdevice-based%2Fedit-user%2F&lwv=100'
    params = {'email': credentials['email'], 'pass': credentials['pass']}
    while True:
        time.sleep(3)
        logged_request = session.post(base_url + login_form_url, data=params)
        if logged_request.ok:
            logging.info('[*] Logged in.')
            break


def crawl_profile(session, base_url, profile_url, post_limit):
    """Goes to the profile URL, crawls it and extracts posts URLs."""
    profile_bs = get_bs(session, profile_url)
    n_scraped_posts = 0
    scraped_posts = list()
    posts_id = None
    while n_scraped_posts < post_limit:
        # Pages keep their posts under id='recent'; personal profiles
        # use id='structured_composer_async_container'
        try:
            posts_id = 'recent'
            posts = profile_bs.find('div', id=posts_id).div.div.contents
        except Exception:
            posts_id = 'structured_composer_async_container'
            posts = profile_bs.find('div', id=posts_id).div.div.contents
        posts_urls = [a['href'] for a in profile_bs.find_all('a', text='Full Story')]
        for post_url in posts_urls:
            try:
                post_data = scrape_post(session, base_url, post_url)
                scraped_posts.append(post_data)
            except Exception as e:
                logging.info('Error: {}'.format(e))
            n_scraped_posts += 1
            if posts_completed(scraped_posts, post_limit):
                break
        if not posts_completed(scraped_posts, post_limit):
            # Follow the "Show more" link to load the next page of posts
            show_more_posts_url = profile_bs.find('div', id=posts_id).next_sibling.a['href']
            profile_bs = get_bs(session, base_url + show_more_posts_url)
            time.sleep(3)
        else:
            break
    return scraped_posts


def posts_completed(scraped_posts, limit):
    """Returns True if the number of posts scraped from the profile
    has reached its limit.
    """
    return len(scraped_posts) == limit


def scrape_post(session, base_url, post_url):
    """Goes to the post URL and extracts the post data."""
    post_data = OrderedDict()

    post_bs = get_bs(session, base_url + post_url)
    time.sleep(5)

    # Here we populate the OrderedDict object
    post_data['url'] = post_url

    try:
        post_text_element = post_bs.find('div', id='u_0_0').div
        string_groups = [p.strings for p in post_text_element.find_all('p')]
        strings = [repr(string) for group in string_groups for string in group]
        post_data['text'] = strings
    except Exception:
        post_data['text'] = []

    try:
        post_data['media_url'] = post_bs.find('div', id='u_0_0').find('a')['href']
    except Exception:
        post_data['media_url'] = ''

    try:
        post_data['comments'] = extract_comments(session, base_url, post_bs, post_url)
    except Exception:
        post_data['comments'] = []

    return dict(post_data)


def extract_comments(session, base_url, post_bs, post_url):
    """Extracts all comments from a post."""
    comments = list()
    show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))['href']
    first_comment_page = True
    logging.info('Scraping comments from {}'.format(post_url))
    while True:
        logging.info('[!] Scraping comments.')
        time.sleep(3)
        if first_comment_page:
            first_comment_page = False
        else:
            post_bs = get_bs(session, base_url + show_more_url)
            time.sleep(3)
        try:
            comments_elements = post_bs.find('div', id=re.compile('composer')).next_sibling\
                .find_all('div', id=re.compile(r'^\d+'))
        except Exception:
            # If the comments container is missing, treat it as no comments
            comments_elements = []
        if len(comments_elements) != 0:
            logging.info('[!] There are comments.')
        else:
            break
        for comment in comments_elements:
            comment_data = OrderedDict()
            comment_data['text'] = list()
            try:
                comment_strings = comment.find('h3').next_sibling.strings
                for string in comment_strings:
                    comment_data['text'].append(string)
            except Exception:
                pass
            try:
                media = comment.find('h3').next_sibling.next_sibling.children
                if media is not None:
                    for element in media:
                        comment_data['media_url'] = element['src']
                else:
                    comment_data['media_url'] = ''
            except Exception:
                pass
            comment_data['profile_name'] = comment.find('h3').a.string
            comment_data['profile_url'] = comment.find('h3').a['href'].split('?')[0]
            comments.append(dict(comment_data))
        show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))
        if 'View more' in show_more_url.text:
            logging.info('[!] More comments.')
            show_more_url = show_more_url['href']
        else:
            break
    return comments


def json_to_obj(filename):
    """Extracts data from a JSON file and returns it as a Python object."""
    obj = None
    with open(filename) as json_file:
        obj = json.load(json_file)
    return obj


def save_data(data):
    """Saves the scraped data as JSON."""
    with open('profile_posts_data.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    base_url = 'https://mobile.facebook.com'
    session = requests.session()

    # Extract the credentials for the login and all of the profile URLs to scrape
    credentials = json_to_obj('credentials.json')
    profiles_urls = json_to_obj('profiles_urls.json')

    make_login(session, base_url, credentials)

    # Accumulate the posts scraped from every profile
    posts_data = []
    for profile_url in profiles_urls:
        posts_data.extend(crawl_profile(session, base_url, profile_url, 25))
    logging.info('[!] Scraping finished. Total: {}'.format(len(posts_data)))
    logging.info('[!] Saving.')
    save_data(posts_data)
```
Running the Script
You can run the script with the following command in your terminal or CMD:
```
$ python facebook_profile_scraper.py
```
After it completes, you will have a JSON file containing the extracted data:
```json
[
    {
        "url": "/story.php?story_fbid=1201918583328686&id=826604640860084&refid=17&_ft_=mf_story_key.1201918583328686%3Atop_level_post_id.1201918583328686%3Atl_objid.1201918583328686%3Acontent_owner_id_new.826604640860084%3Athrowback_story_fbid.1201918583328686%3Apage_id.826604640860084%3Aphoto_attachments_list.%5B1201918319995379%2C1201918329995378%2C1201918396662038%2C1201918409995370%5D%3Astory_location.4%3Astory_attachment_style.album%3Apage_insights.%7B%22826604640860084%22%3A%7B%22page_id%22%3A826604640860084%2C%22actor_id%22%3A826604640860084%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22psn%22%3A%22EntStatusCreationStory%22%2C%22post_context%22%3A%7B%22object_fbtype%22%3A266%2C%22publish_time%22%3A1573226077%2C%22story_name%22%3A%22EntStatusCreationStory%22%2C%22story_fbid%22%3A%5B1201918583328686%5D%7D%2C%22role%22%3A1%2C%22sl%22%3A4%2C%22targets%22%3A%5B%7B%22actor_id%22%3A826604640860084%2C%22page_id%22%3A826604640860084%2C%22post_id%22%3A1201918583328686%2C%22role%22%3A1%2C%22share_id%22%3A0%7D%5D%7D%7D%3Athid.826604640860084%3A306061129499414%3A2%3A0%3A1575187199%3A3518174746269382888&__tn__=%2AW-R#footer_action_list",
        "text": [
            "'Cute moments like these r my weakness'",
            "' Follow our insta page: '",
            "'https://'",
            "'instagram.com/'",
            "'_disquieting_'"
        ],
        "media_url": "/Disquietingg/?refid=52&_ft_=mf_story_key.1201918583328686%3Atop_level_post_id.1201918583328686%3Atl_objid.1201918583328686%3Acontent_owner_id_new.826604640860084%3Athrowback_story_fbid.1201918583328686%3Apage_id.826604640860084%3Aphoto_attachments_list.%5B1201918319995379%2C1201918329995378%2C1201918396662038%2C1201918409995370%5D%3Astory_location.9%3Astory_attachment_style.album%3Apage_insights.%7B%22826604640860084%22%3A%7B%22page_id%22%3A826604640860084%2C%22actor_id%22%3A826604640860084%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22psn%22%3A%22EntStatusCreationStory%22%2C%22post_context%22%3A%7B%22object_fbtype%22%3A266%2C%22publish_time%22%3A1573226077%2C%22story_name%22%3A%22EntStatusCreationStory%22%2C%22story_fbid%22%3A%5B1201918583328686%5D%7D%2C%22role%22%3A1%2C%22sl%22%3A9%2C%22targets%22%3A%5B%7B%22actor_id%22%3A826604640860084%2C%22page_id%22%3A826604640860084%2C%22post_id%22%3A1201918583328686%2C%22role%22%3A1%2C%22share_id%22%3A0%7D%5D%7D%7D&__tn__=C-R",
        "comments": [
            {
                "text": [
                    "Diana Vanessa",
                    " darling ",
                    "\u2764\ufe0f"
                ],
                "profile_name": "Zeus Alejandro",
                "profile_url": "/ZeusAlejandroXd"
            },
            {
                "text": [
                    "Ema Yordanova",
                    " my love ",
                    "<3"
                ],
                "profile_name": "Sam Mihov",
                "profile_url": "/darknessBornFromLight"
            },
            ...
            {
                "text": [
                    "Your one and only sunshine ;3"
                ],
                "profile_name": "Edgar G\u00f3mez S\u00e1nchez",
                "profile_url": "/edgar.gomezsanchez.7"
            }
        ]
    }
]
```
Conclusion
This may seem like a simple script, but it has its tricks to master; you need experience with several subjects, such as regular expressions, requests and BeautifulSoup. We hope you have learned more about scraping from this post. As practice, you can try to extract the same data using different selectors, or even extract the number of reactions each post has.
Hello! My name is Oswaldo; I’m a Mathematics student from Venezuela. I’m a Python programmer interested in Web Scraping, Machine learning and Mobile Development.
I like maths, coding and problem solving!