Facebook is the biggest social network of our time, and it holds a wealth of valuable data that can be useful in many situations. Imagine being able to extract this data and use it as your project’s dataset.
In this tutorial, you are going to use Python to extract data from any Facebook profile or page. The data that you will be extracting from a predefined number of posts is:
- Post URL
- Post text
- Post media URL
You will also be extracting comments from posts, and from each comment:
- Profile name
- Profile URL
- Comment text
Of course, there is plenty more data that can be extracted from Facebook, but for this tutorial that will be enough.
Python Packages
For this tutorial, you will need the following third-party Python packages:
- requests
- bs4 (BeautifulSoup)
- lxml (the parser that BeautifulSoup will use)

The script also uses the re, json, time, logging and collections modules, which ship with Python’s standard library and don’t need to be installed.
Remember to install these packages in a Python virtual environment dedicated to this project alone; it is better practice.
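As a minimal setup sketch on Linux or macOS (on Windows the activation command differs):

```
$ python -m venv venv
$ source venv/bin/activate
$ pip install requests beautifulsoup4 lxml
```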
Scraping Facebook with Requests
As you may know, Facebook relies heavily on JavaScript, but the requests package does not render JavaScript; it only lets you make simple HTTP requests such as GET and POST.
Important: In this tutorial, you will be scraping and crawling the mobile version of Facebook since it will allow you to extract the needed data with simple requests.
How will the script crawl and scrape Facebook mobile?
First of all, you need to take into account exactly what the script will be doing. The script will:
- Receive a list of Facebook profile URLs from a file.
- Receive credentials from a file to log in.
- Log in using a Session object from the requests package.
- For each profile URL, extract data from a predefined number of posts.
The main function of the script will look like this:
```python
if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    base_url = 'https://mobile.facebook.com'
    session = requests.session()

    # Extract the credentials for the login and all of the profile URLs to scrape
    credentials = json_to_obj('credentials.json')
    profiles_urls = json_to_obj('profiles_urls.json')

    make_login(session, base_url, credentials)

    # Accumulate the posts scraped from every profile
    posts_data = []
    for profile_url in profiles_urls:
        posts_data.extend(crawl_profile(session, base_url, profile_url, 25))
    logging.info('[!] Scraping finished. Total: {}'.format(len(posts_data)))
    logging.info('[!] Saving.')
    save_data(posts_data)
```
You are using the logging package to emit log messages during the script’s execution so you know what it is actually doing.
Then you define a base_url that will be the Facebook mobile URL.
After extracting the input data from the files, you log in by calling the make_login function, which you will define shortly.
Then, for each profile URL in our input data, you scrape the data from a specific number of posts using the crawl_profile function.
Receiving the Input Data
As stated previously, the script needs to receive data from two different sources: a file containing profile URLs and another containing the credentials of a Facebook account for the login. Let’s define a function that will allow you to extract this data from JSON files:
```python
def json_to_obj(filename):
    """Extracts data from a JSON file and returns it as a Python object."""
    obj = None
    with open(filename) as json_file:
        obj = json.load(json_file)
    return obj
```
This function reads data formatted as JSON and converts it into a Python object.
The files profiles_urls.json and credentials.json are the ones that will contain the input data that the script needs.
profiles_urls.json:

```json
[
    "https://mobile.facebook.com/profileURL1/",
    "https://mobile.facebook.com/profileURL2"
]
```
credentials.json:

```json
{
    "email": "username@mail.com",
    "pass": "password"
}
```
You will need to replace these with the profile URLs you want to extract data from and with the credentials of the Facebook account used for the login.
Logging into Facebook
To log in, you will need to inspect the mobile version of the Facebook main page (mobile.facebook.com) to find the URL of the login form.
If you right-click the “Log In” button and inspect it, you can get to the form to which the credentials have to be sent:
The action URL of the form element with id="login_form" is the one you need for the login. Let’s define the function that will handle this task:
```python
def make_login(session, base_url, credentials):
    """Logs the given Session object in with the given credentials."""
    login_form_url = '/login/device-based/regular/login/?refsrc=https%3A'\
        '%2F%2Fmobile.facebook.com%2Flogin%2Fdevice-based%2Fedit-user%2F&lwv=100'
    params = {'email': credentials['email'], 'pass': credentials['pass']}
    while True:
        time.sleep(3)
        logged_request = session.post(base_url + login_form_url, data=params)
        if logged_request.ok:
            logging.info('[*] Logged in.')
            break
```
Using the action URL from the form element, you can make a POST request with Python’s requests package. If the response is OK, you have logged in successfully; otherwise, you wait a little and try again.
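Note that an OK response only confirms that the HTTP request succeeded, not necessarily that the credentials were accepted. As an optional sketch, you could also check the session cookies after the POST; in practice, Facebook sets a c_user cookie for authenticated sessions (an observed behavior, not a documented guarantee):

```python
def is_logged_in(session):
    """Heuristic login check: assumes Facebook sets a 'c_user' cookie
    for authenticated sessions (observed behavior, not guaranteed).
    """
    return 'c_user' in session.cookies.get_dict()
```

You could call this right after make_login and stop early if it returns False.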
Crawling a Facebook Profile/Page
Once you are logged in, you need to crawl the Facebook profile or page URL in order to extract its public posts.
```python
def crawl_profile(session, base_url, profile_url, post_limit):
    """Goes to the profile URL, crawls it and extracts posts URLs."""
    profile_bs = get_bs(session, profile_url)
    n_scraped_posts = 0
    scraped_posts = list()
    posts_id = None
    while n_scraped_posts < post_limit:
        # Pages keep their posts under id='recent'; personal profiles
        # use id='structured_composer_async_container'
        try:
            posts_id = 'recent'
            posts = profile_bs.find('div', id=posts_id).div.div.contents
        except Exception:
            posts_id = 'structured_composer_async_container'
            posts = profile_bs.find('div', id=posts_id).div.div.contents
        posts_urls = [a['href'] for a in profile_bs.find_all('a', text='Full Story')]
        for post_url in posts_urls:
            try:
                post_data = scrape_post(session, base_url, post_url)
                scraped_posts.append(post_data)
            except Exception as e:
                logging.info('Error: {}'.format(e))
            n_scraped_posts += 1
            if posts_completed(scraped_posts, post_limit):
                break
        if not posts_completed(scraped_posts, post_limit):
            # Follow the "Show more" link to load the next page of posts
            show_more_posts_url = profile_bs.find('div', id=posts_id).next_sibling.a['href']
            profile_bs = get_bs(session, base_url + show_more_posts_url)
            time.sleep(3)
        else:
            break
    return scraped_posts
```
First, you save the result of the get_bs function into the profile_bs variable. The get_bs function receives a Session object and a url variable:
```python
def get_bs(session, url):
    """Makes a GET request using the given Session object
    and returns a BeautifulSoup object.
    """
    r = None
    while True:
        r = session.get(url)
        time.sleep(3)
        if r.ok:
            break
    return BeautifulSoup(r.text, 'lxml')
```
The get_bs function makes a GET request using the Session object; if the response status is OK, it returns a BeautifulSoup object created from the response text.
Let’s break down this crawl_profile function:
- Once you have the profile_bs variable, you define variables for the number of posts scraped, the scraped posts, and the posts’ container id.
- Then you open a while loop that iterates as long as the n_scraped_posts variable is less than the post_limit variable.
- Inside this while loop, you try to find the HTML element that holds all of the posts. If the Facebook URL is a page, the posts will be in the element with id='recent', but if it is a person’s profile, the posts will be in the element with id='structured_composer_async_container'.
- Once you know the element that contains the posts, you can extract their URLs.
- Then, for each post URL you have discovered, you call the scrape_post function and append the result to the scraped_posts list.
- If you have reached the predefined number of posts, you break out of the while loop (see the usage sketch after this list).
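As a quick usage sketch, assuming the session is already logged in and the remaining helper functions are defined, scraping ten posts could look like this (the page URL is made up for the example):

```python
# 'SomePage' is a hypothetical page name used only for illustration
posts = crawl_profile(session, base_url, 'https://mobile.facebook.com/SomePage/', 10)
print('Scraped {} posts'.format(len(posts)))
```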
Scraping Data from Facebook Posts
Now let’s take a look at the function that performs the real scraping:
```python
def scrape_post(session, base_url, post_url):
    """Goes to the post URL and extracts the post data."""
    post_data = OrderedDict()

    post_bs = get_bs(session, base_url + post_url)
    time.sleep(5)

    # Here we populate the OrderedDict object
    post_data['url'] = post_url

    try:
        post_text_element = post_bs.find('div', id='u_0_0').div
        string_groups = [p.strings for p in post_text_element.find_all('p')]
        strings = [repr(string) for group in string_groups for string in group]
        post_data['text'] = strings
    except Exception:
        post_data['text'] = []

    try:
        post_data['media_url'] = post_bs.find('div', id='u_0_0').find('a')['href']
    except Exception:
        post_data['media_url'] = ''

    try:
        post_data['comments'] = extract_comments(session, base_url, post_bs, post_url)
    except Exception:
        post_data['comments'] = []

    return dict(post_data)
```
This function starts by creating an OrderedDict object that will hold the post data:
- Post URL
- Post text
- Post media URL
- Comments
First, you need the post’s HTML in a BeautifulSoup object, so you use the get_bs function for that.
Since you already know the post URL at this point, you just add it to the post_data object.
To extract the post text, you need to find the post’s main element, as follows:
```python
try:
    post_text_element = post_bs.find('div', id='u_0_0').div
    string_groups = [p.strings for p in post_text_element.find_all('p')]
    strings = [repr(string) for group in string_groups for string in group]
    post_data['text'] = strings
except Exception:
    post_data['text'] = []
```
You look for the div that contains all the text; since this element can hold several <p> tags with text, you iterate over all of them and extract their strings.
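To illustrate how .strings flattens the text of nested tags, here is a minimal, self-contained sketch; the HTML is made up for the example:

```python
from bs4 import BeautifulSoup

# Hypothetical post markup, used only to show how .strings behaves
html = '<div><p>Cute moments <a href="#">like these</a></p><p>Follow us</p></div>'
post_text_element = BeautifulSoup(html, 'lxml').div

string_groups = [p.strings for p in post_text_element.find_all('p')]
strings = [repr(string) for group in string_groups for string in group]
print(strings)  # ["'Cute moments '", "'like these'", "'Follow us'"]
```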
After that, you extract the post media URL. Facebook posts can contain images, video, or just text:
```python
try:
    post_data['media_url'] = post_bs.find('div', id='u_0_0').find('a')['href']
except Exception:
    post_data['media_url'] = ''
```
Finally, you call the extract_comments function to extract the remaining data:
```python
try:
    post_data['comments'] = extract_comments(session, base_url, post_bs, post_url)
except Exception:
    post_data['comments'] = []
```
Extracting Facebook Comments
This function is the largest in this tutorial. Here, you iterate in a while loop until there are no more comments to extract:
```python
def extract_comments(session, base_url, post_bs, post_url):
    """Extracts all comments from a post."""
    comments = list()
    show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))['href']
    first_comment_page = True
    logging.info('Scraping comments from {}'.format(post_url))
    while True:
        logging.info('[!] Scraping comments.')
        time.sleep(3)
        if first_comment_page:
            first_comment_page = False
        else:
            post_bs = get_bs(session, base_url + show_more_url)
            time.sleep(3)
        try:
            comments_elements = post_bs.find('div', id=re.compile('composer')).next_sibling\
                .find_all('div', id=re.compile(r'^\d+'))
        except Exception:
            # If the comments container is missing, treat it as no comments
            comments_elements = []
        if len(comments_elements) != 0:
            logging.info('[!] There are comments.')
        else:
            break
        for comment in comments_elements:
            comment_data = OrderedDict()
            comment_data['text'] = list()
            try:
                comment_strings = comment.find('h3').next_sibling.strings
                for string in comment_strings:
                    comment_data['text'].append(string)
            except Exception:
                pass
            try:
                media = comment.find('h3').next_sibling.next_sibling.children
                if media is not None:
                    for element in media:
                        comment_data['media_url'] = element['src']
                else:
                    comment_data['media_url'] = ''
            except Exception:
                pass
            comment_data['profile_name'] = comment.find('h3').a.string
            comment_data['profile_url'] = comment.find('h3').a['href'].split('?')[0]
            comments.append(dict(comment_data))
        show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))
        if 'View more' in show_more_url.text:
            logging.info('[!] More comments.')
            show_more_url = show_more_url['href']
        else:
            break
    return comments
```
You need to know whether you are extracting the first page of comments or one of the following pages, so you define a first_comment_page variable initialized to True.
You then check whether there is a “View more comments” link; this will tell you whether to keep iterating over the loop or not:
```python
show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))['href']
```
In the main loop of the function, you first check the value of first_comment_page: if it is True, you extract the comments from the current page; otherwise, you make a request to the “View more comments” URL:
```python
if first_comment_page:
    first_comment_page = False
else:
    post_bs = get_bs(session, base_url + show_more_url)
    time.sleep(3)
```
After this, you select all the HTML elements that contain the comments. If you right-click any comment and inspect it, you will see that each comment sits inside a div with a 17-digit numeric ID:
Knowing this, you can select all of these elements as follows:
```python
try:
    comments_elements = post_bs.find('div', id=re.compile('composer')).next_sibling\
        .find_all('div', id=re.compile(r'^\d+'))
except Exception:
    # If the comments container is missing, treat it as no comments
    comments_elements = []

if len(comments_elements) != 0:
    logging.info('[!] There are comments.')
else:
    break
```
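To see the id=re.compile(r'^\d+') selector in isolation, here is a minimal sketch; the markup is made up for the example, mimicking the numeric-ID divs that hold the comments:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: a composer div followed by the comments container
html = ('<div id="composer"></div>'
        '<div><div id="12345678901234567">first comment</div>'
        '<div id="98765432109876543">second comment</div>'
        '<div id="footer">not a comment</div></div>')
soup = BeautifulSoup(html, 'lxml')

comments_elements = soup.find('div', id=re.compile('composer')).next_sibling \
    .find_all('div', id=re.compile(r'^\d+'))
print([c.get_text() for c in comments_elements])  # ['first comment', 'second comment']
```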
If no elements are found, there are no more comments to extract. Now, for each comment, you create an OrderedDict object where you will save all of that comment’s data:
```python
for comment in comments_elements:
    comment_data = OrderedDict()
    comment_data['text'] = list()
    try:
        comment_strings = comment.find('h3').next_sibling.strings
        for string in comment_strings:
            comment_data['text'].append(string)
    except Exception:
        pass
    try:
        media = comment.find('h3').next_sibling.next_sibling.children
        if media is not None:
            for element in media:
                comment_data['media_url'] = element['src']
        else:
            comment_data['media_url'] = ''
    except Exception:
        pass
    comment_data['profile_name'] = comment.find('h3').a.string
    comment_data['profile_url'] = comment.find('h3').a['href'].split('?')[0]
    comments.append(dict(comment_data))
```
Inside this loop, you extract the comment text by looking for the HTML element that contains it. As with the post text, you need to find all the elements that contain strings and add each string to a list:
```python
try:
    comment_strings = comment.find('h3').next_sibling.strings
    for string in comment_strings:
        comment_data['text'].append(string)
except Exception:
    pass
```
Next, you need the media URL:
```python
try:
    media = comment.find('h3').next_sibling.next_sibling.children
    if media is not None:
        for element in media:
            comment_data['media_url'] = element['src']
    else:
        comment_data['media_url'] = ''
except Exception:
    pass
```
After getting this data, you need the profile name and profile URL, which you can find as follows:
```python
comment_data['profile_name'] = comment.find('h3').a.string
comment_data['profile_url'] = comment.find('h3').a['href'].split('?')[0]
```
Once you have all the data you can get from a comment, you add it to the list of comments. Next, you check whether there is a “View more comments” link:
```python
show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))
if 'View more' in show_more_url.text:
    logging.info('[!] More comments.')
    show_more_url = show_more_url['href']
else:
    break
```
The loop extracting the comments stops when it cannot find any more comments, and the loop extracting the post data stops once it reaches the post limit you gave it.
Complete Code
```python
import requests
import re
import json
import time
import logging
from collections import OrderedDict
from bs4 import BeautifulSoup


def get_bs(session, url):
    """Makes a GET request using the given Session object
    and returns a BeautifulSoup object.
    """
    r = None
    while True:
        r = session.get(url)
        time.sleep(3)
        if r.ok:
            break
    return BeautifulSoup(r.text, 'lxml')


def make_login(session, base_url, credentials):
    """Logs the given Session object in with the given credentials."""
    login_form_url = '/login/device-based/regular/login/?refsrc=https%3A'\
        '%2F%2Fmobile.facebook.com%2Flogin%2Fdevice-based%2Fedit-user%2F&lwv=100'
    params = {'email': credentials['email'], 'pass': credentials['pass']}
    while True:
        time.sleep(3)
        logged_request = session.post(base_url + login_form_url, data=params)
        if logged_request.ok:
            logging.info('[*] Logged in.')
            break


def crawl_profile(session, base_url, profile_url, post_limit):
    """Goes to the profile URL, crawls it and extracts posts URLs."""
    profile_bs = get_bs(session, profile_url)
    n_scraped_posts = 0
    scraped_posts = list()
    posts_id = None
    while n_scraped_posts < post_limit:
        # Pages keep their posts under id='recent'; personal profiles
        # use id='structured_composer_async_container'
        try:
            posts_id = 'recent'
            posts = profile_bs.find('div', id=posts_id).div.div.contents
        except Exception:
            posts_id = 'structured_composer_async_container'
            posts = profile_bs.find('div', id=posts_id).div.div.contents
        posts_urls = [a['href'] for a in profile_bs.find_all('a', text='Full Story')]
        for post_url in posts_urls:
            try:
                post_data = scrape_post(session, base_url, post_url)
                scraped_posts.append(post_data)
            except Exception as e:
                logging.info('Error: {}'.format(e))
            n_scraped_posts += 1
            if posts_completed(scraped_posts, post_limit):
                break
        if not posts_completed(scraped_posts, post_limit):
            # Follow the "Show more" link to load the next page of posts
            show_more_posts_url = profile_bs.find('div', id=posts_id).next_sibling.a['href']
            profile_bs = get_bs(session, base_url + show_more_posts_url)
            time.sleep(3)
        else:
            break
    return scraped_posts


def posts_completed(scraped_posts, limit):
    """Returns True if the number of posts scraped from the profile
    has reached its limit.
    """
    return len(scraped_posts) == limit


def scrape_post(session, base_url, post_url):
    """Goes to the post URL and extracts the post data."""
    post_data = OrderedDict()

    post_bs = get_bs(session, base_url + post_url)
    time.sleep(5)

    # Here we populate the OrderedDict object
    post_data['url'] = post_url

    try:
        post_text_element = post_bs.find('div', id='u_0_0').div
        string_groups = [p.strings for p in post_text_element.find_all('p')]
        strings = [repr(string) for group in string_groups for string in group]
        post_data['text'] = strings
    except Exception:
        post_data['text'] = []

    try:
        post_data['media_url'] = post_bs.find('div', id='u_0_0').find('a')['href']
    except Exception:
        post_data['media_url'] = ''

    try:
        post_data['comments'] = extract_comments(session, base_url, post_bs, post_url)
    except Exception:
        post_data['comments'] = []

    return dict(post_data)


def extract_comments(session, base_url, post_bs, post_url):
    """Extracts all comments from a post."""
    comments = list()
    show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))['href']
    first_comment_page = True
    logging.info('Scraping comments from {}'.format(post_url))
    while True:
        logging.info('[!] Scraping comments.')
        time.sleep(3)
        if first_comment_page:
            first_comment_page = False
        else:
            post_bs = get_bs(session, base_url + show_more_url)
            time.sleep(3)
        try:
            comments_elements = post_bs.find('div', id=re.compile('composer')).next_sibling\
                .find_all('div', id=re.compile(r'^\d+'))
        except Exception:
            # If the comments container is missing, treat it as no comments
            comments_elements = []
        if len(comments_elements) != 0:
            logging.info('[!] There are comments.')
        else:
            break
        for comment in comments_elements:
            comment_data = OrderedDict()
            comment_data['text'] = list()
            try:
                comment_strings = comment.find('h3').next_sibling.strings
                for string in comment_strings:
                    comment_data['text'].append(string)
            except Exception:
                pass
            try:
                media = comment.find('h3').next_sibling.next_sibling.children
                if media is not None:
                    for element in media:
                        comment_data['media_url'] = element['src']
                else:
                    comment_data['media_url'] = ''
            except Exception:
                pass
            comment_data['profile_name'] = comment.find('h3').a.string
            comment_data['profile_url'] = comment.find('h3').a['href'].split('?')[0]
            comments.append(dict(comment_data))
        show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))
        if 'View more' in show_more_url.text:
            logging.info('[!] More comments.')
            show_more_url = show_more_url['href']
        else:
            break
    return comments


def json_to_obj(filename):
    """Extracts data from a JSON file and returns it as a Python object."""
    obj = None
    with open(filename) as json_file:
        obj = json.load(json_file)
    return obj


def save_data(data):
    """Saves the scraped data as JSON."""
    with open('profile_posts_data.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)

    base_url = 'https://mobile.facebook.com'
    session = requests.session()

    # Extract the credentials for the login and all of the profile URLs to scrape
    credentials = json_to_obj('credentials.json')
    profiles_urls = json_to_obj('profiles_urls.json')

    make_login(session, base_url, credentials)

    # Accumulate the posts scraped from every profile
    posts_data = []
    for profile_url in profiles_urls:
        posts_data.extend(crawl_profile(session, base_url, profile_url, 25))
    logging.info('[!] Scraping finished. Total: {}'.format(len(posts_data)))
    logging.info('[!] Saving.')
    save_data(posts_data)
```
Running the Script
You can run the script with the following command in your terminal or CMD:
```
$ python facebook_profile_scraper.py
```
After it completes, you will have a JSON file containing the extracted data:
```json
[
    {
        "url": "/story.php?story_fbid=1201918583328686&id=826604640860084&refid=17&_ft_=mf_story_key.1201918583328686%3Atop_level_post_id.1201918583328686%3Atl_objid.1201918583328686%3Acontent_owner_id_new.826604640860084%3Athrowback_story_fbid.1201918583328686%3Apage_id.826604640860084%3Aphoto_attachments_list.%5B1201918319995379%2C1201918329995378%2C1201918396662038%2C1201918409995370%5D%3Astory_location.4%3Astory_attachment_style.album%3Apage_insights.%7B%22826604640860084%22%3A%7B%22page_id%22%3A826604640860084%2C%22actor_id%22%3A826604640860084%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22psn%22%3A%22EntStatusCreationStory%22%2C%22post_context%22%3A%7B%22object_fbtype%22%3A266%2C%22publish_time%22%3A1573226077%2C%22story_name%22%3A%22EntStatusCreationStory%22%2C%22story_fbid%22%3A%5B1201918583328686%5D%7D%2C%22role%22%3A1%2C%22sl%22%3A4%2C%22targets%22%3A%5B%7B%22actor_id%22%3A826604640860084%2C%22page_id%22%3A826604640860084%2C%22post_id%22%3A1201918583328686%2C%22role%22%3A1%2C%22share_id%22%3A0%7D%5D%7D%7D%3Athid.826604640860084%3A306061129499414%3A2%3A0%3A1575187199%3A3518174746269382888&__tn__=%2AW-R#footer_action_list",
        "text": [
            "'Cute moments like these r my weakness'",
            "' Follow our insta page: '",
            "'https://'",
            "'instagram.com/'",
            "'_disquieting_'"
        ],
        "media_url": "/Disquietingg/?refid=52&_ft_=mf_story_key.1201918583328686%3Atop_level_post_id.1201918583328686%3Atl_objid.1201918583328686%3Acontent_owner_id_new.826604640860084%3Athrowback_story_fbid.1201918583328686%3Apage_id.826604640860084%3Aphoto_attachments_list.%5B1201918319995379%2C1201918329995378%2C1201918396662038%2C1201918409995370%5D%3Astory_location.9%3Astory_attachment_style.album%3Apage_insights.%7B%22826604640860084%22%3A%7B%22page_id%22%3A826604640860084%2C%22actor_id%22%3A826604640860084%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22psn%22%3A%22EntStatusCreationStory%22%2C%22post_context%22%3A%7B%22object_fbtype%22%3A266%2C%22publish_time%22%3A1573226077%2C%22story_name%22%3A%22EntStatusCreationStory%22%2C%22story_fbid%22%3A%5B1201918583328686%5D%7D%2C%22role%22%3A1%2C%22sl%22%3A9%2C%22targets%22%3A%5B%7B%22actor_id%22%3A826604640860084%2C%22page_id%22%3A826604640860084%2C%22post_id%22%3A1201918583328686%2C%22role%22%3A1%2C%22share_id%22%3A0%7D%5D%7D%7D&__tn__=C-R",
        "comments": [
            {
                "text": [
                    "Diana Vanessa",
                    " darling ",
                    "\u2764\ufe0f"
                ],
                "profile_name": "Zeus Alejandro",
                "profile_url": "/ZeusAlejandroXd"
            },
            {
                "text": [
                    "Ema Yordanova",
                    " my love ",
                    "<3"
                ],
                "profile_name": "Sam Mihov",
                "profile_url": "/darknessBornFromLight"
            },
            ...
            {
                "text": [
                    "Your one and only sunshine ;3"
                ],
                "profile_name": "Edgar G\u00f3mez S\u00e1nchez",
                "profile_url": "/edgar.gomezsanchez.7"
            }
        ]
    }
]
```
Conclusion
This may seem like a simple script, but it has its tricks to master; you need experience with several subjects, such as regular expressions, requests and BeautifulSoup. We hope you have learned more about scraping from this post. As practice, you can try to extract the same data using different selectors, or even extract the number of reactions each post has.
Hello! My name is Oswaldo; I’m a Mathematics student from Venezuela. I’m a Python programmer interested in Web Scraping, Machine learning and Mobile Development.
I like maths, coding and problem solving!