Booking.com is a travel fare aggregator website and travel metasearch engine for lodging reservations. The website has more than 29,094,365 listings in 230 countries and territories worldwide.
Websites like Booking.com contain a lot of data that can be scraped and processes that can be automated.
In this Selenium tutorial, you will learn how to automate an accommodation search and scrape the results using Python with Selenium.
We could use the Booking API for all of this, but the goal of this tutorial is to help you learn Selenium in a practical way, so you can build something useful and learn at the same time.
Let’s start working!
Prepare Workspace
For this tutorial, we will be using Python 3.7.1 and Selenium. You will also need Firefox or Google Chrome in order to run the Selenium WebDriver.
Create Virtual Environment
Although it is optional, it is recommended that you create a virtual environment for this project using virtualenv:
```
virtualenv bookingSelenium
```
Inside your bookingSelenium folder, activate the virtual environment using:
```
source bin/activate
```
Install Selenium
All you need for this project is Selenium. You can install Selenium with any Python package manager, such as pip:
```
pip install selenium
```
Scraping Process
There are several ways in which you can scrape websites. Since we are working with Selenium, we can handle JavaScript on the pages and scrape them in a very direct way. Let’s look at the steps we will be taking in this scraping process:
- Let your Selenium WebDriver enter the domain (booking.com).
- Perform a search on the main page with the parameters that the script receives.
- When the search results are ready, scrape all the data from those links.
- When you reach the number of results needed, stop scraping and export the results to JSON format.
Prepare WebDriver
What is WebDriver?
Selenium is a browser automation tool that controls web browser instances and makes it easy to do repetitive tasks. The Python Selenium API has the WebDriver class, which helps you write instructions for the browser in Python.
A WebDriver object is just a Python object that is linked to a browser process and lets the programmer control the browser state through Python code.
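To make this concrete, here is a minimal sketch of the idea. It is not part of the tutorial’s script, and it assumes Firefox and geckodriver are already set up (covered in the next section):

```python
from selenium.webdriver import Firefox

# One WebDriver object controls one browser instance.
driver = Firefox()                  # launches a Firefox process
driver.get('https://example.com')   # the browser navigates like a user would
print(driver.title)                 # reads state back, e.g. the page title
driver.quit()                       # closes the browser process
```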
Download GeckoDriver
What are Gecko and GeckoDriver? Gecko is a web browser engine used in browsers such as Firefox. GeckoDriver acts as the link between your Selenium scripts and the Firefox browser.
Download the GeckoDriver build compatible with your operating system at: https://github.com/mozilla/geckodriver/releases
If you are on an Arch Linux derived distribution, you can use the package manager to install the geckodriver package:
```
sudo pacman -S geckodriver
```
After the installation, you need to know where geckodriver is located. On Linux, you can use the which command to find the location of any script or program on your system:
```
$ which geckodriver
/usr/bin/geckodriver
```
The directory containing geckodriver needs to be in the system PATH so the driver can be found:

```
export PATH=$PATH:/usr/bin
```
Now we can use the geckodriver in our script.
Import Required Classes
To use the Selenium WebDriver class, import these classes from the selenium package:
```python
import selenium
from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
```
Let’s go through the imports needed to use the Selenium API one by one:
- from selenium.webdriver import Firefox specifies that the browser you want to automate will be an instance of the Firefox web browser. To use Chrome instead, import Chrome from selenium.webdriver (see the sketch after this list).
- from selenium.webdriver.common.by import By helps you locate elements in a webpage by tag name, class name, CSS selector, XPath, and more.
- from selenium.webdriver.firefox.options import Options can hold a list of arguments that will be passed to your Firefox WebDriver.
- from selenium.webdriver.support import expected_conditions as EC allows you to define conditions for the browser to wait on.
- from selenium.webdriver.support.wait import WebDriverWait allows you to define implicit and explicit waits.
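As a side note, driving Chrome instead of Firefox looks roughly like this; a minimal sketch, assuming chromedriver is installed and reachable (it is not used anywhere else in this tutorial):

```python
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.options import Options

# Hypothetical Chrome equivalent of the Firefox setup used in this tutorial.
options = Options()
options.add_argument('--headless')  # run Chrome without a visible window
driver = Chrome(executable_path='chromedriver', options=options)
```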
1. Browse Website with Selenium WebDriver
```python
def prepare_driver(url):
    '''Returns a Firefox Webdriver.'''
    options = Options()
    options.add_argument('-headless')
    driver = Firefox(executable_path='geckodriver', options=options)
    driver.get(url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.ID, 'ss')))
    return driver
```
• Headless: In this tutorial, we will use the browser in headless mode. This way the browser runs normally, but without any visible graphical user interface. Though not useful for surfing the web, it comes into its own with automation.
In order for Firefox to run in headless mode, we need to create an Options object and add the -headless argument to it.
• GeckoDriver Path: Specify the GeckoDriver location (the one you downloaded in the Prepare WebDriver section of this tutorial) by passing it in the executable_path argument. With this, our WebDriver is ready and waiting for instructions.
• URL: Our Selenium WebDriver object is just like a normal browser, so it can do everything a normal browser does. One of the most common tasks is visiting a URL, which can be done with a single line of code:
```python
driver.get(url)
```
The get() method tells our WebDriver to visit a URL and nothing more.
Our WebDriver will be visiting booking.com and from there we’ll start the scraping process.
• Wait: We need to wait until the main page’s search bar is available before continuing. For that we use the WebDriverWait class, which defines an explicit wait in our WebDriver. How do we know that we need to wait for the element with the ID ss? Well, since we need the search bar ready to make the search, that is the element we tell our WebDriver to wait for. We know that it has the ss ID because we performed a simple “Inspect element” on it (this will be explained in detail later).
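For reference, this is the difference between the two kinds of waits WebDriverWait was imported for, as a minimal sketch using the ss locator from this page:

```python
# Implicit wait: applies globally to every find_element* call on this driver.
driver.implicitly_wait(10)

# Explicit wait: blocks until one specific condition is met,
# or raises a TimeoutException after 10 seconds.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'ss')))
```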
2. Perform a Search in Booking.com Homepage
How can you tell your WebDriver where to click or where to insert text? The WebDriver class has the find_element functions, which allow you to find any element on the current page.
There are several different functions depending on how you look for elements on the page. You can look for elements using their class name, ID, tag name, XPath selector, link text, partial link text, name, or CSS selector:
- find_element_by_id
- find_element_by_name
- find_element_by_xpath
- find_element_by_link_text
- find_element_by_partial_link_text
- find_element_by_tag_name
- find_element_by_class_name
- find_element_by_css_selector
To know which of these functions is best to use, you need to take a look at the page’s HTML code. This will tell you the most precise way to find the element you want.
You just need to right-click on the search form and click “Inspect element”. This will lead you to the element’s HTML code.
The relevant line is the search form’s HTML code; let’s take a closer look:
```html
<input type="search" name="ss" id="ss"
       class="c-autocomplete__input sb-searchbox__input sb-destination__input"
       placeholder="Where are you going?" value="" autocomplete="off"
       data-component="search/destination/input-placeholder" data-sb-id="main"
       data-input="" aria-autocomplete="both" aria-label="Type your destination">
```
Here, you can see that the element’s ID is ss. Knowing this, you can use one of the find_element functions to tell your WebDriver which element it needs to locate.
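As an aside, the same element can usually be reached through several of the locator functions. A quick sketch using the search bar we just inspected:

```python
# Three equivalent ways to locate the same search bar.
search_by_id = driver.find_element_by_id('ss')
search_by_css = driver.find_element_by_css_selector('#ss')
search_by_xpath = driver.find_element_by_xpath('//input[@id="ss"]')
```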
Let’s define a function that will perform a search in the Booking.com main page:
```python
def fill_form(driver, search_argument):
    '''Receives a search_argument to insert it in the
    search bar and then clicks the search button.'''

    search_field = driver.find_element_by_id('ss')
    search_field.send_keys(search_argument)
    # We look for the search button and click it
    driver.find_element_by_class_name('sb-searchbox__button')\
        .click()
    wait = WebDriverWait(driver, timeout=10).until(
        EC.presence_of_all_elements_located(
            (By.CLASS_NAME, 'sr-hotel__title')))
```
Pass your WebDriver and a string as arguments; this string will be the city in which you want to look for accommodations. Let’s review each line inside this function:
- search_field = driver.find_element_by_id('ss') makes use of the driver.find_element_by_id() function, which looks for an element with the ID passed as an argument; as we know, IDs are unique, so it will return the search bar.
- search_field.send_keys(search_argument) since you already have the search bar selected, you have to tell your WebDriver to put some text in it. Here we use the send_keys(string) function, which takes a string as an argument and puts it in the search form.
- driver.find_element_by_class_name('sb-searchbox__button').click() once you have inserted the city you want to search for, you need to click the “Search” button to perform the search. Here we use the find_element_by_class_name() function, which receives a string representing the class of the element we are looking for, and then we call the click() function, which simply performs a click on the selected element.
- wait = WebDriverWait(driver, timeout=10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, 'sr-hotel__title'))) here you are telling your WebDriver to wait until the elements with the class name sr-hotel__title (the one containing the accommodation titles) appear.
After this function completes, your WebDriver will be on the search results page for the city you searched.
3. Scrape the Results
Since you have already performed your search, you can start to visit each hotel link and extract the data you need.
For the accommodations we’ll be extracting:
- Name
- Location
- Popular Facilities
- Review Score
Create a function that will extract a predetermined number of accommodation links and then scrape the data you want from them.
Let’s define two more functions: one will extract the links and the other will scrape the data from each link.
Extract accommodation links
You need to know how to extract all of the accommodation links from the search results page. Fortunately, Selenium has the find_elements functions, which work just like the find_element functions but find all the elements with the specified feature instead of just one.
The syntax of the find_elements functions is very similar; the only word that changes is “element” to “elements”:
- find_elements_by_name
- find_elements_by_xpath
- find_elements_by_link_text
- find_elements_by_partial_link_text
- find_elements_by_tag_name
- find_elements_by_class_name
- find_elements_by_css_selector
A find_elements_by_id() function would rarely be useful, since IDs are meant to be unique; there should never be two elements with the same ID on a page.
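As a quick illustration, the find_elements functions return a plain Python list of elements, which may be empty if nothing matches; a small sketch using the class name from the search results page:

```python
# find_elements_* returns a list of WebElement objects (possibly empty).
titles = driver.find_elements_by_class_name('sr-hotel__title')
print(len(titles))        # how many matching elements were found
for title in titles:
    print(title.text)     # the visible text of each element
```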
Using the find_elements functions, you can now extract the accommodation links from the search results page. Inspecting one of the accommodation titles, we find that they all share a common class: sr-hotel__title.
```html
<h3 class="sr-hotel__title">
  <a class="hotel_name_link url" href="/hotel/es/ramblashotel.html?label=gen173nr-1FCAEoggI46AdIM1gEaPEBiAEBmAExuAEZyAEM2AEB6AEB-AECiAIBqAID&sid=1e154445674ab2efff732c570110d2bc&ucfs=1&srpvid=cda75884d26c016e&srepoch=1546950921&hpos=1&hapos=1&dest_id=-372490&dest_type=city&sr_order=popularity&from=searchresults;highlight_room=#hotelTmpl" target="_blank" rel="noopener">
    <span class="sr-hotel__name" data-et-click="">
      Ramblas Hotel
    </span>
    <span class="invisible_spoken">Opens in new window</span>
  </a>
</h3>
```
Use find_elements_by_class_name to select all of the h3 elements, and then find the anchor tag inside each one to extract the accommodation URL:
```python
accommodations_titles = driver.find_elements_by_class_name('sr-hotel__title')
```
```python
def scrape_results(driver, n_results):
    '''Returns the data from n_results amount of results.'''

    accommodations_urls = list()
    accommodations_data = list()

    # Get the accommodations links
    for accommodation_title in driver.find_elements_by_class_name('sr-hotel__title'):
        accommodations_urls.append(accommodation_title.find_element_by_class_name(
            'hotel_name_link').get_attribute('href'))

    # Scrape only the first n_results links
    for url in accommodations_urls[:n_results]:
        url_data = scrape_accommodation_data(driver, url)
        accommodations_data.append(url_data)

    return accommodations_data
```
This scrapes just the number of results passed as an argument to your function. Here, scrape_accommodation_data(driver, url) visits each accommodation link, extracts the data you want, and returns it as a Python dictionary.
Scrape Data from Accommodation Links
As we said earlier, the data that we’re going to scrape from each accommodation is the following:
- Name
- Location
- Popular Facilities
- Review Score
We will need to use the find_element and find_elements functions in order to achieve this.
First, create a Python dictionary so you can store the data there.
```python
accommodation_fields = dict()
```
Then, tell your WebDriver to visit the accommodation URL:

```python
driver.get(accommodation_url)
time.sleep(10)
```
Here, time.sleep(10) tells Python to wait 10 seconds so the webpage can load completely. We could use WebDriverWait, but we are going to scrape several similar pages where the elements WebDriverWait would watch for are always present, so the time library is the simpler option.
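If you prefer an explicit wait here anyway, a minimal sketch would look like this (it assumes the hp_hotel_name element used below is present on every accommodation page):

```python
# Alternative to time.sleep(10): wait until the hotel name element appears.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, 'hp_hotel_name')))
```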
The next code is what we will use to extract each piece of information we want from the accommodations:

```python
# Get the accommodation name
accommodation_fields['name'] = driver.find_element_by_id('hp_hotel_name')\
    .text.replace('Hotel', '').strip()

# Get the accommodation score
accommodation_fields['score'] = driver.find_element_by_class_name(
    'bui-review-score--end').find_element_by_class_name(
    'bui-review-score__badge').text

# Get the accommodation location
accommodation_fields['location'] = driver.find_element_by_id('showMap2')\
    .find_element_by_class_name('hp_address_subtitle').text

# Get the most popular facilities
accommodation_fields['popular_facilities'] = list()
facilities = driver.find_element_by_class_name('hp_desc_important_facilities')
for facility in facilities.find_elements_by_class_name('important_facility'):
    accommodation_fields['popular_facilities'].append(facility.text)
```
Let’s explain the code piece by piece:

- accommodation_fields['name'] = driver.find_element_by_id('hp_hotel_name').text.replace('Hotel', '').strip()

  To find the accommodation name, use the find_element_by_id(id) function, then read its text attribute and remove the word “Hotel” from it.

- accommodation_fields['score'] = driver.find_element_by_class_name('bui-review-score--end').find_element_by_class_name('bui-review-score__badge').text

  The accommodation score is located in a kind of floating element. Here, use the find_element_by_class_name(class_name) function to find the outer element, and then the inner element that actually contains the score.

- accommodation_fields['location'] = driver.find_element_by_id('showMap2').find_element_by_class_name('hp_address_subtitle').text

  The accommodation’s location is just below its name; if you inspect the HTML code, you will find that the containing element has a unique ID that you can use to find it.

- for facility in facilities.find_elements_by_class_name('important_facility'): accommodation_fields['popular_facilities'].append(facility.text)

  For the facilities, you need to extract all the elements with the class name important_facility; that is why we use the find_elements_by_class_name(class_name) function. We iterate over the list it returns and extract the text from each element.
Let’s see the complete code for this function:
```python
def scrape_accommodation_data(driver, accommodation_url):
    '''Visits an accommodation page and extracts the data.'''

    if driver is None:
        driver = prepare_driver(accommodation_url)

    driver.get(accommodation_url)
    time.sleep(12)

    accommodation_fields = dict()

    # Get the accommodation name
    accommodation_fields['name'] = driver.find_element_by_id('hp_hotel_name')\
        .text.replace('Hotel', '').strip()

    # Get the accommodation score
    accommodation_fields['score'] = driver.find_element_by_class_name(
        'bui-review-score--end').find_element_by_class_name(
        'bui-review-score__badge').text

    # Get the accommodation location
    accommodation_fields['location'] = driver.find_element_by_id('showMap2')\
        .find_element_by_class_name('hp_address_subtitle').text

    # Get the most popular facilities
    accommodation_fields['popular_facilities'] = list()
    facilities = driver.find_element_by_class_name('hp_desc_important_facilities')
    for facility in facilities.find_elements_by_class_name('important_facility'):
        accommodation_fields['popular_facilities'].append(facility.text)

    return accommodation_fields
```
Run the Script
Since you have all the functions you need for your scraping process, it is time to tell your script the order in which they need to be executed.
```python
if __name__ == '__main__':

    try:
        driver = prepare_driver(domain)
        fill_form(driver, 'Barcelona')
        accommodations_data = scrape_results(driver, 10)
        accommodations_data = json.dumps(accommodations_data, indent=4)
        with open('booking_data.json', 'w') as f:
            f.write(accommodations_data)
    finally:
        driver.quit()
```
Here, call all your functions to get the data you want from the accommodations. Then, using the json Python module, convert it to a JSON string and write it into a file.
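To visualize what json.dumps produces, here is a hypothetical record run through it; the values are illustrative, not real scraper output:

```python
import json

# Hypothetical scraped record; real values depend on the page contents.
accommodations_data = [
    {'name': 'Ramblas Hotel',
     'score': '8.4',
     'location': 'Barcelona, Spain',
     'popular_facilities': ['Free WiFi', 'Non-smoking rooms']},
]

# indent=4 pretty-prints the JSON string before it is written to disk.
print(json.dumps(accommodations_data, indent=4))
```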
Also, here you can change any of the function parameters if you want; you can search for another city or scrape a different number of accommodations.
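For example, changing the search could look like this (the city name and result count are arbitrary choices):

```python
fill_form(driver, 'Madrid')                      # search a different city
accommodations_data = scrape_results(driver, 5)  # scrape fewer results
```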
Complete Code of Selenium Web Scraping Tutorial
```python
import selenium
import json
import time

from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

domain = 'https://www.booking.com'


def prepare_driver(url):
    '''Returns a Firefox Webdriver.'''
    options = Options()
    # options.add_argument('-headless')
    driver = Firefox(executable_path='geckodriver', options=options)
    driver.get(url)
    wait = WebDriverWait(driver, 10).until(EC.presence_of_element_located(
        (By.ID, 'ss')))
    return driver


def fill_form(driver, search_argument):
    '''Inserts the search_argument in the search bar and clicks the search button.'''
    search_field = driver.find_element_by_id('ss')
    search_field.send_keys(search_argument)
    # We look for the search button and click it
    driver.find_element_by_class_name('sb-searchbox__button')\
        .click()
    wait = WebDriverWait(driver, timeout=10).until(
        EC.presence_of_all_elements_located(
            (By.CLASS_NAME, 'sr-hotel__title')))


def scrape_results(driver, n_results):
    '''Returns the data from n_results amount of results.'''

    accommodations_urls = list()
    accommodations_data = list()

    # Get the accommodations links
    for accommodation_title in driver.find_elements_by_class_name('sr-hotel__title'):
        accommodations_urls.append(accommodation_title.find_element_by_class_name(
            'hotel_name_link').get_attribute('href'))

    # Scrape only the first n_results links
    for url in accommodations_urls[:n_results]:
        url_data = scrape_accommodation_data(driver, url)
        accommodations_data.append(url_data)

    return accommodations_data


def scrape_accommodation_data(driver, accommodation_url):
    '''Visits an accommodation page and extracts the data.'''

    if driver is None:
        driver = prepare_driver(accommodation_url)

    driver.get(accommodation_url)
    time.sleep(12)

    accommodation_fields = dict()

    # Get the accommodation name
    accommodation_fields['name'] = driver.find_element_by_id('hp_hotel_name')\
        .text.replace('Hotel', '').strip()

    # Get the accommodation score
    accommodation_fields['score'] = driver.find_element_by_class_name(
        'bui-review-score--end').find_element_by_class_name(
        'bui-review-score__badge').text

    # Get the accommodation location
    accommodation_fields['location'] = driver.find_element_by_id('showMap2')\
        .find_element_by_class_name('hp_address_subtitle').text

    # Get the most popular facilities
    accommodation_fields['popular_facilities'] = list()
    facilities = driver.find_element_by_class_name('hp_desc_important_facilities')
    for facility in facilities.find_elements_by_class_name('important_facility'):
        accommodation_fields['popular_facilities'].append(facility.text)

    return accommodation_fields


if __name__ == '__main__':

    try:
        driver = prepare_driver(domain)
        fill_form(driver, 'Barcelona')
        accommodations_data = scrape_results(driver, 10)
        accommodations_data = json.dumps(accommodations_data, indent=4)
        with open('booking_data.json', 'w') as f:
            f.write(accommodations_data)
    finally:
        driver.quit()
```
After the script finishes its execution, you will have a booking_data.json file in your working folder.
I hope this tutorial has helped you learn more about Selenium, Python and web scraping in general.
Hello! My name is Oswaldo; I’m a Mathematics student from Venezuela. I’m a Python programmer interested in Web Scraping, Machine learning and Mobile Development.
I like maths, coding and problem solving!