Selenium: Scraping Booking.com Accommodations

Booking.com is a travel fare aggregator and metasearch engine for lodging reservations. The website has more than 29,094,365 listings in 230 countries and territories worldwide.

Websites like Booking.com contain a lot of data that can be scraped and many processes that can be automated.

In this Selenium tutorial, we will learn how to automate an accommodation search and scrape the results using Python with Selenium.

We could use the Booking API for all of this, but in this tutorial we want to help you learn Selenium in a practical way, so you can build something useful and learn at the same time.

Let’s start working!

Prepare Workspace

For this tutorial, we will be using Python 3.7.1 and Selenium. You will also need Firefox or Google Chrome in order to run the Selenium WebDriver.

Create Virtual Environment

Although it is optional, we recommend creating a virtual environment for this project using virtualenv.
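A possible setup (pip installs virtualenv; env is just the name we give the environment):

pip install virtualenv
mkdir bookingSelenium
cd bookingSelenium
virtualenv env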

 

Inside your bookingSelenium folder, activate the virtual environment (assuming it was created as env, as above) using:
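source env/bin/activate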

 

Install Selenium

As we said, we’ll need Selenium for this project. You can install it with any Python package manager, such as pip:
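pip install selenium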

Scraping Process

There are several ways in which you can scrape websites; since we are working with Selenium, we can handle JavaScript on the pages and scrape them in a very direct way. These are the steps we’ll be taking in this scraping process:

  1. Let your Selenium WebDriver navigate to the domain (booking.com).
  2. Perform a search on the main page with the parameters that the script receives.
  3. When the search results are ready, scrape all the data from those links.
  4. When you reach the number of results needed, stop scraping and export the results to JSON format.

 

Prepare WebDriver

What is WebDriver?

Selenium is a browser automation tool that controls web browser instances and makes it easy to do repetitive tasks. The Python Selenium API has the WebDriver class, which lets you write the instructions for the browser in Python.

A WebDriver object is just a Python object that is linked to a browser process and lets the programmer control the browser’s state through Python code.
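For instance, a minimal sketch (it assumes geckodriver is already reachable, which we set up next):

from selenium.webdriver import Firefox

driver = Firefox()  # starts a Firefox process linked to this object
driver.get('https://www.booking.com')  # drive the browser from Python
print(driver.title)  # read the browser state back into Python
driver.quit()  # end the browser process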

 

Download GeckoDriver

What are Gecko and GeckoDriver? Gecko is a web browser engine used in some browsers such as Firefox. GeckoDriver acts as the link between your scripts in Selenium and the Firefox browser.

Download the GeckoDriver build compatible with your operating system at: https://github.com/mozilla/geckodriver/releases

If you are on an Arch Linux-derived distribution, you can use your package manager to install the geckodriver package:
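# e.g. with pacman, assuming the package is available in your repositories
sudo pacman -S geckodriver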

After the installation, you need to know where your geckodriver is located. On Linux you can use the which command to find the location of any script or program on your system:
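which geckodriver
# prints e.g. /usr/bin/geckodriver (the exact path depends on your installation)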

This location needs to be on the system PATH in order to use geckodriver:
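# adjust the folder to wherever your geckodriver lives
export PATH="$PATH:/path/to/geckodriver-folder"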

Now we can use geckodriver in our script.

Import Required Classes

To use the Selenium WebDriver class, import these classes from the selenium package:
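from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait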

Let’s explain the Python packages needed to use the Selenium API one by one:

  • from selenium.webdriver import Firefox specifies that the browser you want to automate will be an instance of the Firefox web browser. To use a Chrome browser, import Chrome from selenium.webdriver instead.
  • from selenium.webdriver.common.by import By helps you locate elements in a webpage by tag name, class name, CSS selector, or XPath.
  • from selenium.webdriver.firefox.options import Options can hold a list of arguments that will be passed to your Firefox WebDriver.
  • from selenium.webdriver.support import expected_conditions as EC lets you define the conditions your browser should wait for.
  • from selenium.webdriver.support.wait import WebDriverWait  allows you to define implicit and explicit waits.

 

1. Browse Website with Selenium WebDriver

• Headless: In this tutorial, we will use our browser in headless mode; this way the browser runs normally but without any visible graphical user interface components. Though not useful for surfing the web, it comes into its own with automation.

In order for Firefox to run in headless mode, we’ll need to create an Options object and add the -headless argument to it:
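options = Options()
options.add_argument('-headless')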

• GeckoDriver Path: Specify the GeckoDriver location (which you downloaded in the Prepare WebDriver section of this tutorial) by passing it in the executable_path argument. With this, we’ll have our WebDriver ready and waiting for instructions.
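A sketch, using the example path from the which output above:

driver = Firefox(executable_path='/usr/bin/geckodriver', options=options)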

• URL: Our Selenium WebDriver object is just like a normal browser, so it can do everything a normal browser does. One of the most common tasks is visiting a URL, and it can be performed with just one line of code:
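driver.get('https://www.booking.com')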

The get()  method tells our WebDriver to visit a URL and nothing more.

Our WebDriver will be visiting booking.com and from there we’ll start the scraping process.

• Wait: We need to wait until the main page’s search bar is available before continuing; for that we’re using the WebDriverWait class, which defines an explicit wait in our WebDriver. How do we know that we need to wait for the element with the ID ss? Since we need the search bar ready to make the search, that’s the element we tell our WebDriver to wait for. We know that it has the ss ID because we performed a simple “Inspect element” on it (this is explained in detail below).
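A sketch of that explicit wait:

WebDriverWait(driver, timeout=10).until(
    EC.presence_of_element_located((By.ID, 'ss')))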

 

2. Perform a Search on the Booking.com Homepage

How do we tell our WebDriver where to click or where to insert text? The WebDriver class has the find_element functions, which allow us to find any element inside the current page.

There are several different functions depending on the way we look for elements in the page. We can look for elements using their class name, ID, tag name, XPath selector, link text, partial link text, name, or CSS selector.

  • find_element_by_id
  • find_element_by_name
  • find_element_by_xpath
  • find_element_by_link_text
  • find_element_by_partial_link_text
  • find_element_by_tag_name
  • find_element_by_class_name
  • find_element_by_css_selector

To know which of these functions is best to use, we need to take a look at the page’s HTML code. This will tell us the most precise way to find the element we want.

We just need to right-click the search form and select “Inspect element”. This will open the element’s HTML code in the inspector.

The highlighted line is the search form’s HTML code. Stripped down to the part that matters (attributes abbreviated), it looks roughly like this:
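<input type="search" id="ss" name="ss" ... />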

Here we can see that the element’s ID is ss. Knowing this, we can use one of the find_element functions to tell our WebDriver which element it needs to locate.

Let’s define a function that will perform a search on the Booking.com main page (the function name and signature in the sketch below are our own; we’ll review the body line by line afterwards):
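def search(driver, search_argument):
    # the individual lines below are explained one by one next
    search_field = driver.find_element_by_id('ss')
    search_field.send_keys(search_argument)
    driver.find_element_by_class_name('sb-searchbox__button').click()
    wait = WebDriverWait(driver, timeout=10).until(
        EC.presence_of_all_elements_located(
            (By.CLASS_NAME, 'sr-hotel__title')))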

We pass our WebDriver and a string as arguments; this string is the city in which we want to look for accommodations. Let’s review each line inside this function:

  • search_field = driver.find_element_by_id('ss') : this line uses the driver.find_element_by_id() function, which looks for an element with the ID passed as an argument; since IDs are unique, it returns the search bar.
  • search_field.send_keys(search_argument) : since we already have the search bar selected, we have to tell our WebDriver to type some text into it. Here we use the send_keys(string) function, which takes a string as an argument and types it into the search form.
  • driver.find_element_by_class_name('sb-searchbox__button').click() : having typed the city we want to search for into the search bar, we need to click the “Search” button to perform the search. Here we use the find_element_by_class_name() function, which receives a string representing the class of the element we’re looking for, and then we call the click() function, which simply performs a click on the selected element.
  • wait = WebDriverWait(driver, timeout=10).until(EC.presence_of_all_elements_located( (By.CLASS_NAME, 'sr-hotel__title'))) : here we are telling our WebDriver to wait until the elements with the class name sr-hotel__title (the ones containing the accommodation titles) appear.

After this function completes, our WebDriver will be on the search results page, ready for the next step.

3. Scrape the Results

Since we have already performed our search, we can start visiting each hotel link and extracting the data we need.

For the accommodations we’ll be extracting:

  • Name
  • Location
  • Popular Facilities
  • Review Score

We’ll create a function that will extract a predetermined number of accommodation links and then scrape the data we want from them.

Let’s define two more functions: one will extract the links and the other will scrape the data from each link.

Extract Accommodation Links

We need to know how to extract all of the accommodation links from the search results page. Fortunately, Selenium has the find_elements functions, which work just like the find_element functions but find all the elements with the specified feature instead of just one.

Their syntax is very similar; the only word that changes is “element”, which becomes “elements”:

  • find_elements_by_name
  • find_elements_by_xpath
  • find_elements_by_link_text
  • find_elements_by_partial_link_text
  • find_elements_by_tag_name
  • find_elements_by_class_name
  • find_elements_by_css_selector

There is no find_elements_by_id() function, since IDs are unique and there can’t be two elements with the same ID.

Using these functions we can now extract the accommodation links from the search results page. Inspecting one of the accommodation titles, we find out that they all share a common class: sr-hotel__title.

We can use find_elements_by_class_name to select all of the h3 elements and then find the anchor tag inside each one to extract the accommodation URL:

accommodations_titles = driver.find_elements_by_class_name('sr-hotel__title')
That line returns a list of h3 elements; for each of them we’ll use a find_element function again to find the anchor tag inside it and extract its href attribute, as sketched below:
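accommodation_urls = []
for title in accommodations_titles[:num_results]:
    # one way to reach the anchor: a tag-name lookup inside each h3;
    # num_results caps how many links we keep, as explained next
    anchor = title.find_element_by_tag_name('a')
    accommodation_urls.append(anchor.get_attribute('href'))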

We are going to scrape just the number of results passed as an argument to our function. The scrape_accommodation_data(url) function will visit the accommodation link, extract the data we want, and return it as a Python dictionary.

Scrape Data from Accommodation Links

As we said earlier, the data that we’re going to scrape from each accommodation is the following:

  • Name
  • Location
  • Popular Facilities
  • Review Score

We will need to use the find_element  and find_elements  functions in order to achieve this.

First we need to create a Python dictionary so we can store the data there.

Then we tell our WebDriver to visit the accommodation URL:
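import time  # for the fixed pause explained below

accommodation_fields = {}  # the dictionary that will hold the scraped data
driver.get(url)
time.sleep(10)  # give the page time to load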

Here we use time.sleep(10) to tell Python to wait 10 seconds so the webpage can load correctly. We could use WebDriverWait, but we’re going to scrape several similar pages, and the element WebDriverWait would watch for may be ready before the rest of the page has loaded, so a fixed pause from the time library is the better option here.

The next code is the one we’re going to use to extract each piece of information we want from the accommodations:
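# The important_facility class name is quoted in the bullets below; the other
# IDs and class names here are assumptions from inspecting the page, so
# verify them in your own browser with "Inspect element".
accommodation_fields['name'] = driver.find_element_by_id(
    'hp_hotel_name').text.strip('Hotel').strip()    # ID assumed

score_badge = driver.find_element_by_class_name('bui-review-score')  # outer element, class assumed
accommodation_fields['score'] = score_badge.find_element_by_class_name(
    'bui-review-score__badge').text                 # inner element, class assumed

accommodation_fields['location'] = driver.find_element_by_id(
    'hp_address_subtitle').text                     # ID assumed

facilities = driver.find_elements_by_class_name('important_facility')
accommodation_fields['facilities'] = [f.text for f in facilities]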

 


  • To find the accommodation name we use the find_element_by_id(id) function; after that, we read its text attribute and strip the word “Hotel” from it.

  • The accommodation score is located in a kind of floating element. Here we use the find_element_by_class_name(class_name) function to find the outer element, and then an inner element, which is the one that contains the accommodation score.

  • The accommodation’s location value is just below its name; if we inspect the HTML code, we’ll find out that it has a unique ID that we can use to find it.

  • For the facilities we need to extract all the elements with the class name “important_facility”; that’s why we use the find_elements_by_class_name(class_name) function. We iterate over the list it returns and extract the text from each element.

Let’s see the complete code for this function:
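def scrape_accommodation_data(driver, url):
    # Visits one accommodation page and returns its data as a dictionary;
    # the selectors marked as assumed above apply here too.
    accommodation_fields = {}
    driver.get(url)
    time.sleep(10)  # let the page load fully
    accommodation_fields['name'] = driver.find_element_by_id(
        'hp_hotel_name').text.strip('Hotel').strip()
    score_badge = driver.find_element_by_class_name('bui-review-score')
    accommodation_fields['score'] = score_badge.find_element_by_class_name(
        'bui-review-score__badge').text
    accommodation_fields['location'] = driver.find_element_by_id(
        'hp_address_subtitle').text
    accommodation_fields['facilities'] = [
        f.text for f in driver.find_elements_by_class_name('important_facility')]
    return accommodation_fields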

Since we have all the functions we need for our scraping process, let’s tell our script the order in which they need to be executed:
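import json

if __name__ == '__main__':
    # prepare_driver() and extract_accommodation_urls() are our own names for
    # the setup and link-extraction steps sketched earlier; the city and the
    # number of results are just example values.
    driver = prepare_driver('https://www.booking.com')
    search(driver, 'Madrid')
    urls = extract_accommodation_urls(driver, num_results=10)
    data = [scrape_accommodation_data(driver, url) for url in urls]
    driver.quit()
    with open('booking_data.json', 'w') as f:
        json.dump(data, f, indent=2)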

Run the Script

Here we call all our functions and receive the data we want from the accommodations. Then, using the json Python module, we convert it into a JSON object and write it to a file.

Here we can change any of the functions’ parameters if we want: we can search for another city or fetch a different number of accommodations.

Complete Code
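Assembling all the sketches above into a single script. As before, the selectors marked as assumed should be verified against the live page, and the geckodriver path adjusted to your system:

import json
import time

from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait


def prepare_driver(url):
    # Headless Firefox bound to our geckodriver; waits for the search bar.
    options = Options()
    options.add_argument('-headless')
    driver = Firefox(executable_path='/usr/bin/geckodriver', options=options)
    driver.get(url)
    WebDriverWait(driver, timeout=10).until(
        EC.presence_of_element_located((By.ID, 'ss')))
    return driver


def search(driver, search_argument):
    # Type the city into the search bar, click Search, wait for results.
    search_field = driver.find_element_by_id('ss')
    search_field.send_keys(search_argument)
    driver.find_element_by_class_name('sb-searchbox__button').click()
    WebDriverWait(driver, timeout=10).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, 'sr-hotel__title')))


def extract_accommodation_urls(driver, num_results):
    # Collect the first num_results accommodation links from the results page.
    titles = driver.find_elements_by_class_name('sr-hotel__title')
    return [title.find_element_by_tag_name('a').get_attribute('href')
            for title in titles[:num_results]]


def scrape_accommodation_data(driver, url):
    # Visit one accommodation page and return its data as a dictionary.
    accommodation_fields = {}
    driver.get(url)
    time.sleep(10)  # let the page load fully
    accommodation_fields['name'] = driver.find_element_by_id(
        'hp_hotel_name').text.strip('Hotel').strip()      # ID assumed
    score_badge = driver.find_element_by_class_name('bui-review-score')  # class assumed
    accommodation_fields['score'] = score_badge.find_element_by_class_name(
        'bui-review-score__badge').text                   # class assumed
    accommodation_fields['location'] = driver.find_element_by_id(
        'hp_address_subtitle').text                       # ID assumed
    accommodation_fields['facilities'] = [
        f.text for f in driver.find_elements_by_class_name('important_facility')]
    return accommodation_fields


if __name__ == '__main__':
    driver = prepare_driver('https://www.booking.com')
    search(driver, 'Madrid')  # example city
    urls = extract_accommodation_urls(driver, num_results=10)
    data = [scrape_accommodation_data(driver, url) for url in urls]
    driver.quit()
    with open('booking_data.json', 'w') as f:
        json.dump(data, f, indent=2)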

After our script finishes its execution, we’ll have a booking_data.json file in our working folder.

We hope this tutorial has helped you learn more about Selenium, Python and scraping in general.
