Selenium: Web Scraping Booking.com Accommodations

Booking.com is a travel fare aggregator website and travel metasearch engine for lodging reservations. The website has more than 29,094,365 listings in 230 countries and territories worldwide.

Websites like Booking.com contain a lot of data that can be scraped and processes that can be automated.

In this Selenium tutorial, you will learn how to automate an accommodation search and scrape the results using Python with Selenium.

We could use the Booking API for this whole process, but this tutorial aims to help you learn Selenium in a practical way, so you can build something useful and learn at the same time.

Let’s start working!

Prepare Workspace

For this tutorial, we will be using Python 3.7.1 and Selenium. You will also need Firefox or Google Chrome in order to run the Selenium WebDriver.

Create Virtual Environment

Although it is optional, it is recommended that you create a virtual environment for this project using virtualenv.
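A minimal sketch, assuming virtualenv is already installed (pip install virtualenv) and using bookingSelenium as the project folder name:

    virtualenv bookingSelenium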

 

Inside your folder bookingSelenium, activate the virtual environment using:
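Assuming the environment was created in the folder itself, on Linux or macOS this looks like the following (on Windows, the script is Scripts\activate instead):

    source bin/activate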

 

Install Selenium

All you need for this project is Selenium. You can install Selenium with any Python package manager, such as pip:
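For example:

    pip install selenium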

Scraping Process

There are several ways in which you can scrape websites. Since we are working with Selenium, we can handle JavaScript on the pages and scrape them in a very direct way. Let’s see the steps we will take in this scraping process:

  1. Let your Selenium WebDriver enter the domain (booking.com).
  2. Perform a search on the main page with the parameters that the script receives.
  3. When the search results are ready, scrape all the data in those links.
  4. When you reach the number of results needed, stop the scraping and export those results to JSON format.

 

Prepare WebDriver

What is WebDriver?

Selenium is a browser automation tool that controls web browser instances and makes it easy to do repetitive tasks. The Python Selenium API has the WebDriver class, which helps you write the instructions for the browser in Python.

A WebDriver object is just a Python object linked to a browser process; it gives the programmer an easy way to control the browser state through Python code.

 

Download GeckoDriver

What is Gecko and GeckoDriver? Gecko is a web browser engine used in some browsers such as Firefox. GeckoDriver acts as the link between your scripts in Selenium and the Firefox browser.

Download the GeckoDriver compatible with your operating system at: https://github.com/mozilla/geckodriver/releases

If you are on an Arch Linux derived distribution, you can use your package manager to install the geckodriver package:
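For example, with pacman (the package name may vary between distributions):

    sudo pacman -S geckodriver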

After the installation you need to know where geckodriver is located. On Linux, you can use the which command to find the location of any script or program on your system:
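For example (the output path will vary between systems):

    which geckodriver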

This location needs to be in the system PATH for Selenium to be able to use geckodriver:
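A sketch, using a placeholder path; substitute the folder that which reported:

    export PATH="$PATH:/path/to/geckodriver-folder"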

Now we can use the geckodriver in our script.

Import Required Classes

To use the Selenium WebDriver class, import these classes from the selenium package:
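These are the imports used throughout this tutorial:

    from selenium.webdriver import Firefox
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait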

Let’s explain the Python packages needed to use Selenium API one by one:

  • from selenium.webdriver import Firefox specifies that the browser you want to automate will be an instance of the Firefox web browser. To use a Chrome browser, import Chrome from selenium.webdriver instead.
  • from selenium.webdriver.common.by import By helps you locate elements on a webpage by tag name, class name, CSS selector, XPath, and more.
  • from selenium.webdriver.firefox.options import Options can hold a list of arguments that will be passed to your Firefox WebDriver.
  • from selenium.webdriver.support import expected_conditions as EC allows you to define expected conditions for your browser to wait on.
  • from selenium.webdriver.support.wait import WebDriverWait allows you to define explicit waits.

 

1. Browse Website with Selenium WebDriver

• Headless: In this tutorial, we will use our browser in headless mode; this way the browser runs normally but without any visible graphical user interface components. Though not useful for surfing the web, headless mode comes into its own with automation.

In order for Firefox to run in headless mode, we’ll need to create an Options object and add the -headless argument to it.
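For example:

    options = Options()
    options.add_argument('-headless')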

• GeckoDriver Path: Specify the GeckoDriver location (which you downloaded in the Prepare WebDriver section of this tutorial) by passing it in the executable_path argument. With this, we’ll have our WebDriver ready and waiting for instructions.

• URL: Our Selenium WebDriver object is just like a normal browser, so it can do everything a normal browser does. One of the most common tasks is visiting a URL, and this can be performed with just one line of code:
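For example:

    driver.get('https://www.booking.com')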

The get()  method tells our WebDriver to visit a URL and nothing more.

Our WebDriver will be visiting booking.com and from there we’ll start the scraping process.

• Wait: We need to wait until the main page’s search bar is available before continuing; for that we use the WebDriverWait class, which defines an explicit wait in our WebDriver. How do we know that we need to wait for the element with the ID ss? Since we need the search bar ready to make the search, that’s the element we tell our WebDriver to wait for. We know that it has the ss ID because we performed a simple “Inspect element” on it (this will be explained in detail later).
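Putting this whole step together, a sketch (the function name prepare_driver is mine, and the geckodriver path is a placeholder):

    def prepare_driver(url):
        '''Returns a headless Firefox WebDriver, already on the given URL.'''
        options = Options()
        options.add_argument('-headless')
        driver = Firefox(executable_path='/usr/bin/geckodriver',
                         options=options)
        driver.get(url)
        # Explicit wait: block until the search bar (ID 'ss') is present
        WebDriverWait(driver, timeout=10).until(
            EC.presence_of_element_located((By.ID, 'ss')))
        return driver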

 

2. Perform a Search in Booking.com Homepage

How can you tell your WebDriver where to click or where to insert text? The WebDriver class has the find_element functions, which allow you to find any element inside the current page.

There are several different functions depending on the way you look for elements in the page. You can look for elements using their class name, ID, tag name, XPath selector, link text, partial link text, name or CSS selector:

  • find_element_by_id
  • find_element_by_name
  • find_element_by_xpath
  • find_element_by_link_text
  • find_element_by_partial_link_text
  • find_element_by_tag_name
  • find_element_by_class_name
  • find_element_by_css_selector

To know which of these functions is best to use, you need to take a look at the page’s HTML code. This will show you the most precise way to find the element you want.

You just need to right-click on the search form and click “Inspect element”. This will lead you to the element’s HTML code:

The highlighted line in the inspector is the search form’s HTML code; let’s take a closer look:
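It looks roughly like this (every attribute other than the ss ID is illustrative):

    <input type="search" id="ss" name="ss" placeholder="Where are you going?">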

Here, you can see that the element’s ID is ss. Knowing this, you can use one of the find_element functions to tell your WebDriver which element it needs to locate.

 

Let’s define a function that will perform a search from the Booking.com main page:
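A sketch, built from the lines reviewed below (the function name search_place is mine):

    def search_place(driver, search_argument):
        '''Searches for a city from the Booking.com main page.'''
        # Type the city into the search bar (element with ID 'ss')
        search_field = driver.find_element_by_id('ss')
        search_field.send_keys(search_argument)
        # Click the "Search" button
        driver.find_element_by_class_name('sb-searchbox__button').click()
        # Wait until the result titles are present
        wait = WebDriverWait(driver, timeout=10).until(
            EC.presence_of_all_elements_located(
                (By.CLASS_NAME, 'sr-hotel__title')))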

Pass your web driver and a string as arguments; this string will be the city in which you want to look for accommodations. Let’s review each line inside this function:

  • search_field = driver.find_element_by_id('ss') makes use of the driver.find_element_by_id() function, which looks for an element with the ID passed as an argument; since IDs are unique, it will return the search bar.
  • search_field.send_keys(search_argument) — now that you have the search bar selected, you have to tell your WebDriver to put some text in it. Here we use the send_keys(string) function, which takes a string as an argument and puts it in the search form.
  • driver.find_element_by_class_name('sb-searchbox__button').click() — once you have entered the city you want to search for, you need to click the “Search” button to perform the search. Here we use the find_element_by_class_name() function, which receives a string representing the class of the element we’re looking for, and then call the click() function, which simply performs a click on the selected element.
  • wait = WebDriverWait(driver, timeout=10).until(EC.presence_of_all_elements_located( (By.CLASS_NAME, 'sr-hotel__title'))) tells your WebDriver to wait until the elements with the class name sr-hotel__title (the ones containing the accommodation titles) appear.

After this function completes its process, your WebDriver will be on the search results page.

 

3. Scrape the Results

Since you have already performed your search, you can start to visit each hotel link and extract the data you need.

For the accommodations we’ll be extracting:

  • Name
  • Location
  • Popular Facilities
  • Review Score

Create a function that will extract a predetermined number of accommodation links and then scrape the data you want from them.

To do this, let’s define two other functions: one will extract the links and the other will scrape the data from each link.

 

Extract Accommodation Links

You need to know how to extract all of the accommodation links on the search results page. Fortunately, Selenium has the find_elements functions, which work just like the find_element functions but find all the elements with the specified feature instead of just one.

The syntax of  find_elements functions is very similar; the only word that changes is “element” to “elements”:

  • find_elements_by_name
  • find_elements_by_xpath
  • find_elements_by_link_text
  • find_elements_by_partial_link_text
  • find_elements_by_tag_name
  • find_elements_by_class_name
  • find_elements_by_css_selector

There is little use for a find_elements_by_id() search, since IDs are unique and there cannot be two elements with the same ID on a page.

Using the find_elements functions, you can now extract the accommodation links from the search results page. Inspecting one of the accommodation titles, we find out that they share a common class: sr-hotel__title.

 

Use find_elements_by_class_name to select all of the h3 title elements, and then find the anchor tag inside each one to extract the accommodation URL:
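For example:

    accommodation_titles = driver.find_elements_by_class_name(
        'sr-hotel__title')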

That line will return a list of h3 elements; for each element, find the anchor tag inside it and extract its href attribute:
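A sketch; here the anchor is located by its tag name, and amount stands for the number of results your function receives:

    links = []
    for title in accommodation_titles[:amount]:
        link = title.find_element_by_tag_name('a').get_attribute('href')
        links.append(link)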

Scrape just the number of results passed as an argument to your function, as in the sketch below. Here, scrape_accommodation_data() will visit each accommodation link and extract the data you want, returning it as a Python dictionary.
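A sketch of the link-extraction function (the name extract_accommodation_data is mine; scrape_accommodation_data is defined in the next section and receives the driver explicitly here):

    def extract_accommodation_data(driver, amount):
        '''Collects up to `amount` result links and scrapes each one.'''
        titles = driver.find_elements_by_class_name('sr-hotel__title')
        links = [title.find_element_by_tag_name('a').get_attribute('href')
                 for title in titles[:amount]]
        return [scrape_accommodation_data(driver, link) for link in links]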

 

Scrape Data from Accommodation Links

As we said earlier, the data that we’re going to scrape from each accommodation is the following:

  • Name
  • Location
  • Popular Facilities
  • Review Score

We will need to use the find_element and find_elements functions in order to achieve this.

 

First, create a Python dictionary so you can store the data there.
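For example (the variable name is mine):

    accommodation = {}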

 

Then, tell your WebDriver to visit the accommodation url:
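For example (this needs import time at the top of the script):

    driver.get(url)
    time.sleep(10)  # give the page time to load completely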

Here, use time.sleep(10) to tell Python to wait 10 seconds so the webpage can load correctly. We could use WebDriverWait, but we are going to scrape several similar pages whose elements will already be ready by then, so a fixed pause from the time library is the simpler option.

The next code is what we’re going to use to extract each piece of information we want from the accommodations:
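A sketch; apart from important_facility, the locator strings below are placeholders that you should confirm with “Inspect element”:

    # Name: read the heading text and strip the word 'Hotel' out of it
    name = driver.find_element_by_id('hp_hotel_name').text
    accommodation['name'] = name.replace('Hotel', '').strip()
    # Score: find the outer floating element, then the inner element
    # that holds the number
    score_box = driver.find_element_by_class_name('bui-review-score')
    accommodation['score'] = score_box.find_element_by_class_name(
        'bui-review-score__badge').text
    # Location: a unique ID just below the accommodation name
    accommodation['location'] = driver.find_element_by_id(
        'hp_address_subsection').text
    # Facilities: every element with the 'important_facility' class
    accommodation['facilities'] = [
        el.text for el in
        driver.find_elements_by_class_name('important_facility')]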

 

Let’s explain the code line by line:


  • To find the accommodation name, use the find_element_by_id(id) function; after that, read its text attribute and strip the word “Hotel” from it.

 


  • The accommodation score is located in a kind of floating element; here, use the find_element_by_class_name(class_name) function to find the outer element, and then an inner element that contains the accommodation score.

  • The accommodation’s location value is just below its name; if you inspect the HTML code, you will find out that it has a unique ID that you can use to find it.

 


  • For the facilities, you need to extract all the elements with the class name important_facility; that is why we use the find_elements_by_class_name(class_name) function. We iterate over the list it returns and extract the text from each element.

 

Let’s see the complete code for this function:
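A sketch, with the same placeholder locators as above and the driver passed in explicitly:

    def scrape_accommodation_data(driver, url):
        '''Visits one accommodation page and returns its data as a dict.'''
        accommodation = {}
        driver.get(url)
        time.sleep(10)  # let the page load completely
        name = driver.find_element_by_id('hp_hotel_name').text
        accommodation['name'] = name.replace('Hotel', '').strip()
        score_box = driver.find_element_by_class_name('bui-review-score')
        accommodation['score'] = score_box.find_element_by_class_name(
            'bui-review-score__badge').text
        accommodation['location'] = driver.find_element_by_id(
            'hp_address_subsection').text
        accommodation['facilities'] = [
            el.text for el in
            driver.find_elements_by_class_name('important_facility')]
        return accommodation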

 

Run the Script

Since you have all the functions you need for your scraping process, it is time to tell your script the order in which they need to be executed.

Here, call all your functions and receive the data you want from the accommodations. Then, using the json Python module, convert it into a JSON object and write it to a file.
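A sketch (the city and the number of accommodations are example parameters, and json needs to be imported):

    driver = prepare_driver('https://www.booking.com')
    search_place(driver, 'New York')
    accommodations = extract_accommodation_data(driver, amount=5)
    driver.quit()

    with open('booking_data.json', 'w') as f:
        json.dump(accommodations, f, indent=2)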

Also, here you can change any of the function parameters if you want; you can search for another city or a different number of accommodations.

Complete Code of Selenium Web Scraping Tutorial
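Here is a consolidated sketch of everything above. Apart from ss, sb-searchbox__button, sr-hotel__title and important_facility, the locator strings are placeholders to confirm with “Inspect element”, and the geckodriver path, function names, city and amount are examples:

    import json
    import time

    from selenium.webdriver import Firefox
    from selenium.webdriver.common.by import By
    from selenium.webdriver.firefox.options import Options
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.wait import WebDriverWait


    def prepare_driver(url):
        '''Returns a headless Firefox WebDriver, already on the given URL.'''
        options = Options()
        options.add_argument('-headless')
        driver = Firefox(executable_path='/usr/bin/geckodriver',
                         options=options)
        driver.get(url)
        WebDriverWait(driver, timeout=10).until(
            EC.presence_of_element_located((By.ID, 'ss')))
        return driver


    def search_place(driver, search_argument):
        '''Searches for a city from the Booking.com main page.'''
        search_field = driver.find_element_by_id('ss')
        search_field.send_keys(search_argument)
        driver.find_element_by_class_name('sb-searchbox__button').click()
        WebDriverWait(driver, timeout=10).until(
            EC.presence_of_all_elements_located(
                (By.CLASS_NAME, 'sr-hotel__title')))


    def extract_accommodation_data(driver, amount):
        '''Collects up to `amount` result links and scrapes each one.'''
        titles = driver.find_elements_by_class_name('sr-hotel__title')
        links = [title.find_element_by_tag_name('a').get_attribute('href')
                 for title in titles[:amount]]
        return [scrape_accommodation_data(driver, link) for link in links]


    def scrape_accommodation_data(driver, url):
        '''Visits one accommodation page and returns its data as a dict.'''
        accommodation = {}
        driver.get(url)
        time.sleep(10)  # let the page load completely
        name = driver.find_element_by_id('hp_hotel_name').text
        accommodation['name'] = name.replace('Hotel', '').strip()
        score_box = driver.find_element_by_class_name('bui-review-score')
        accommodation['score'] = score_box.find_element_by_class_name(
            'bui-review-score__badge').text
        accommodation['location'] = driver.find_element_by_id(
            'hp_address_subsection').text
        accommodation['facilities'] = [
            el.text for el in
            driver.find_elements_by_class_name('important_facility')]
        return accommodation


    if __name__ == '__main__':
        driver = prepare_driver('https://www.booking.com')
        search_place(driver, 'New York')
        accommodations = extract_accommodation_data(driver, amount=5)
        driver.quit()
        with open('booking_data.json', 'w') as f:
            json.dump(accommodations, f, indent=2)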

After the script finishes its execution, you will have a booking_data.json file in your working folder.

 

I hope this tutorial has helped you learn more about Selenium, Python and web scraping in general.

 
