Logging in with Scrapy FormRequest

In this tutorial, you will learn how to use Scrapy to log into websites that require entering a username and password before showing specific pages.

If you are new to Scrapy, please make sure you study the introductory Scrapy tutorial to learn how to create a project and crawler and how to scrape web pages.

In this tutorial, we will use this website: quotes.toscrape.com (a demo website originally developed for learning purposes). As you can see, all pages of the site have a Login button that takes you to the /login page. On that login page, you can type in any combination of username and password, and when you press the Login button, you are redirected to the home page, which now shows a Logout button, meaning you are logged in.

On a real website, after you log in, you will have access to data, URLs and other pages that you cannot reach otherwise. This demo website does not restrict anything, but the login flow works the same way.

Analyzing Login Request

Now, let’s see how to log in using Scrapy. First of all, make sure you are logged out, open the Login page in your browser, Chrome or Firefox, right-click the page, select “Inspect”, and go to the “Network” tab, where you can analyze the traffic and see which requests your browser sends to the server while you log in.

There are two requests in this case, a POST and a GET. After you click the Login button, the POST request returns a 302 status, which means you are redirected from the login page to another page. Here is a screenshot from Chrome:

Click on this request and you will see several tabs, including “Headers”, the one you need. Scroll down to the “Form Data” section, where you will find three important arguments (other websites might use different ones): “csrf_token”, a token that changes dynamically, plus the “username” and “password” you typed in.

 

 

Editing Your Scrapy Code

Back to your code, you need to import the Scrapy class FormRequest. So at the top of your Scrapy spider’s code, type in:
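The class lives in the scrapy.http module, so the import line is:

    from scrapy.http import FormRequest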

 

and change the value of start_urls to the login page:
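For this demo site, that means pointing the spider at the /login page:

    start_urls = ['http://quotes.toscrape.com/login']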

 

Add your login code to the parse() function. The csrf_token changes every time the page is loaded, so you first need to extract its value from the page source itself. To do so, “Inspect” the page quotes.toscrape.com/login and you will find a hidden input whose name is csrf_token.
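A minimal sketch of that extraction, assuming the token sits in the value attribute of that hidden input, looks like this:

    def parse(self, response):
        # grab the CSRF token from the hidden input named "csrf_token" on the login page
        token = response.xpath('//*[@name="csrf_token"]/@value').extract_first()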

 

 

This XPath selector matches every HTML node whose name attribute equals csrf_token and extracts the value of the first match. As there is only one such node on the page, this returns the token you need.

 

For other websites, you can simply copy the “Form Data” arguments (as shown in the screenshot above); if one of them changes on every request, inspect the page to locate that value in the page source code and extract it into a variable.

Then return a FormRequest that includes the login details as well as the callback function that determines what you want to do or scrape on the page you are redirected to after logging in; here we will call it scrape_pages, for example. So your parse() function will look like this:
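Here is a sketch of that parse() function; the username and password are placeholders (on this demo site any combination is accepted), and FormRequest.from_response() is used so that the form’s action URL is taken from the login page itself:

    def parse(self, response):
        # extract the dynamically generated CSRF token
        token = response.xpath('//*[@name="csrf_token"]/@value').extract_first()
        # submit the login form and hand the resulting page to scrape_pages()
        return FormRequest.from_response(response,
                                         formdata={'csrf_token': token,
                                                   'username': 'foobar',
                                                   'password': 'foobar'},
                                         callback=self.scrape_pages)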

 

Testing Your Scrapy Login Code

If you want to test your code, you can add this line to the top of your code:
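That line is the open_in_browser helper from Scrapy’s response utilities:

    from scrapy.utils.response import open_in_browser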

 

and then, at the beginning of the scrape_pages() function, add this line, which opens the page you are redirected to after logging in directly in your browser:
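Assuming the callback is named scrape_pages as above, that looks like this:

    def scrape_pages(self, response):
        # open the post-login page in your default browser for a quick visual check
        open_in_browser(response)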

 

In the same scrape_pages() function, complete the code that scrapes the pages you need after logging in.
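As a sketch, assuming the usual markup of quotes.toscrape.com, where each quote sits in a div.quote element containing a span.text and a small.author, you could yield the quotes and their authors like this:

    def scrape_pages(self, response):
        open_in_browser(response)

        # scrape each quote and its author from the page shown after logging in
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }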

 

Finally, open your Terminal or Command Prompt and use the following command to run your spider; make sure you replace “quotes” with your own spider’s name if it is something else:
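For the spider in this tutorial, that command is:

    scrapy crawl quotes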

 

If everything is fine, a page will open in your browser showing what your program is scraping. In the current example, it will show the home page of the demo website, including a Logout button at the top, which indicates that you have successfully logged in.

Final Code

That is all for this Scrapy login tutorial, and here is the full code:
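Reassembled from the snippets above, a complete spider could look like the following sketch; the class name, spider name and credentials are placeholders you can adapt:

    import scrapy
    from scrapy.http import FormRequest
    from scrapy.utils.response import open_in_browser


    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['http://quotes.toscrape.com/login']

        def parse(self, response):
            # extract the dynamically generated CSRF token from the login page
            token = response.xpath('//*[@name="csrf_token"]/@value').extract_first()
            # submit the login form, then continue in scrape_pages()
            return FormRequest.from_response(response,
                                             formdata={'csrf_token': token,
                                                       'username': 'foobar',
                                                       'password': 'foobar'},
                                             callback=self.scrape_pages)

        def scrape_pages(self, response):
            # open the post-login page in the browser to confirm the login worked
            open_in_browser(response)

            # scrape the quotes and authors from the home page
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                }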

 

You can also download the code from GitHub.

 

If you have any questions, please feel free to send them in the comments below.

Scrapy Comprehensive Course

This tutorial is part of our comprehensive online course, Scrapy, Powerful Web Scraping & Crawling with Python – get 90% OFF using this coupon.

 

 
