In this tutorial, you will learn how to use Scrapy to log into websites that require entering a username and password before showing specific pages.
If you are new to Scrapy, please make sure you study the introductory Scrapy tutorial to learn how to create a project and crawler and how to scrape web pages.
In this tutorial, we will use this website: quotes.toscrape.com (a demo website originally developed for learning purposes). As you can see, every page of the site has a Login button that redirects to a /login page. On that login page, you can type in any combination of username and password, and when you press the Login button, you will be redirected to the home page, which now shows a Logout button, meaning that you are logged in.
On a real website, after you log in, you will have access to data, URLs, and other things that you would not have access to otherwise, but this demo website does not go that far.
Analyzing Login Request
Now, let’s see how to log in using Scrapy. First of all, make sure you are logged out, open the Login page in your browser (Chrome or Firefox), right-click the page, select “Inspect”, and go to the “Network” tab, where you can analyze the traffic and see which URLs the server requests while you log in.
There are two requests in this case: a POST and a GET. After you click the Login button, the POST request returns a 302 status, which means you are redirected from the login page to another page. Here is a screenshot from Chrome:
Click on this request and you will see several tabs, including “Headers”, the one you need. Scroll down to the “Form Data” section, where there are three important arguments (other websites might use different ones): “csrf_token”, a token that changes dynamically, plus the combination of “username” and “password”.
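To get a feel for what “Form Data” contains, here is a minimal sketch, using only Python’s standard library (not Scrapy), of how the form fields can be collected from a login page. The sample HTML below is a simplified, made-up stand-in for the real /login markup:

```python
from html.parser import HTMLParser

class FormFieldParser(HTMLParser):
    """Collects the name/value pairs of <input> elements on a page."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            attrs = dict(attrs)
            if "name" in attrs:
                # Inputs without a value attribute get an empty string
                self.fields[attrs["name"]] = attrs.get("value", "")

# Simplified stand-in for the real /login page markup
html = """
<form action="/login" method="post">
  <input type="hidden" name="csrf_token" value="AbCdEf123456">
  <input type="text" name="username">
  <input type="password" name="password">
</form>
"""
parser = FormFieldParser()
parser.feed(html)
print(parser.fields)
```

These are exactly the three fields you see under “Form Data” in the browser: the hidden token plus the two credentials the user types in.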
Editing Your Scrapy Code
Back in your code, you need to import Scrapy’s FormRequest class. So at the top of your Scrapy spider’s code, add:
```python
from scrapy.http import FormRequest
```
and change the value of start_urls to:
```python
start_urls = ('http://quotes.toscrape.com/login',)
```
Add your login code to the parse() function. In this case, the csrf_token changes every time you refresh the page, so you first need to extract its value from the page source itself. To do so, “Inspect” the page quotes.toscrape.com/login and you will find an input named csrf_token.
This XPath selector matches all HTML nodes whose name attribute equals csrf_token and extracts the value of the first match. As there is only one such node on the page, this returns the token you need.
```python
token = response.xpath('//*[@name="csrf_token"]/@value').extract_first()
```
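If you want to sanity-check this kind of extraction outside a spider, the same idea can be sketched with the standard library’s re module. This is only a rough stand-in for the XPath selector, on a made-up snippet of markup; real pages may format the attributes differently:

```python
import re

# Made-up snippet standing in for the hidden input on the login page
html = '<input type="hidden" name="csrf_token" value="AbCdEf123456">'

# Rough equivalent of //*[@name="csrf_token"]/@value for this simple markup
match = re.search(r'name="csrf_token"\s+value="([^"]+)"', html)
token = match.group(1) if match else None
print(token)  # AbCdEf123456
```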
So for other websites, you can just copy the “Form Data” arguments (as shown in the screenshot above), and if you find one of the details changing, try to inspect the page to find the changing detail in the page source code and extract it into a variable.
Then return a FormRequest including the login details and the name of the callback function, which determines what you want to do with, or scrape from, the page you are redirected to after logging in; here we will call it scrape_pages, for example. So your parse function will look like this:
```python
def parse(self, response):
    token = response.xpath('//*[@name="csrf_token"]/@value').extract_first()
    return FormRequest.from_response(response,
                                     formdata={'csrf_token': token,
                                               'password': 'foobar',
                                               'username': 'foobar'},
                                     callback=self.scrape_pages)
```
Testing Your Scrapy Login Code
If you want to test your code, you can add this line to the top of your code:
```python
from scrapy.utils.response import open_in_browser
```
and then at the beginning of the scrape_pages() function, add this line, which will open in your browser the exact page you are redirected to after logging in:
```python
open_in_browser(response)
```
In the same scrape_pages() function, complete the code that scrapes the pages you need after logging in.
Finally, open your Terminal or Command Prompt and use the following command to run your spider; make sure you replace “quotes” with your own spider’s name if it is something else:
```
scrapy crawl quotes
```
If everything is fine, a page will open in your browser showing what your program is scraping. In the current example, it will show the home page of the demo website, including a Logout button at the top, which indicates that you have successfully logged in.
Final Code
This is all for this Scrapy login tutorial, and here is the full code:
```python
# -*- coding: utf-8 -*-
from scrapy import Spider
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser


class QuotesSpider(Spider):
    name = 'quotes'
    start_urls = ('http://quotes.toscrape.com/login',)

    def parse(self, response):
        token = response.xpath('//*[@name="csrf_token"]/@value').extract_first()
        return FormRequest.from_response(response,
                                         formdata={'csrf_token': token,
                                                   'password': 'foobar',
                                                   'username': 'foobar'},
                                         callback=self.scrape_pages)

    def scrape_pages(self, response):
        open_in_browser(response)

        # Complete your code here to scrape the pages that you are redirected to after logging in
        # ....
        # ....
```
You can also download the code from GitHub.
If you have any questions, please feel free to send them in the comments below.
Scrapy Comprehensive Course
This tutorial is part of our comprehensive online course, Scrapy, Powerful Web Scraping & Crawling with Python – get 90% OFF using this coupon.
Full-time web scraper; has worked on projects dealing with automation, website scraping, crawling, and exporting data to various formats. Over the years, has worked with 100+ different individuals and companies and helped them achieve their goals.
Dear Lazar,
Many thanks for this great tutorial. I have tried many and it’s the only one that (almost) worked for a project I am working on.
Here is my issue: When I apply your code to the website I want to scrape the ‘open_in_browser(response)’ tries to open C:/Users/[user]/AppData/Local/Temp/[filename].html
I know that the code before is working fine because when I fill in the wrong password it brings me to the login page perfectly fine (displaying the ‘ wrong password’ message as you’d expect).
Is this some kind of defense against bots logging in to the page, or am I overlooking something else? For example, the page that the site directs you to after logging in (a user dashboard) has a different URL. Should that be included somewhere in the code?
I am really a rookie at this, so any thoughts you may have on this issue would be greatly appreciated.
Thanks again and best regards,
Geert
Thanks for your words! It is difficult to tell without seeing the website, but generally speaking, maybe. Try Selenium and see if it works.
Thanks for the nice tutorial.
Why do you manually extract the csrf_token?
The method from_response() will do it for you.
Hello Lazar,
thank you for your post, really helpful 🙂
I’m trying to implement this on this page: https://aucstore.com/login.html, but in the form data there is only the timestamp.
Do you have any idea what I could try?
Thank You!!
@sf – if Scrapy does not work for your page login, try Selenium.
Yes!! Thank you! It works with Selenium.