Scrapy Tutorial: Web Scraping Craigslist

In this Scrapy tutorial, you will learn how to write a Craigslist crawler to scrape Craigslist's "Architecture & Engineering" jobs in New York and store the data in a CSV file.

This tutorial is one lecture of our comprehensive Scrapy online course on Udemy, Scrapy: Powerful Web Scraping & Crawling with Python

Scrapy Tutorial Getting Started

As you may already know, Scrapy is one of the most popular and powerful Python scraping frameworks. In this Scrapy tutorial we will explain how to use it on a real-life project, step by step.

Scrapy Installation

You can simply install Scrapy using pip with the following command:
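pip install scrapy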

If you are on Linux or Mac, you might need to start the command with sudo as follows:
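sudo pip install scrapy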

This will install all the dependencies as well.

Creating a Scrapy Project

Now, you need to create a Scrapy project. In your Terminal/CMD navigate to the folder in which you want to save your project and then run the following command:
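scrapy startproject craigslist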

 

Note that every Scrapy command starts with scrapy, followed by the command itself (here startproject, which creates a new Scrapy project) and then its arguments; in this case, the argument is the project name, which can be anything. We called it "craigslist" just as an example.

So typically any Scrapy command will look like this:
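scrapy <command> [arguments]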

Creating a Scrapy Spider

In your Terminal, navigate to the folder of the Scrapy project we created in the previous step. As we called the project craigslist, the folder has the same name, so the command is simply:
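cd craigslist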

 

After that, create the spider using the genspider command, giving it any name you like (here we call it jobs), followed by the URL of the page to scrape.
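scrapy genspider jobs https://newyork.craigslist.org/search/egr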

Editing the Scrapy Spider

Now, manually navigate to the craigslist folder, i.e. the Scrapy project folder. You can see that it is structured as in this screenshot.

Web Scraping Craigslist with Scrapy and Python

For now, we will concentrate on the spider file, which is here called jobs.py. Open it in any text editor and you will see something like the following:
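The exact boilerplate depends on your Scrapy version, but the generated file should look roughly like this:

# -*- coding: utf-8 -*-
import scrapy


class JobsSpider(scrapy.Spider):
    name = 'jobs'
    allowed_domains = ['https://newyork.craigslist.org/search/egr']
    start_urls = ['http://https://newyork.craigslist.org/search/egr/']

    def parse(self, response):
        pass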

 

Understanding the Scrapy Spider File

Let’s check the parts of the main class of the file automatically generated for our jobs Scrapy spider:

1- name  the name of the spider.

2- allowed_domains  the list of the domains that the spider is allowed to scrape.

3- start_urls  the list of one or more URL(s) with which the spider starts crawling.

4- parse  the main function of the spider. Do NOT change its name; however, you may add extra functions if needed.

 

Warning: Scrapy adds an extra http:// at the beginning of the URL in start_urls, and it also adds a trailing slash. As we already included https:// while creating the spider, we must delete the extra http://. So double-check that the URL(s) in start_urls are correct or the spider will not work.
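After the fix, start_urls should look like this:

start_urls = ['https://newyork.craigslist.org/search/egr/']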

Craigslist Scrapy Spider #1 – Titles

If you are new to Scrapy, let's start by extracting and retrieving only one element for the sake of clarity. This basic spider is just a foundation for the more sophisticated spiders in this Scrapy tutorial; it will only extract job titles.

Editing the parse() Function

Instead of pass, add this line to the parse() function:
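Based on the XPath rule explained below, the line looks like this:

titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()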

 

What does this mean?

titles  is a [list] of text portions extracted based on a rule.

response  is the whole response retrieved from the page. Actually, "response" has a deeper meaning: if you print(response) you will get something like <200 https://newyork.craigslist.org/search/egr>, which means "you have managed to connect to this web page"; however, if you print(response.body) you will get the whole HTML source code. Anyhow, when you use XPath expressions to extract HTML nodes, you apply them directly with response.xpath().

xpath  is how we extract portions of text, according to rules. XPath is a detailed topic and we will dedicate a separate article to it. For now, try to notice the following:

Open the URL in your browser, move the cursor over any job title, right-click, and select "Inspect". You will now see the HTML code, like this:

So, you want to extract "Chief Engineer", which is the text of an <a> tag, and as you can see this <a> tag has the class "result-title hdrlnk", which distinguishes it from the other <a> tags on the web page.

Let’s explain the XPath rule we have:

//  means: instead of starting from the root <html> tag, start matching from the tag specified right after it, wherever it appears in the document.

/a  simply refers to the <a> tag.

[@class="result-title hdrlnk"]  which comes directly after /a, means the <a> tag must have exactly this class attribute.

text()  refers to the text of the <a> tag, which is "Chief Engineer".

extract()  means extract every instance on the web page that follows the same XPath rule into a [list].

extract_first()  if you use it instead of extract() it will extract only the first item in the list.

 

Now, you can print titles:
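print(titles)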

 

Your “basic Scrapy spider” code should now look like this:
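A sketch of the spider at this stage (the class name is whatever genspider generated; here allowed_domains is trimmed to the bare domain):

import scrapy


class JobsSpider(scrapy.Spider):
    name = 'jobs'
    allowed_domains = ['craigslist.org']
    start_urls = ['https://newyork.craigslist.org/search/egr/']

    def parse(self, response):
        titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()
        print(titles)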

 

Running the Scrapy Spider

Now, move to your Terminal, make sure you are in the root directory of your Scrapy project craigslist and run the spider using the following command:
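scrapy crawl jobs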

 

In Terminal, you will get a similar result; it is a [list] including the job titles (the titles can vary from day to day):

[u'Junior/ Mid-Level  Architect for Immediate Hire', u'SE BUSCA LLANTERO/ LOOKING FOR TIRE CAR WORKER CON EXPERIENCIA', u'Draftsperson/Detailer', u'Controls/ Instrumentation Engineer', u'Project Manager', u'Tunnel Inspectors - Must be willing to Relocate to Los Angeles', u'Senior Designer - Large Scale', u'Construction Estimator/Project Manager', u'CAD Draftsman/Estimator', u'Project Manager']

As you can see, the result is a list of Unicode strings. So you can loop over them and yield one title at a time in the form of a dictionary, as shown below.
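for title in titles:
    yield {'Title': title}  # 'Title' is just an example key; the key becomes the CSV column header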

 

Your “basic Scrapy spider” code should now look like this:
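import scrapy


class JobsSpider(scrapy.Spider):
    name = 'jobs'
    allowed_domains = ['craigslist.org']
    start_urls = ['https://newyork.craigslist.org/search/egr/']

    def parse(self, response):
        titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()
        for title in titles:
            yield {'Title': title}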

Storing the Scraped Data to CSV

You can now run your spider and store the output data into CSV, JSON or XML. To store the data into CSV, run the following command in Terminal. The result will be a CSV file called result-titles.csv in your Scrapy project directory.
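scrapy crawl jobs -o result-titles.csv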

At the end of the crawl, your Terminal prints a stats summary; it indicates whether the run succeeded.

'item_scraped_count' refers to the number of titles scraped from the page. 'log_count/DEBUG' and 'log_count/INFO' are okay; however, if you receive 'log_count/ERROR' you should find out which errors occurred during scraping and fix your code.

In Terminal, you will also notice DEBUG messages showing each crawled URL together with its response status code.

The status code 200 means the request has succeeded. Also, 'downloader/response_status_count/200' in the stats tells you how many requests succeeded. There are many other status codes with different meanings; in web scraping, some of them can indicate that the website is blocking your requests as a defense mechanism against scraping.

Craigslist Scrapy Spider #2 – One Page

In the second part of this Scrapy tutorial, we will scrape the details of Craigslist's "Architecture & Engineering" jobs in New York. For now, you will start with only one page. In the third part of the tutorial, you will learn how to navigate to the next pages.

Before starting this Scrapy exercise, it is very important to understand the main approach:

The Secret: Wrapper

In the first part of this Scrapy tutorial, we extracted titles only. However, if you want to scrape several details about each job, you do not extract each detail separately and then loop over the separate lists. No! Instead, you first select the whole "container" or "wrapper" of each job, including all the information you need, and then extract the individual pieces of information from each container/wrapper.

To see what this container/wrapper looks like, right-click any job on the Craigslist page and select "Inspect"; you will see this:

As you can see, each result is inside an HTML list <li> tag.

If you expand the <li> tag, you will see this HTML code:

As you can see, the <li> tag includes all the information you need; so you can consider it your “wrapper”. Actually, you can even start from the <p> tag which includes the same information you need. The <li> is distinguished by the class “result-row” while the <p> is distinguished by the class “result-info” so each of them is unique, and can be easily distinguished  by your XPath expression.

Note: If you are using the same spider from the Basic Scrapy Spider, delete any code under the parse() function, and start over, or just copy the file into the same “spiders” folder and change  name = "jobs" in the basic spider to anything else like name = "jobs-titles" and keep the new one name = "jobs" as is.

Extracting All Wrappers

As we agreed, you first need to scrape all the wrappers from the page. So under the parse() function, write the following:
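def parse(self, response):
    jobs = response.xpath('//p[@class="result-info"]')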

Note that here you will not use extract() because it is the wrapper from which you will extract other HTML nodes.

Extracting Job Titles

You can extract the job titles from the wrappers using a for loop as follows:
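A sketch of the loop ('Title' is just an example key name):

for job in jobs:
    title = job.xpath('a/text()').extract_first()
    yield {'Title': title}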

The first observation is that in the for loop, you do not use "response" (which you already used to extract the wrappers). Instead, you use the wrapper selector, which is referred to here as "job".

Also, as you can see, we started the XPath expression of "jobs" with //, meaning it searches from the root <html> tag all the way down to any <p> whose class name is "result-info".

However, we started the XPath expression of "title" without any slashes, because it complements, or depends on, the XPath expression of the job wrapper. If you would rather use slashes, you have to precede them with a dot to refer to the current node, as follows:
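title = job.xpath('./a/text()').extract_first()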

As we explained in the first part of this Scrapy tutorial, a refers to the first <a> tag inside the <p> tag, and text() refers to the text inside the <a> tag which is the job title.

Here, we are using extract_first() because in each iteration of the loop, we are in a wrapper with only one job.

What is the difference between the above code and what we had in the first part of the tutorial? So far, this will give the same result. However, since you want to extract and yield more than one element from each wrapper, this approach is more straightforward: you can now also extract other elements, like the address and URL of each job.

Extracting Job Addresses and URLs

You can extract the job address and URL from the wrappers using the same for loop as follows:
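A sketch of the extended loop (the variable names address and relative_url are just illustrative):

for job in jobs:
    title = job.xpath('a/text()').extract_first()
    address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
    relative_url = job.xpath('a/@href').extract_first()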

To extract the job address, you refer to the <span> tag whose class name is "result-meta", then the <span> tag whose class name is "result-hood", and then the text() in it. The address is wrapped in parentheses, like (Brooklyn); so if you want to remove them, you can use string slicing [2:-1]. However, this string slicing will not work if there is no address (which is the case for some jobs) because the value will be None, which is not a string! So you have to pass empty quotes to extract_first(""), which means: if there is no result, return "" instead of None.

To extract the job URL, you refer to the <a> tag and the value of the href attribute, which is the URL. Yes, @ means an attribute.

However, this is a relative URL, which looks like /brk/egr/6112478644.html. To get the absolute URL so you can use it later, you can either use Python string concatenation as follows:
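For example, using the site's base URL:

absolute_url = 'https://newyork.craigslist.org' + relative_url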

Note: Concatenation is another case in which you need to add quotes to extract_first("") because concatenation works only on strings and cannot work on None values.

Otherwise, you can simply use the urljoin() method, which builds a full absolute URL:
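absolute_url = response.urljoin(relative_url)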

Finally, yield your data using a dictionary:
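For example, with key names like these (the keys become the CSV column headers):

yield {'URL': absolute_url, 'Title': title, 'Address': address}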

Running the Spider and Storing Data

Just as we did in the first part of this Scrapy tutorial, you can use this command to run your spider and store the scraped data to a CSV file.
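For example (the output file name is up to you):

scrapy crawl jobs -o result-jobs-one-page.csv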

Craigslist Scrapy Spider #3 – Multiple Pages

To do the same for all the result pages of Craigslist’s Architecture & Engineering jobs, you need to extract the “next” URLs and then apply the same parse() function on them.

Extracting Next URLs

To extract the "next" URL, right-click the "next" button on the first page and select "Inspect". Here is what the HTML code looks like:

 

So here is the code you need. Be careful! This part must now go outside the for loop.
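A sketch of that code, placed at the same indentation level as the for loop:

relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
absolute_next_url = response.urljoin(relative_next_url)
# on the last results page there is no "next" link, so you may want to check for None here
yield Request(absolute_next_url, callback=self.parse)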

Note that the next URL is in an <a> tag whose class name is "button next". You need to extract the value of the href attribute, which is why you use @href.

Just as you did in the previous part of this Scrapy tutorial, you need to build the absolute next URL using the urljoin() method.

Finally, yield a Request() with the absolute_next_url. This requires a callback function, i.e. a function to apply to this URL; in this case, it is the same parse() function, which extracts the titles, addresses and URLs of jobs from each page. Note that when parse() is the callback, you can omit callback=self.parse, because parse() is the default callback even if you do not state it explicitly.

Of course, you must import Request before using it:
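from scrapy import Request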

Running the Spider and Storing Data

Again, just as we did in the first and second parts of this Scrapy tutorial, you can run your spider and save the scraped data to a CSV file using this command:
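scrapy crawl jobs -o result-jobs-multi-pages.csv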

Craigslist Scrapy Spider #4 – Job Descriptions

In this spider, we will open the URL of each job and scrape its description. The code we wrote so far will change; so if you like, you can copy the file into the same “spiders” folder, and change the spider name of the previous one to be something like name = "jobsall" and you can keep the new file name = "jobs" as is.

Passing the Data to a Second Function

In the parse() function, we had the following yield in the for loop:
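yield {'URL': absolute_url, 'Title': title, 'Address': address}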

We yielded the data at this stage because we were scraping all the details of each job from the listing page. Now, however, you need to create a new function, called for example parse_page(), to open the URL of each job and scrape the job description from the job's dedicated page. So you have to pass the URL of the job from the parse() function to the parse_page() function as a callback, using the Request() method as follows:
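yield Request(absolute_url, callback=self.parse_page)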

What about the data you have already extracted? You need to pass those values of titles and addresses from the parse() function to the parse_page() function as well, using meta in a dictionary as follows:
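yield Request(absolute_url, callback=self.parse_page,
              meta={'URL': absolute_url, 'Title': title, 'Address': address})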

Function to Extract Job Descriptions

So the parse_page() function will be as follows:
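A sketch, assuming the job description lives in Craigslist's postingbody element (check the page and adjust the selector if the markup has changed):

def parse_page(self, response):
    url = response.meta.get('URL')
    title = response.meta.get('Title')
    address = response.meta.get('Address')

    # '#postingbody' is an assumption about Craigslist's markup
    description = "".join(response.xpath('//*[@id="postingbody"]/text()').extract()).strip()

    yield {'URL': url, 'Title': title, 'Address': address, 'Description': description}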

Here, you should deal with meta as a dictionary and assign each value to a new variable. To get the value of each key in the dictionary, you can use Python's dictionary get() method as usual.

Alternatively, you can go the other way around: add the new "description" value to the meta dictionary and then simply yield meta to return all the data at once, as follows:
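def parse_page(self, response):
    # same selector assumption as above; note that meta also carries a few
    # internal Scrapy keys (e.g. depth), which will then appear in the output too
    response.meta['Description'] = "".join(
        response.xpath('//*[@id="postingbody"]/text()').extract()).strip()
    yield response.meta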

Note that a job description might be in more than one paragraph;  so we used join() to merge them.

Now, we can also extract “compensation” and “employment type”. They are in a <p> tag whose class name is “attrgroup” and then each of them is in a <span> tag, but with no class or id to distinguish them.

So we can use span[1] and span[2] to refer to the first <span> tag of “compensation” and the second <span> tag of “employment type” respectively.
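Something like the following should work (adjust the path if Craigslist's markup has changed):

compensation = response.xpath('//p[@class="attrgroup"]/span[1]/text()').extract_first()
employment_type = response.xpath('//p[@class="attrgroup"]/span[2]/text()').extract_first()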

Actually, this is the same as using list indexing. In this case, you use extract() instead of extract_first() and follow the expression with [0] or [1], which can be added directly after the XPath expression as shown below, or even after extract(). Note that in the previous way we counted the tags from [1], because XPath indices start at 1, while here we count from [0], just like any Python list.
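compensation = response.xpath('//p[@class="attrgroup"]/span/text()')[0].extract()
employment_type = response.xpath('//p[@class="attrgroup"]/span/text()')[1].extract()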

Storing Scrapy Output Data to CSV, XML or JSON

You can run the spider and save the scraped data to either CSV, XML or JSON as follows:
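For example (the output file names are up to you):

scrapy crawl jobs -o result-jobs.csv
scrapy crawl jobs -o result-jobs.json
scrapy crawl jobs -o result-jobs.xml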

Editing Scrapy settings.py

In the same Scrapy project folder, you can find a file called settings.py which includes several options you can use. Let's explain a couple of regularly used options. Note that sometimes the option is already in the file and you just have to uncomment it, and sometimes you have to add it yourself.

Arranging Scrapy Output CSV Columns

As you might have noticed by now, Scrapy does not arrange the CSV columns in the same order as your dictionary keys.

Open the file settings.py found in the Scrapy project folder and add this line; it will give you the order you need:
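FEED_EXPORT_FIELDS lists the columns in the order you want; the field names here are examples and must match the keys you yield:

FEED_EXPORT_FIELDS = ['Title', 'Address', 'URL', 'Description', 'Compensation', 'Employment Type']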

 

Note: If you run the CSV-generating command again, make sure you first delete the existing CSV file; otherwise, the new output will be appended to the end of it.

Setting a User Agent

When you navigate a website with a regular browser, the browser sends what is known as a "User Agent" header with every page you access. So it is recommended to set the Scrapy USER_AGENT option while web scraping to make your requests look more natural. The option is already in Scrapy's settings.py, so you can enable it by deleting the # sign.

You can find lists of user agents online; select a recent one for Chrome or Firefox. You can even check your own user agent; however, it is not required to use the same one in Scrapy.

Here is an example:
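Any recent browser user agent string will do, for example:

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36'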

Setting Scrapy DOWNLOAD_DELAY

The option DOWNLOAD_DELAY is already there in Scrapy settings.py so you can just enable it by deleting the # sign. According to Scrapy documentation, “this can be used to throttle the crawling speed to avoid hitting servers too hard.”

You can see that the suggested value is 3 seconds; you can make it shorter or longer. The crawl still behaves naturally because another option, RANDOMIZE_DOWNLOAD_DELAY, is enabled by default: it multiplies DOWNLOAD_DELAY by a random factor between 0.5 and 1.5, so the actual delay varies from request to request.
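DOWNLOAD_DELAY = 3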

Final Scrapy Tutorial Spider Code

So the whole code of this Scrapy tutorial is as follows. Try it yourself; if you have questions, feel free to send a comment.
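A sketch of the final spider, assembled from the steps above; the selectors reflect Craigslist's markup at the time of writing, and the dictionary keys and class name are illustrative:

import scrapy
from scrapy import Request


class JobsSpider(scrapy.Spider):
    name = 'jobs'
    allowed_domains = ['craigslist.org']
    start_urls = ['https://newyork.craigslist.org/search/egr/']

    def parse(self, response):
        # each <p class="result-info"> wraps one job listing
        jobs = response.xpath('//p[@class="result-info"]')

        for job in jobs:
            title = job.xpath('a/text()').extract_first()
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
            relative_url = job.xpath('a/@href').extract_first()
            absolute_url = response.urljoin(relative_url)

            yield Request(absolute_url, callback=self.parse_page,
                          meta={'URL': absolute_url, 'Title': title, 'Address': address})

        # follow the "next" button; the None check is a small safety addition for the last page
        relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        if relative_next_url:
            absolute_next_url = response.urljoin(relative_next_url)
            yield Request(absolute_next_url, callback=self.parse)

    def parse_page(self, response):
        url = response.meta.get('URL')
        title = response.meta.get('Title')
        address = response.meta.get('Address')

        # '#postingbody' and 'attrgroup' are assumptions about Craigslist's markup
        description = "".join(
            response.xpath('//*[@id="postingbody"]/text()').extract()).strip()
        compensation = response.xpath('//p[@class="attrgroup"]/span[1]/text()').extract_first()
        employment_type = response.xpath('//p[@class="attrgroup"]/span[2]/text()').extract_first()

        yield {'URL': url, 'Title': title, 'Address': address,
               'Description': description, 'Compensation': compensation,
               'Employment Type': employment_type}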

 

Craigslist Scrapy Tutorial on GitHub

You can also find all the spiders we explained in this Python Scrapy tutorial on GitHub (Craigslist Scraper).

Scrapy Comprehensive Course

This tutorial is part of our comprehensive online course, Scrapy, Powerful Web Scraping & Crawling with Python – get 90% OFF using this coupon.

 


35 Replies to “Scrapy Tutorial: Web Scraping Craigslist”

  1. Hello! Firstly, thank you so, so much for this. I’m a grad student in Economics and I want to research discriminatory signals in job ads that require the applicant to send a photo. I have very little programming experience, but I am able to follow these directions to accomplish pretty much everything I need, except one thing: I am not able to capture the body of the post (in my case, the description of the job that the hiring body has written) in the .csv file. I’ve tried a few different arrangements, but I can’t get it to work. The date, url, and title are all saving perfectly, but the column titled Description remains unpopulated. Have you ever had this problem before? Any help is appreciated!

  2. I just left a comment about not being able to see the description but scratch that! I re-installed it and now it’s populating. THANK you again very much! I’ll credit this tutorial in my paper.

  3. Super awesome tutorial man! Thanks a lot 🙂

    1. You are very welcome! 🙂

4. Hey, everything else is working okay, but the csv file that is being created doesn't contain anything using this code

    1. Hi Nik! Do you receive any error messages? Do you have the latest version of Scrapy?

      1. Another question: Do you get the data in the Terminal? If not, are you able to open the website from a regular browser? If not, maybe you are blocked and you have to wait until you get unblocked or change your IP. Please let me know.

  5. Does this code work in December 2017? I have tried using it but I am getting blank csv file

    1. Hi Nik! Yes, I have tested it today just now, and it works fine. (See my reply to your other comment.)

    2. Hey,
      Thank for the speedy reply. I installed everything today so it most likely is. However, I am getting this warning “You do not have a working installation of the service_identity module: ‘cannot import name opentype’. Please install it from ”
      The above error is displayed when I enter the command “scrapy genspider jobs https://newyork.craigslist.org/search/egr
      Great Tutorial by the way

      1. Hi Nik! So please install it:
        pip install service_identity

  6. I can follow through till extracting the titles, but on storing in csv, it is showing some random chinese input. I have even tried running your github code. Do you have any idea why it may be?

    1. Hi Hitesh! Does the website originally have a non-Latin language or accented characters? Feel free to share the website link to have a look (I will not publish it.)

  7. Hi,

    Just wanted to thank you for this excellent tutorial! It helped me to get a decent understanding of the scrapy basics. So thanks a lot!

    Cheers,

    Kerem

    1. You are very welcome, Kerem! All the best!

  8. Enjoyed reading through this, very good stuff, thankyou .

    1. You are very welcome! Thanks!

  9. Hi

    For the third example this the code I have: https://www.pastiebin.com/5a95360ab840c

    But its only fetching 120 results from the first page. It is not fetching results from the next page.

  10. I’ve solved the issue.
    This works.
    https://www.pastiebin.com/5a95401e9ea67

  11. Can scrapy extract Phone numbers from each ad?

    1. Hi Anne! I see some ads have “show contact info”; by clicking it, it shows a phone number. If this is what you mean, you need Selenium to click the button; then you can extract the phone number. Try and let me know if you have more questions.

  12. Hi,
    I start to learn how to use Scrapy to crawl web content and with this tutorial, when I try to crawl at the first step (just the title) I got an error called: “twisted.internet.error.DNSLookupError”
    and I open “https://newyork.craigslist.org/search/egr/” fine with browsers (both Firefox and Chrome)
    I’ve searched for a while but nowhere specifies this problem. Please help me with this. Thanks in advances

    1. Hello Hai! What happens when you type the following in your Terminal?
      scrapy shell "https://newyork.craigslist.org/search/egr/"

  13. Can you tell me how to scrape the email address along with the titles?

    1. Nikeshh, you need Selenium to click the “Reply” button to be able to extract the email address from the box that will appear.

  14. When I try to mine the titles on the next page, I get this error:
    ERROR: Spider must return Request, BaseItem, dict or None, got ‘set’ in >
    For which I only receive the first 121 titles of the first page, not including the second page.

How do I solve this?

    thanks

    1. Antonio, could you please show your code.

  15. Hello all
    Could you give me an advice, please?
    I did everything like the tutorial above says.
    But after a command:
    “scrapy crawl jobs” in PowerShell

    I got this troubleshot message with no outcome.
    Thanks a lot for each tip.
    John

    PS D:\plocha2\pylab\scrapy\ntk> scrapy crawl jobs
    2018-07-31 12:56:00 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: ntk)
    2018-07-31 12:56:00 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 16:07:46) [MSC v.1900 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
    2018-07-31 12:56:00 [scrapy.crawler] INFO: Overridden settings: {‘BOT_NAME’: ‘ntk’, ‘NEWSPIDER_MODULE’: ‘ntk.spiders’, ‘ROBOTSTXT_OBEY’: True, ‘SPIDER_MODULES’: [‘ntk.spiders’]}
    2018-07-31 12:56:00 [scrapy.middleware] INFO: Enabled extensions:
    [‘scrapy.extensions.corestats.CoreStats’,
    ‘scrapy.extensions.telnet.TelnetConsole’,
    ‘scrapy.extensions.logstats.LogStats’]
    Unhandled error in Deferred:
    2018-07-31 12:56:00 [twisted] CRITICAL: Unhandled error in Deferred:

    2018-07-31 12:56:00 [twisted] CRITICAL:
    Traceback (most recent call last):
    File “d:\install\python\lib\site-packages\twisted\internet\defer.py”, line 1418, in _inlineCallbacks
    result = g.send(result)
    File “d:\install\python\lib\site-packages\scrapy\crawler.py”, line 80, in crawl
    self.engine = self._create_engine()
    File “d:\install\python\lib\site-packages\scrapy\crawler.py”, line 105, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
    File “d:\install\python\lib\site-packages\scrapy\core\engine.py”, line 69, in __init__
    self.downloader = downloader_cls(crawler)
    File “d:\install\python\lib\site-packages\scrapy\core\downloader\__init__.py”, line 88, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
    File “d:\install\python\lib\site-packages\scrapy\middleware.py”, line 58, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
    File “d:\install\python\lib\site-packages\scrapy\middleware.py”, line 34, in from_settings
    mwcls = load_object(clspath)
    File “d:\install\python\lib\site-packages\scrapy\utils\misc.py”, line 44, in load_object
    mod = import_module(module)
    File “d:\install\python\lib\importlib\__init__.py”, line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
    File “”, line 994, in _gcd_import
    File “”, line 971, in _find_and_load
    File “”, line 955, in _find_and_load_unlocked
    File “”, line 665, in _load_unlocked
    File “”, line 678, in exec_module
    File “”, line 219, in _call_with_frames_removed
    File “d:\install\python\lib\site-packages\scrapy\downloadermiddlewares\retry.py”, line 20, in
    from twisted.web.client import ResponseFailed
    File “d:\install\python\lib\site-packages\twisted\web\client.py”, line 41, in
    from twisted.internet.endpoints import HostnameEndpoint, wrapClientTLS
    File “d:\install\python\lib\site-packages\twisted\internet\endpoints.py”, line 41, in
    from twisted.internet.stdio import StandardIO, PipeAddress
    File “d:\install\python\lib\site-packages\twisted\internet\stdio.py”, line 30, in
    from twisted.internet import _win32stdio
    File “d:\install\python\lib\site-packages\twisted\internet\_win32stdio.py”, line 9, in
    import win32api
    ModuleNotFoundError: No module named ‘win32api’
    PS D:\plocha2\pylab\scrapy\ntk>

1. John, to solve the error "No module named win32api", you have to install it like this:

      pip install pypiwin32

  16. Good tutorial however, I want to know if scrapy can actually go down a second level from the start url. Say for example I want to get the email for replying to a job ad which craiglists has intentionally blocked from non-residents of a particular geolocation. Thanks in advance.

    1. Shexton, in this case, you will need to use a proxy IP from that location.

  17. when i am trying to scrap the data from wikipedia, it is showing an error “10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond”.How can i resolve this issue?

    1. Hi Indhu! Maybe you got blocked; try to use time.sleep() to add some time between requests. If this does not help, you might need to add proxies.

  18. Can you help me with scraping AJAX pages

    1. Hi Harry! You can use Selenium alone or with Scrapy. Check this Selenium tutorial.

