In this Scrapy tutorial, you will learn how to write a Craigslist crawler to scrape Craigslist's "Architecture & Engineering" jobs in New York and store the data in a CSV file.
This tutorial is one lecture of our comprehensive Scrapy online course on Udemy, Scrapy: Powerful Web Scraping & Crawling with Python.
Scrapy Tutorial Getting Started
As you may already know, Scrapy is one of the most popular and powerful Python scraping frameworks. In this Scrapy tutorial we will explain how to use it on a real-life project, step by step.
Scrapy Installation
You can simply install Scrapy using pip with the following command:
$ pip install scrapy
If you are on Linux or Mac, you might need to start the command with sudo as follows:
$ sudo pip install scrapy
This will install all the dependencies as well.
Creating a Scrapy Project
Now, you need to create a Scrapy project. In your Terminal/CMD navigate to the folder in which you want to save your project and then run the following command:
$ scrapy startproject craigslist
Note that every Scrapy command starts with scrapy, followed by the command itself; here it is startproject, which creates a new Scrapy project, and it should be followed by the project name, which can be anything. We called it "craigslist" just as an example.
So typically any Scrapy command will look like this:
$ scrapy <command> [options] [args]
Creating a Scrapy Spider
In your Terminal, navigate to the folder of the Scrapy project we created in the previous step. Since we called it craigslist, the folder has the same name and the command is simply:
$ cd craigslist
After that, create the spider using the genspider command and give it any name you like; here we will call it jobs. The name should then be followed by the URL:
$ scrapy genspider jobs https://newyork.craigslist.org/search/egr
Editing the Scrapy Spider
Now, manually navigate to the craigslist folder, i.e. the Scrapy project folder. You can see that it is structured as in this screenshot; a typical layout is also sketched below.
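Assuming a recent Scrapy version (the exact files may differ slightly between versions), the project layout looks roughly like this:

craigslist/
    scrapy.cfg          # deploy configuration file
    craigslist/         # the project's Python module
        __init__.py
        items.py        # item definitions
        middlewares.py  # spider and downloader middlewares
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/
            __init__.py
            jobs.py     # the spider created by the genspider command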
For now, we will be concentrating on the spider file, which is here called jobs.py – open it in any text editor and you will have the following:
# -*- coding: utf-8 -*-
import scrapy


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ['https://newyork.craigslist.org/search/egr']

    def parse(self, response):
        pass
Understanding the Scrapy Spider File
Let’s check the parts of the main class of the file automatically generated for our jobs Scrapy spider:
1- name of the spider.
2- allowed_domains the list of the domains that the spider is allowed to scrape.
3- start_urls the list of one or more URL(s) with which the spider starts crawling.
4- parse the main function of the spider. Do NOT change its name; however, you may add extra functions if needed.
Warning: Scrapy may add an extra http:// at the beginning of the URL in start_urls, and it may also add a trailing slash. As we already used https:// while creating the spider, we must delete the extra http://. So double-check that the URL(s) in start_urls are correct, or the spider will not work.
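As a quick illustration (the exact output of genspider can vary by Scrapy version, so treat the first line as a possible result rather than a guarantee), the broken and the corrected versions look like this:

# what genspider may have generated (note the extra http:// and trailing slash)
start_urls = ['http://https://newyork.craigslist.org/search/egr/']

# what it should be
start_urls = ['https://newyork.craigslist.org/search/egr']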
Craigslist Scrapy Spider #1 – Titles
If you are new to Scrapy, let’s start by extracting and retrieving only one element for the sake of clarification. We are just starting with this basic spider as a foundation for more sophisticated spiders in this Scrapy tutorial. This simple spider will only extract job titles.
Editing the parse() Function
Instead of pass, add this line to the parse() function:
titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()
What does this mean?
titles is a [list] of text portions extracted based on a rule.
response is simply the whole HTML source code retrieved from the page. Actually, "response" has a deeper meaning: if you print(response), you will get something like <200 https://newyork.craigslist.org/search/egr>, which means "you have managed to connect to this web page"; if you print(response.body) instead, you will get the whole source code. Either way, when you use XPath expressions to extract HTML nodes, you should call response.xpath() directly.
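If you want to see this for yourself before running the spider, you can experiment in the scrapy shell. This is just a sketch; the exact output will differ:

$ scrapy shell "https://newyork.craigslist.org/search/egr"
>>> response
<200 https://newyork.craigslist.org/search/egr>
>>> response.body[:80]   # the first 80 bytes of the whole HTML source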
xpath is how we will extract portions of text, and it has rules. XPath is a detailed topic and we will dedicate a separate article to it. For now, try to notice the following:
Open the URL in your browser, move the cursor on any job title, right-click, and select “Inspect“. You can see now the HTML code like this:
<a href="/brk/egr/6085878649.html" data-id="6085878649" class="result-title hdrlnk">Chief Engineer</a>
So, you want to extract "Chief Engineer", which is the text of an <a> tag. As you can see, this <a> tag has the class "result-title hdrlnk", which distinguishes it from the other <a> tags on the web page.
Let’s explain the XPath rule we have:
// means: instead of starting from the <html> root, start from whichever tag I specify right after it.
/a simply refers to the <a> tag.
[@class="result-title hdrlnk"], which comes directly after /a, means the <a> tag must have this class name.
text() refers to the text of the <a> tag, which is "Chief Engineer".
extract() extracts every instance on the web page that matches the XPath rule into a [list].
extract_first() extracts only the first matching item instead of the whole list (see the short example below).
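For example, here is how the two behave in the scrapy shell (a sketch; the actual titles will differ):

>>> response.xpath('//a[@class="result-title hdrlnk"]/text()').extract_first()
u'Chief Engineer'
>>> response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()
[u'Chief Engineer', u'Project Manager', ...]   # a list of all matching titles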
Now, you can print titles:
print(titles)
Your “basic Scrapy spider” code should now look like this:
# -*- coding: utf-8 -*-
import scrapy


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ['https://newyork.craigslist.org/search/egr']

    def parse(self, response):
        titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()
        print(titles)
Running the Scrapy Spider
Now, move to your Terminal, make sure you are in the root directory of your Scrapy project craigslist and run the spider using the following command:
$ scrapy crawl jobs
In Terminal, you will get a similar result; it is a [list] including the job titles (the titles can vary from day to day):
[u'Junior/ Mid-Level Architect for Immediate Hire', u'SE BUSCA LLANTERO/ LOOKING FOR TIRE CAR WORKER CON EXPERIENCIA', u'Draftsperson/Detailer', u'Controls/ Instrumentation Engineer', u'Project Manager', u'Tunnel Inspectors - Must be willing to Relocate to Los Angeles', u'Senior Designer - Large Scale', u'Construction Estimator/Project Manager', u'CAD Draftsman/Estimator', u'Project Manager']
As you can see, the result is a list of Unicode strings. So you can loop over them and yield one title at a time in the form of a dictionary.
for title in titles:
    yield {'Title': title}
Your “basic Scrapy spider” code should now look like this:
# -*- coding: utf-8 -*-
import scrapy


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ['https://newyork.craigslist.org/search/egr']

    def parse(self, response):
        titles = response.xpath('//a[@class="result-title hdrlnk"]/text()').extract()
        for title in titles:
            yield {'Title': title}
Storing the Scraped Data to CSV
You can now run your spider and store the output data into CSV, JSON or XML. To store the data into CSV, run the following command in Terminal. The result will be a CSV file called result-titles.csv in your Scrapy spider directory.
$ scrapy crawl jobs -o result-titles.csv
Your Terminal should show you a similar result, which indicates success:
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 5, 2, 17, 26, 30, 348412),
'item_scraped_count': 120,
'log_count/DEBUG': 123,
'log_count/INFO': 8,
'item_scraped_count' refers to the number of titles scraped from the page. 'log_count/DEBUG' and 'log_count/INFO' are okay; however, if you receive 'log_count/ERROR', you should find out which errors occurred during scraping and fix your code.
In Terminal, you will notice debug messages like:
DEBUG: Scraped from <200 https://newyork.craigslist.org/search/egr>
The status code 200 means the request has succeeded. Also, 'downloader/response_status_count/200' tells you how many requests succeeded. There are many other status codes with different meanings; some of them (such as 403 or 503) can indicate that the website is defending itself against web scraping.
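If you ever want to check the status yourself inside the spider, you can read it from response.status. Note that, by default, Scrapy only passes successful (2xx) responses to your callback, so this is mostly a sanity check; a minimal sketch:

def parse(self, response):
    # response.status holds the HTTP status code of the downloaded page
    if response.status != 200:
        self.logger.warning("Got status %s for %s", response.status, response.url)
    # ... continue extracting data as before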
Craigslist Scrapy Spider #2 – One Page
In the second part of this Scrapy tutorial, we will scrape the details of Craigslist's "Architecture & Engineering" jobs in New York. For now, you will start with only one page. In the third part of the tutorial, you will learn how to navigate to the next pages.
Before starting this Scrapy exercise, it is very important to understand the main approach:
The Secret: Wrapper
In the first part of this Scrapy tutorial, we extracted titles only. However, if you want to scrape several details about each job, you do not extract each detail separately and then loop over the separate lists. No! Instead, you select the whole "container" or "wrapper" of each job, including all the information you need, and then extract the individual pieces of information from each container/wrapper.
To see what this container/wrapper looks like, right-click any job on the Craigslist page and select "Inspect"; you will see this:
As you can see, each result is inside an HTML list <li> tag.
If you expand the <li> tag, you will see this HTML code:
<li class="result-row" data-pid="6112478644">
    <a href="/brk/egr/6112478644.html" class="result-image gallery empty"></a>
    <p class="result-info">
        <span class="icon icon-star" role="button">
            <span class="screen-reader-text">favorite this post</span>
        </span>
        <time class="result-date" datetime="2017-05-01 12:35" title="Mon 01 May 12:35:41 PM">May 1</time>
        <a href="/brk/egr/6112478644.html" data-id="6112478644" class="result-title hdrlnk">Project Architect</a>
        <span class="result-meta">
            <span class="result-hood"> (Brooklyn)</span>
            <span class="result-tags">
                <span class="maptag" data-pid="6112478644">map</span>
            </span>
            <span class="banish icon icon-trash" role="button">
                <span class="screen-reader-text">hide this posting</span>
            </span>
            <span class="unbanish icon icon-trash red" role="button" aria-hidden="true"></span>
            <a href="#" class="restore-link">
                <span class="restore-narrow-text">restore</span>
                <span class="restore-wide-text">restore this posting</span>
            </a>
        </span>
    </p>
</li>
As you can see, the <li> tag includes all the information you need, so you can consider it your "wrapper". Actually, you can even start from the <p> tag, which includes the same information. The <li> is distinguished by the class "result-row" while the <p> is distinguished by the class "result-info", so each of them is unique and can be easily targeted by your XPath expression.
Note: If you are using the same spider from the basic Scrapy spider above, either delete any code under the parse() function and start over, or copy the file into the same "spiders" folder, change name = "jobs" in the basic spider to something else such as name = "jobs-titles", and keep name = "jobs" in the new one.
Extracting All Wrappers
As we agreed, you first need to scrape all the wrappers from the page. So under the parse() function, write the following:
jobs = response.xpath('//p[@class="result-info"]')
Note that here you will not use extract() because it is the wrapper from which you will extract other HTML nodes.
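You can verify the wrapper selection in the scrapy shell; the count below is just an example and depends on the page:

>>> jobs = response.xpath('//p[@class="result-info"]')
>>> len(jobs)        # one selector per job posting on the page
120
>>> jobs[0].xpath('a/text()').extract_first()   # you extract from a wrapper, not from response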
Extracting Job Titles
You can extract the job titles from the wrappers using a for loop as follows:
for job in jobs:
    title = job.xpath('a/text()').extract_first()
    yield {'Title': title}
The first observation is that in the for loop, you do not use "response" (which you already used to extract the wrappers). Instead, you use the wrapper selector, which is referred to as "job".
Also, as you can see, we started the XPath expression of "jobs" with //, meaning it searches from <html> down to each <p> whose class name is "result-info".
However, we started the XPath expression of "title" without any slashes, because it complements, or depends on, the XPath expression of the job wrapper. If you would rather use slashes, you have to precede them with a dot to refer to the current node, as follows:
title = job.xpath('.//a/text()').extract_first()
As we explained in the first part of this Scrapy tutorial, a refers to the first <a> tag inside the <p> tag, and text() refers to the text inside the <a> tag which is the job title.
Here, we are using extract_first() because in each iteration of the loop, we are in a wrapper with only one job.
What is the difference between the above code and what we had in the first part of the tutorial? So far, this will give the same result. However, as you want to extract and yield more than one element from the wrapper, this approach is more straightforward because now, you can extract other elements like the address and URL of each job.
Extracting Job Addresses and URLs
You can extract the job address and URL from the wrappers using the same for loop as follows:
for job in jobs:
    title = job.xpath('a/text()').extract_first()
    address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]
    relative_url = job.xpath('a/@href').extract_first()
    absolute_url = response.urljoin(relative_url)
    yield {'URL': absolute_url, 'Title': title, 'Address': address}
To extract the job address, you refer to the <span> tag whose class name is "result-meta", then the <span> tag whose class name is "result-hood", and then the text() in it. The address is between brackets, like (Brooklyn); if you want to delete them, you can use string slicing [2:-1]. However, this string slicing will not work if there is no address (which is the case for some jobs), because the value will be None, which is not a string! So you have to add empty quotes inside extract_first(""), which means that if there is no match, the result is an empty string "" rather than None.
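Here is a quick illustration of what the slicing does, using plain Python strings:

>>> " (Brooklyn)"[2:-1]      # drop the leading " (" and the trailing ")"
'Brooklyn'
>>> ""[2:-1]                 # extract_first("") returns "" when there is no address
''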
To extract the job URL, you refer to the <a> tag and the value of the href attribute, which is the URL. Yes, @ means an attribute.
However, this is a relative URL, which looks like /brk/egr/6112478644.html. To get the absolute URL so you can use it later, you can either use Python string concatenation as follows:
absolute_url = "https://newyork.craigslist.org" + relative_url
Note: Concatenation is another case in which you need to add quotes to extract_first("") because concatenation works only on strings and cannot work on None values.
Otherwise, you can simply use the urljoin() method, which builds a full absolute URL:
absolute_url = response.urljoin(relative_url)
Finally, yield your data using a dictionary:
yield {'URL': absolute_url, 'Title': title, 'Address': address}
Running the Spider and Storing Data
Just as we did in the first part of this Scrapy tutorial, you can use this command to run your spider and store the scraped data to a CSV file.
$ scrapy crawl jobs -o result-jobs-one-page.csv
Craigslist Scrapy Spider #3 – Multiple Pages
To do the same for all the result pages of Craigslist’s Architecture & Engineering jobs, you need to extract the “next” URLs and then apply the same parse() function on them.
Extracting Next URLs
To extract the "next" URLs, right-click the one on the first page and "Inspect" it. Here is what the HTML code looks like:
<a href="/search/egr?s=120" class="button next" title="next page">next > </a>
So here is the code you need. Be careful: this code must go outside the for loop.
relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
absolute_next_url = response.urljoin(relative_next_url)

yield Request(absolute_next_url, callback=self.parse)
Note that the next URL is in an <a> tag whose class name is "button next". You need to extract the value of its href attribute, which is why you use @href.
Just as you did in the previous part of this Scrapy tutorial, you extract the absolute next URL using the urljoin() method.
Finally, yield a Request() with absolute_next_url. A request requires a callback, i.e. a function to apply to this URL; in this case, it is the same parse() function, which extracts the titles, addresses and URLs of the jobs on each page. Note that when parse() is the callback, you can delete callback=self.parse, because parse() is the default callback even if you do not state it explicitly; see the equivalent line below.
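In other words, this shorter form (shown only as a sketch) behaves the same as the line above:

yield Request(absolute_next_url)   # parse() is used as the callback by default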
Of course, you must import Request before using it:
from scrapy import Request
Running the Spider and Storing Data
Again, just as we did in the first and second parts of this Scrapy tutorial, you can run your spider and save the scraped data to a CSV file using this command:
$ scrapy crawl jobs -o result-jobs-multi-pages.csv
Craigslist Scrapy Spider #4 – Job Descriptions
In this spider, we will open the URL of each job and scrape its description. The code we wrote so far will change, so if you like, you can copy the file into the same "spiders" folder, change the spider name of the previous one to something like name = "jobsall", and keep name = "jobs" in the new file.
Passing the Data to a Second Function
In the parse() function, we had the following yield in the for loop:
yield {'URL': absolute_url, 'Title': title, 'Address': address}
We yielded the data at that stage because we were scraping the details of each job from a single page. Now, however, you need to create a new function, for example parse_page(), to open the URL of each job and scrape the job description from the job's dedicated page. So you have to pass the URL of the job from the parse() function to the parse_page() function as a callback, using the Request() method as follows:
yield Request(absolute_url, callback=self.parse_page)
What about the data you have already extracted? You need to pass the values of the title and address from the parse() function to the parse_page() function as well, using meta with a dictionary as follows:
yield Request(absolute_url, callback=self.parse_page,
              meta={'URL': absolute_url, 'Title': title, 'Address': address})
Function to Extract Job Descriptions
So the parse_page() function will be as follows:
def parse_page(self, response):
    url = response.meta.get('URL')
    title = response.meta.get('Title')
    address = response.meta.get('Address')

    description = "".join(line for line in response.xpath('//*[@id="postingbody"]/text()').extract())

    yield {'URL': url, 'Title': title, 'Address': address, 'Description': description}
Here, you deal with meta as a dictionary and assign each value to a new variable. To get the value of each key in the dictionary, you can use Python's dictionary get() method as usual.
Alternatively, you can go the other way around: add the new "Description" value to the meta dictionary and then simply yield response.meta to return all the data, as follows:
def parse_page(self, response):
    description = "".join(line for line in response.xpath('//*[@id="postingbody"]/text()').extract())
    response.meta['Description'] = description
    yield response.meta
Note that a job description might be in more than one paragraph; so we used join() to merge them.
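For illustration, here is how join() behaves on a couple of hypothetical text nodes:

>>> parts = ['The role requires 5+ years of experience. ', 'Send your portfolio with your application.']   # hypothetical paragraphs
>>> "".join(parts)
'The role requires 5+ years of experience. Send your portfolio with your application.'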
Now, we can also extract “compensation” and “employment type”. They are in a <p> tag whose class name is “attrgroup” and then each of them is in a <span> tag, but with no class or id to distinguish them.
<p class="attrgroup">
    <span>compensation: <b>To 110K, DOE</b></span><br>
    <span>employment type: <b>full-time</b></span><br>
</p>
So we can use span[1] and span[2] to refer to the first <span> tag of “compensation” and the second <span> tag of “employment type” respectively.
compensation = response.xpath('//p[@class="attrgroup"]/span[1]/b/text()').extract_first()
employment_type = response.xpath('//p[@class="attrgroup"]/span[2]/b/text()').extract_first()
Alternatively, you can achieve the same thing with list indexing. In this case, you use extract() instead of extract_first() and follow the expression with [0] or [1], which can be placed directly after the expression (as shown below) or even after extract(). Note that in the previous approach, XPath counts the tags from [1], while here we count from [0], just like any Python list.
compensation = response.xpath('//p[@class="attrgroup"]/span/b/text()')[0].extract()
employment_type = response.xpath('//p[@class="attrgroup"]/span/b/text()')[1].extract()
Storing Scrapy Output Data to CSV, XML or JSON
You can run the spider and save the scraped data to either CSV, XML or JSON as follows:
$ scrapy crawl jobs -o result-jobs-multi-pages-content.csv
$ scrapy crawl jobs -o result-jobs-multi-pages-content.xml
$ scrapy crawl jobs -o result-jobs-multi-pages-content.json
Editing Scrapy settings.py
In the same Scrapy project folder, you can find a file called settings.py, which includes several options you can use. Let's explain a couple of regularly used ones. Note that sometimes the option is already in the file and you just have to uncomment it, and sometimes you have to add it yourself.
Arranging Scrapy Output CSV Columns
As you might have noticed by now, Scrapy does not arrange the CSV columns in the same order as your dictionary.
Open the file settings.py found in the Scrapy project folder and add this line; it will give you the order you need:
FEED_EXPORT_FIELDS = ['Title', 'URL', 'Address', 'Compensation', 'Employment Type', 'Description']
Note: If you run the CSV-generating command again, make sure you first delete the existing CSV file, or the new output will be appended to the end of the current one.
Setting a User Agent
If you are using a regular browser to navigate a website, your web browser sends what is known as a "User Agent" with every page you access. So it is recommended to set the Scrapy USER_AGENT option while web scraping to make your requests look more natural. The option is already in Scrapy's settings.py, so you can enable it by deleting the # sign.
You can find lists of user agents online; select a recent one for Chrome or Firefox. You can even check your own user agent; however, it is not required to use the same one in Scrapy.
Here is an example:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:40.0) Gecko/20100101 Firefox/40.1'
Setting Scrapy DOWNLOAD_DELAY
The option DOWNLOAD_DELAY is already there in Scrapy settings.py so you can just enable it by deleting the # sign. According to Scrapy documentation, “this can be used to throttle the crawling speed to avoid hitting servers too hard.”
You can see that the suggested value is 3 seconds; you can make it shorter or longer. Also note that another option, RANDOMIZE_DOWNLOAD_DELAY, is enabled by default; it multiplies DOWNLOAD_DELAY by a random factor between 0.5 and 1.5, so the actual wait between requests varies.
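Putting the two options together in settings.py would look roughly like this (a sketch; tune the number to the site you are scraping):

DOWNLOAD_DELAY = 3
# RANDOMIZE_DOWNLOAD_DELAY is True by default; it multiplies DOWNLOAD_DELAY
# by a random factor between 0.5 and 1.5, so the actual wait varies per request.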
Final Scrapy Tutorial Spider Code
So the whole code of this Scrapy tutorial is as follows. Try it yourself; if you have questions, feel free to send a comment.
import scrapy
from scrapy import Request


class JobsSpider(scrapy.Spider):
    name = "jobs"
    allowed_domains = ["craigslist.org"]
    start_urls = ["https://newyork.craigslist.org/search/egr"]

    def parse(self, response):
        jobs = response.xpath('//p[@class="result-info"]')

        for job in jobs:
            relative_url = job.xpath('a/@href').extract_first()
            absolute_url = response.urljoin(relative_url)
            title = job.xpath('a/text()').extract_first()
            address = job.xpath('span[@class="result-meta"]/span[@class="result-hood"]/text()').extract_first("")[2:-1]

            yield Request(absolute_url, callback=self.parse_page,
                          meta={'URL': absolute_url, 'Title': title, 'Address': address})

        relative_next_url = response.xpath('//a[@class="button next"]/@href').extract_first()
        absolute_next_url = "https://newyork.craigslist.org" + relative_next_url

        yield Request(absolute_next_url, callback=self.parse)

    def parse_page(self, response):
        url = response.meta.get('URL')
        title = response.meta.get('Title')
        address = response.meta.get('Address')

        description = "".join(line for line in response.xpath('//*[@id="postingbody"]/text()').extract())

        compensation = response.xpath('//p[@class="attrgroup"]/span[1]/b/text()').extract_first()
        employment_type = response.xpath('//p[@class="attrgroup"]/span[2]/b/text()').extract_first()

        yield {'URL': url, 'Title': title, 'Address': address, 'Description': description,
               'Compensation': compensation, 'Employment Type': employment_type}
Craigslist Scrapy Tutorial on GitHub
You can also find all the spiders we explained in this Python Scrapy tutorial on GitHub (Craigslist Scraper).
Scrapy Comprehensive Course
This tutorial is part of our comprehensive online course, Scrapy, Powerful Web Scraping & Crawling with Python – get 90% OFF using this coupon.
Hello! Firstly, thank you so, so much for this. I’m a grad student in Economics and I want to research discriminatory signals in job ads that require the applicant to send a photo. I have very little programming experience, but I am able to follow these directions to accomplish pretty much everything I need, except one thing: I am not able to capture the body of the post (in my case, the description of the job that the hiring body has written) in the .csv file. I’ve tried a few different arrangements, but I can’t get it to work. The date, url, and title are all saving perfectly, but the column titled Description remains unpopulated. Have you ever had this problem before? Any help is appreciated!
I just left a comment about not being able to see the description but scratch that! I re-installed it and now it’s populating. THANK you again very much! I’ll credit this tutorial in my paper.
Super awesome tutorial man! Thanks a lot 🙂
You are very welcome! 🙂
Hey, everything else is working okay, but the CSV file that is being created doesn't contain anything using this code.
Hi Nik! Do you receive any error messages? Do you have the latest version of Scrapy?
Another question: Do you get the data in the Terminal? If not, are you able to open the website from a regular browser? If not, maybe you are blocked and you have to wait until you get unblocked or change your IP. Please let me know.
Does this code work in December 2017? I have tried using it but I am getting blank csv file
Hi Nik! Yes, I have tested it today just now, and it works fine. (See my reply to your other comment.)
Hey,
Thank for the speedy reply. I installed everything today so it most likely is. However, I am getting this warning “You do not have a working installation of the service_identity module: ‘cannot import name opentype’. Please install it from ”
The above error is displayed when I enter the command “scrapy genspider jobs https://newyork.craigslist.org/search/egr”
Great Tutorial by the way
Hi Nik! So please install it:
pip install service_identity
I can follow through till extracting the titles, but on storing in csv, it is showing some random chinese input. I have even tried running your github code. Do you have any idea why it may be?
Hi Hitesh! Does the website originally have a non-Latin language or accented characters? Feel free to share the website link to have a look (I will not publish it.)
Hi,
Just wanted to thank you for this excellent tutorial! It helped me to get a decent understanding of the scrapy basics. So thanks a lot!
Cheers,
Kerem
You are very welcome, Kerem! All the best!
Enjoyed reading through this, very good stuff, thankyou .
You are very welcome! Thanks!
Hi
For the third example this the code I have: https://www.pastiebin.com/5a95360ab840c
But its only fetching 120 results from the first page. It is not fetching results from the next page.
I’ve solved the issue.
This works.
https://www.pastiebin.com/5a95401e9ea67
Can scrapy extract Phone numbers from each ad?
Hi Anne! I see some ads have “show contact info”; by clicking it, it shows a phone number. If this is what you mean, you need Selenium to click the button; then you can extract the phone number. Try and let me know if you have more questions.
Hi,
I start to learn how to use Scrapy to crawl web content and with this tutorial, when I try to crawl at the first step (just the title) I got an error called: “twisted.internet.error.DNSLookupError”
and I open “https://newyork.craigslist.org/search/egr/” fine with browsers (both Firefox and Chrome)
I’ve searched for a while but nowhere specifies this problem. Please help me with this. Thanks in advances
Hello Hai! What happens when you type the following in your Terminal?
scrapy shell "https://newyork.craigslist.org/search/egr/"
Can you tell me how to scrape the email address along with the titles?
Nikeshh, you need Selenium to click the “Reply” button to be able to extract the email address from the box that will appear.
When I try to mine the titles on the next page, I get this error:
ERROR: Spider must return Request, BaseItem, dict or None, got ‘set’ in >
For which I only receive the first 121 titles of the first page, not including the second page.
how I solve this.
thanks
Antonio, could you please show your code.
Hello all
Could you give me an advice, please?
I did everything like the tutorial above says.
But after a command:
“scrapy crawl jobs” in PowerShell
I got this troubleshot message with no outcome.
Thanks a lot for each tip.
John
PS D:\plocha2\pylab\scrapy\ntk> scrapy crawl jobs
2018-07-31 12:56:00 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: ntk)
2018-07-31 12:56:00 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 16:07:46) [MSC v.1900 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0h 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-07-31 12:56:00 [scrapy.crawler] INFO: Overridden settings: {‘BOT_NAME’: ‘ntk’, ‘NEWSPIDER_MODULE’: ‘ntk.spiders’, ‘ROBOTSTXT_OBEY’: True, ‘SPIDER_MODULES’: [‘ntk.spiders’]}
2018-07-31 12:56:00 [scrapy.middleware] INFO: Enabled extensions:
[‘scrapy.extensions.corestats.CoreStats’,
‘scrapy.extensions.telnet.TelnetConsole’,
‘scrapy.extensions.logstats.LogStats’]
Unhandled error in Deferred:
2018-07-31 12:56:00 [twisted] CRITICAL: Unhandled error in Deferred:
2018-07-31 12:56:00 [twisted] CRITICAL:
Traceback (most recent call last):
File “d:\install\python\lib\site-packages\twisted\internet\defer.py”, line 1418, in _inlineCallbacks
result = g.send(result)
File “d:\install\python\lib\site-packages\scrapy\crawler.py”, line 80, in crawl
self.engine = self._create_engine()
File “d:\install\python\lib\site-packages\scrapy\crawler.py”, line 105, in _create_engine
return ExecutionEngine(self, lambda _: self.stop())
File “d:\install\python\lib\site-packages\scrapy\core\engine.py”, line 69, in __init__
self.downloader = downloader_cls(crawler)
File “d:\install\python\lib\site-packages\scrapy\core\downloader\__init__.py”, line 88, in __init__
self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
File “d:\install\python\lib\site-packages\scrapy\middleware.py”, line 58, in from_crawler
return cls.from_settings(crawler.settings, crawler)
File “d:\install\python\lib\site-packages\scrapy\middleware.py”, line 34, in from_settings
mwcls = load_object(clspath)
File “d:\install\python\lib\site-packages\scrapy\utils\misc.py”, line 44, in load_object
mod = import_module(module)
File “d:\install\python\lib\importlib\__init__.py”, line 126, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File “”, line 994, in _gcd_import
File “”, line 971, in _find_and_load
File “”, line 955, in _find_and_load_unlocked
File “”, line 665, in _load_unlocked
File “”, line 678, in exec_module
File “”, line 219, in _call_with_frames_removed
File “d:\install\python\lib\site-packages\scrapy\downloadermiddlewares\retry.py”, line 20, in
from twisted.web.client import ResponseFailed
File “d:\install\python\lib\site-packages\twisted\web\client.py”, line 41, in
from twisted.internet.endpoints import HostnameEndpoint, wrapClientTLS
File “d:\install\python\lib\site-packages\twisted\internet\endpoints.py”, line 41, in
from twisted.internet.stdio import StandardIO, PipeAddress
File “d:\install\python\lib\site-packages\twisted\internet\stdio.py”, line 30, in
from twisted.internet import _win32stdio
File “d:\install\python\lib\site-packages\twisted\internet\_win32stdio.py”, line 9, in
import win32api
ModuleNotFoundError: No module named ‘win32api’
PS D:\plocha2\pylab\scrapy\ntk>
John, to solve the error "No module named win32api", you have to install it like this:
pip install pypiwin32
Good tutorial however, I want to know if scrapy can actually go down a second level from the start url. Say for example I want to get the email for replying to a job ad which craiglists has intentionally blocked from non-residents of a particular geolocation. Thanks in advance.
Shexton, in this case, you will need to use a proxy IP from that location.
when i am trying to scrap the data from wikipedia, it is showing an error “10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond”.How can i resolve this issue?
Hi Indhu! Maybe you got blocked; try to use time.sleep() to add some time between requests. If this does not help, you might need to add proxies.
Can you help me with scraping AJAX pages
Hi Harry! You can use Selenium alone or with Scrapy. Check this Selenium tutorial.