There are several web scraping best practices you should follow. Among them are critical questions to ask yourself before you start.
1. Is there an API?
Before web scraping, check whether the website you want data from offers an API. Many large websites provide APIs to make data extraction a better experience for both parties. So first search Google for an API for the website; if you find one that covers the data you need, you do not need to scrape the site. APIs typically return JSON objects, which map closely to Python dictionaries and can be parsed with Python's built-in json module. Read this tutorial to learn how to extract data from APIs.
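As a minimal sketch of the JSON-to-dictionary step, the example below parses a response body with the json module. The body, the "products" key, and the field names are hypothetical; a real API defines its own structure, so check its documentation.

```python
import json

# Hypothetical response body, used here instead of a live API call.
SAMPLE_BODY = (
    '{"products": ['
    '{"name": "Widget", "price": 9.99}, '
    '{"name": "Gadget", "price": 24.5}'
    ']}'
)


def extract_names(body):
    """Parse a JSON string and return the "name" of each product."""
    data = json.loads(body)  # JSON object -> Python dict
    # .get() avoids a KeyError if the (assumed) "products" key is missing.
    return [item["name"] for item in data.get("products", [])]


print(extract_names(SAMPLE_BODY))
```

With a real API, the string passed to extract_names() would be the body of the HTTP response rather than a hard-coded sample.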
2. Is it legal?
Before web scraping, read the Terms and Conditions of the website. Some websites explicitly prohibit web scraping without permission, or spell out legal or copyright restrictions on the use of their data.
Use common sense! Web scraping and other bot activity can clearly be illegal if it causes direct or indirect damage to the company that owns the data or to its customers. It is a good idea to discuss the purpose of a web scraping project with your client before accepting it.
3. Is it harmful?
Before web scraping, prepare your code to be “polite”: do not unnecessarily ignore the website’s robots.txt rules; space out your requests so that you do not hammer the site’s server; and, where possible, run your spiders during the website’s off-peak traffic hours.
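One way to respect robots.txt outside of a framework is Python's standard urllib.robotparser module. The sketch below parses a made-up robots.txt from a string so it runs without network access; the rules, the "my-bot" user agent, and the example.com URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at <site>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Ask before fetching: is this user agent allowed to request the URL?
print(parser.can_fetch("my-bot", "https://example.com/private/page"))
print(parser.can_fetch("my-bot", "https://example.com/public/page"))

# Crawl-delay, if the site declares one, suggests how long to wait.
print(parser.crawl_delay("my-bot"))
```

Against a live site, you would instead call set_url() with the robots.txt address and then read() to download it before calling can_fetch().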
If you are using Scrapy, consider setting the DOWNLOAD_DELAY option in the settings.py file. Otherwise, you can pause between requests with Python's time.sleep() function, combined with random.randrange() to vary the length of each pause.
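For code that does not use Scrapy, the pause can be sketched as below. The function name, the default bounds, and the fetch callable are hypothetical; pick delays that suit the site you are working with.

```python
import random
import time


def crawl_politely(urls, fetch, min_seconds=1, max_seconds=4):
    """Call fetch(url) for each URL, sleeping a random time in between.

    fetch is any callable you supply (hypothetical here), for example a
    function that downloads and returns the page.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # randrange returns an integer in [min_seconds, max_seconds).
            time.sleep(random.randrange(min_seconds, max_seconds))
        results.append(fetch(url))
    return results
```

Varying the delay spreads requests out less predictably than a fixed pause, while the lower bound still guarantees a minimum gap between requests.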
Do you have questions? Please comment below.