News plays an essential role in our daily lives. Whether you want to build your own news website or run a data analysis project, you often need to fetch news articles or headlines from different sources, either to aggregate them in one place or to analyze them. The applications are many, and fortunately there is a way to retrieve news articles from the web, from many different sources at the same time.
In this tutorial you will learn how to extract news headlines and articles using the News API and save them to a CSV file.
Prerequisites
Get API key
Firstly, get an API key by registering on the News API web page. Be sure to register as an individual; this grants free usage of the API. Click here and fill out the form. The API key should arrive in your inbox shortly.
Install Libraries
Use the Requests Python library to interact with the API. To install Requests, run this command from the Terminal/Command Prompt:

```shell
pip install requests
```
Apart from Requests, use the Pandas library to save the articles to a CSV file. To install Pandas, run this command from the Terminal/Command Prompt:

```shell
pip install pandas
```
News API Project
Import Libraries
Be sure to import the following libraries in your code:
```python
import requests
import json
import pandas
```
API Key
Note that in order to interact with the API, it is mandatory to provide the API key.
There are two ways to do that –
1. Provide the API key in the request URL (e.g. https://newsapi.org/v2/everything?q=amazon&apiKey=YOUR_API_KEY)
2. Provide the API key as a header while making the request (we will use this approach here).
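To illustrate the first approach, here is a sketch of how the key ends up inside the request URL, built with `urlencode` from Python's standard library (the key is a placeholder). Because the key becomes part of the URL, it can leak into logs and browser history, which is why the header approach is preferred.

```python
from urllib.parse import urlencode

# Placeholder key for illustration -- a real key comes from your News API account
params = {'q': 'amazon', 'apiKey': 'YOUR_API_KEY'}
url = 'https://newsapi.org/v2/everything?' + urlencode(params)
print(url)  # https://newsapi.org/v2/everything?q=amazon&apiKey=YOUR_API_KEY
```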
Headers
It is not good practice to include authentication-related information in the request URL. Include it in the request headers instead. Create a dictionary to hold the header parameters.
```python
# Replace YOUR_API_KEY with the key you received by email
headers = {'Authorization': 'YOUR_API_KEY'}
```
API Endpoints
After setting the header information, create variables to hold the API endpoints. Something like this –
```python
top_headlines_url = 'https://newsapi.org/v2/top-headlines'
everything_news_url = 'https://newsapi.org/v2/everything'
sources_url = 'https://newsapi.org/v2/sources'
```
Payloads
The next step is to create the payloads to be sent to the API. A payload is simply extra information that accompanies the request, such as news category, country, or language. Create dictionaries to hold the payload information:
```python
headlines_payload = {'category': 'business', 'country': 'us'}
everything_payload = {'q': 'finance', 'language': 'en', 'sortBy': 'popularity'}
sources_payload = {'category': 'general', 'language': 'en', 'country': 'us'}
```
In everything_payload, the parameter 'q' holds the keyword to search for; news articles matching it are returned in the response. The 'sortBy' parameter controls how the returned articles are sorted. More information about the request parameters for the different endpoints can be found here.
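The endpoints accept more parameters than the ones shown above. As a sketch, a richer everything payload might look like this; the parameter names follow the News API documentation, but verify them against the current docs before relying on them:

```python
# Sketch of a richer payload for the /v2/everything endpoint;
# parameter names are taken from the News API docs -- verify before use
everything_payload = {
    'q': 'finance',           # keyword to search for
    'language': 'en',
    'sortBy': 'popularity',   # the docs also list 'relevancy' and 'publishedAt'
    'pageSize': 20,           # number of results per page
    'page': 1,                # which page of results to return
}
```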
Requests
Now make a request to the API.
To get the top headlines:
```python
response = requests.get(url=top_headlines_url, headers=headers, params=headlines_payload)
```
Make the request using the get() method of the requests library. In the 'url' parameter, specify the API endpoint to hit. In the 'headers' parameter, pass the dictionary that contains the header information. Pass the payload dictionary to the 'params' parameter. Collect the response in a variable; it contains the status code as well as the response body.
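Before parsing the body, it is worth checking the status code. Here is a minimal helper as a sketch; the mappings use standard HTTP semantics (News API documents 401 for bad keys and 429 for rate limiting, but verify the exact codes against the docs):

```python
def describe_status(status_code):
    """Map an HTTP status code from the API to a short human-readable note."""
    notes = {
        200: 'ok',
        400: 'bad request - check the payload parameters',
        401: 'unauthorized - check the API key in the headers',
        429: 'too many requests - rate limit reached',
    }
    return notes.get(status_code, f'unexpected status {status_code}')

# Usage, assuming a `response` from requests.get as above:
# print(describe_status(response.status_code))
# Alternatively, response.raise_for_status() raises on any 4xx/5xx code.
```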
To get the news articles:
```python
response = requests.get(url=everything_news_url, headers=headers, params=everything_payload)
```
The structure of the request remains the same. The ‘headers’ parameter will remain the same throughout. Update the ‘url’ and the ‘params’ parameters. The news articles are returned based on the request parameters.
Similarly, you can retrieve the available news sources, and later use one of these sources to obtain news from that particular source only.
```python
response = requests.get(url=sources_url, headers=headers, params=sources_payload)
```
This request returns the news sources available to the API. Start off by making any of the requests mentioned earlier.
Response
To print the response on your console –
```python
pretty_json_output = json.dumps(response.json(), indent=4)
print(pretty_json_output)
```
Note the second parameter of json.dumps(). It specifies the indentation of the JSON output. If this value is not specified, the JSON response is printed on a single line, which is hard to read. To view the JSON response in a human-readable form, provide this parameter.
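Since the exact output depends on the news of the day, here is a hand-made sample with the same top-level shape as a News API response ('status', 'totalResults', 'articles'), pretty-printed the same way. The field values are purely illustrative; real articles carry more fields.

```python
import json

# Illustrative sample only -- real responses contain many more fields per article
sample_response = {
    'status': 'ok',
    'totalResults': 1,
    'articles': [
        {'title': 'Markets rally on earnings', 'url': 'https://example.com/rally'},
    ],
}
print(json.dumps(sample_response, indent=4))
```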
Save Response to CSV
More often than not, the JSON response needs to be saved in a CSV for further processing. Save all the meaningful information out of the JSON response in a CSV. To do so, follow these steps:
Convert the response to a pure JSON string format.
```python
response_json_string = json.dumps(response.json())
```
Load the JSON response in a Python dictionary for further processing. A JSON object is equivalent to a dictionary in Python.
```python
response_dict = json.loads(response_json_string)
```
The response contains different objects. Of these, only the article-related information is relevant to us. In the response, a JSON array called 'articles' contains this information. A JSON array is equivalent to a list in Python. Extract the 'articles' array from the response into a variable.
```python
articles_list = response_dict['articles']
```
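Each element of articles_list is a plain dictionary, so the usual dict operations apply. A quick sketch using a hand-made list (real articles carry more fields, such as 'author', 'description', and 'publishedAt'):

```python
# Hand-made sample; in the tutorial this list comes from response_dict['articles']
articles_list = [
    {'title': 'Markets rally', 'url': 'https://example.com/rally'},
    {'title': 'Tech earnings beat', 'url': 'https://example.com/tech'},
]
for article in articles_list:
    print(article['title'], '->', article['url'])
```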
For more information on the response values click here.
Next, convert articles_list to a JSON string, convert that JSON string to a data frame, and write the data frame to a CSV. The data frame is a data structure that is part of the Pandas library; it can hold almost any sort of tabular data.
```python
df = pandas.read_json(json.dumps(articles_list))
```
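The dumps/read_json round trip works, but since articles_list is already a list of dictionaries, Pandas can also build the data frame from it directly. A sketch with a hand-made list; the index=False argument keeps the numeric row index out of the CSV:

```python
import pandas

# Hand-made sample; in the tutorial this list comes from response_dict['articles']
articles_list = [
    {'title': 'Markets rally', 'url': 'https://example.com/rally'},
    {'title': 'Tech earnings beat', 'url': 'https://example.com/tech'},
]
df = pandas.DataFrame(articles_list)
df.to_csv('news.csv', index=False)  # index=False drops the row index column
```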
Next, write the dataframe to a csv.
```python
df.to_csv('/Users/appleapple/Desktop/news.csv')
```
Complete Project Code
```python
import requests
import json
import pandas

# The headers remain the same for all the requests
headers = {'Authorization': 'YOUR_API_KEY'}

# All the endpoints in this section
# To fetch the top headlines
top_headlines_url = 'https://newsapi.org/v2/top-headlines'
# To fetch news articles
everything_news_url = 'https://newsapi.org/v2/everything'
# To retrieve the sources
sources_url = 'https://newsapi.org/v2/sources'

# Add parameters to the request URL based on what type of news you want
# All the payloads in this section
headlines_payload = {'category': 'business', 'country': 'us'}
everything_payload = {'q': 'finance', 'language': 'en', 'sortBy': 'popularity'}
sources_payload = {'category': 'general', 'language': 'en', 'country': 'us'}

# Fire a request based on the requirement; just change the url and the params field
# Request to fetch the top headlines
# response = requests.get(url=top_headlines_url, headers=headers, params=headlines_payload)
# Request to fetch every news article
response = requests.get(url=everything_news_url, headers=headers, params=everything_payload)
# Request to fetch the sources
# response = requests.get(url=sources_url, headers=headers, params=sources_payload)

# If you just want to print
pretty_json_output = json.dumps(response.json(), indent=4)
print(pretty_json_output)
# print(response.json())

# To store the relevant json data to a csv
# Convert response to a pure json string
response_json_string = json.dumps(response.json())

# A json object is equivalent to a dictionary in Python
# Load the json string into a python dict
response_dict = json.loads(response_json_string)
print(response_dict)

# Info about articles is represented as an array in the json response
# A json array is equivalent to a list in python
# We want info only about articles
articles_list = response_dict['articles']
# We want info only about sources
# sources_list = response_dict['sources']
# And then you can specify one of these sources explicitly
# if you like while fetching the news

# Convert articles list to json string, convert json string to dataframe, write df to csv!
df = pandas.read_json(json.dumps(articles_list))
# Convert sources list to json string, convert json string to dataframe, write df to csv!
# df = pandas.read_json(json.dumps(sources_list))

# Using Pandas, write the json data to a csv
df.to_csv('/Users/appleapple/Desktop/news.csv')
```
Course: REST API: Data Extraction with Python
Working with APIs is a skill requested for many jobs. Why?
APIs are the official way to extract data from, and automate tasks on, big websites. If an API lets you extract the data you need from a website, you do not need regular web scraping.
Join our new course, REST APIs: Data Extraction and Automation with Python, for 90% OFF using this coupon:
Worked for Accenture as Software Engineer, and currently pursuing my Master’s degree in Data Science.