News plays an essential role in our daily lives. Whether you want to build your own news website or run a data analysis project, you often need to fetch news articles or headlines from different sources, either to aggregate them in one place or to analyze them. The applications are many, and fortunately there is a way to retrieve news articles from the web, from many different sources at the same time.
In this tutorial you will learn how to extract news headlines and articles using the News API and save them to a CSV file.
Prerequisites
Get API key
Firstly, get an API key by registering on the News API web page. Be sure to register as an individual; this grants free usage of the API. Click here and fill out the form. The API key should arrive in your inbox shortly.
Install Libraries
Use the Requests Python library to interact with the API. To install Requests, run this command from the Terminal/Command Prompt:

```shell
pip install requests
```
Apart from Requests, use the Pandas library to save the articles to a CSV file. To install Pandas, run this command from the Terminal/Command Prompt:

```shell
pip install pandas
```
News API Project
Import Libraries
Be sure to import the following libraries in your code:
```python
import requests
import json
import pandas
```
API Key
Note that in order to interact with the API, it is mandatory to provide the API key.
There are two ways to do that –
1. Provide the API key in the request URL (e.g. https://newsapi.org/v2/everything?q=amazon&apiKey=YOUR_API_KEY)
2. Provide the API key as a header while making the request (we will use this approach here).
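To illustrate the first approach, here is a sketch of how the key ends up inside the request URL, built with `urlencode` from Python's standard library (the key is a placeholder). Because the key becomes part of the URL, it can leak into logs and browser history, which is why the header approach is preferred.

```python
from urllib.parse import urlencode

# Placeholder key for illustration -- a real key comes from your News API account
params = {'q': 'amazon', 'apiKey': 'YOUR_API_KEY'}
url = 'https://newsapi.org/v2/everything?' + urlencode(params)
print(url)  # https://newsapi.org/v2/everything?q=amazon&apiKey=YOUR_API_KEY
```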
Headers
It is not good practice to include authentication-related information in the request URL. Include it in the request headers instead. Create a dictionary to hold the header parameters.
```python
# Replace YOUR_API_KEY with the key you received by email
headers = {'Authorization': 'YOUR_API_KEY'}
```
API Endpoints
After setting the header information, create variables to hold the API endpoints. Something like this –
```python
top_headlines_url = 'https://newsapi.org/v2/top-headlines'
everything_news_url = 'https://newsapi.org/v2/everything'
sources_url = 'https://newsapi.org/v2/sources'
```
Payloads
The next step is to create the payloads to be sent to the API. A payload is simply extra information that accompanies the request, such as news category, country, or language. Create dictionaries to hold the payload information:
```python
headlines_payload = {'category': 'business', 'country': 'us'}
everything_payload = {'q': 'finance', 'language': 'en', 'sortBy': 'popularity'}
sources_payload = {'category': 'general', 'language': 'en', 'country': 'us'}
```
In everything_payload, the parameter 'q' holds the keyword to search for; news articles matching it are returned in the response. The 'sortBy' parameter controls how the returned articles are sorted. More information about the request parameters for the different endpoints can be found here.
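The endpoints accept more parameters than the ones shown above. As a sketch, a richer everything payload might look like this; the parameter names follow the News API documentation, but verify them against the current docs before relying on them:

```python
# Sketch of a richer payload for the /v2/everything endpoint;
# parameter names are taken from the News API docs -- verify before use
everything_payload = {
    'q': 'finance',           # keyword to search for
    'language': 'en',
    'sortBy': 'popularity',   # the docs also list 'relevancy' and 'publishedAt'
    'pageSize': 20,           # number of results per page
    'page': 1,                # which page of results to return
}
```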
Requests
Now make a request to the API.
To get the top headlines:
```python
response = requests.get(url=top_headlines_url, headers=headers, params=headlines_payload)
```
Make the request using the get() method of the requests library. In the 'url' parameter, specify the API endpoint to hit. In the 'headers' parameter, pass the dictionary that contains the header information. Pass the payload dictionary to the 'params' parameter. Collect the response in a variable; it contains the status code as well as the response body.
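Before parsing the body, it is worth checking the status code. Here is a minimal helper as a sketch; the mappings use standard HTTP semantics (News API documents 401 for bad keys and 429 for rate limiting, but verify the exact codes against the docs):

```python
def describe_status(status_code):
    """Map an HTTP status code from the API to a short human-readable note."""
    notes = {
        200: 'ok',
        400: 'bad request - check the payload parameters',
        401: 'unauthorized - check the API key in the headers',
        429: 'too many requests - rate limit reached',
    }
    return notes.get(status_code, f'unexpected status {status_code}')

# Usage, assuming a `response` from requests.get as above:
# print(describe_status(response.status_code))
# Alternatively, response.raise_for_status() raises on any 4xx/5xx code.
```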
To get the news articles:
```python
response = requests.get(url=everything_news_url, headers=headers, params=everything_payload)
```
The structure of the request remains the same. The ‘headers’ parameter will remain the same throughout. Update the ‘url’ and the ‘params’ parameters. The news articles are returned based on the request parameters.
Similarly, you can retrieve the available news sources, and later use one of these sources to obtain news from that particular source only.
```python
response = requests.get(url=sources_url, headers=headers, params=sources_payload)
```
This request returns the news sources available to the API. Start off by making any of the requests mentioned earlier.
Response
To print the response on your console –
```python
pretty_json_output = json.dumps(response.json(), indent=4)
print(pretty_json_output)
```
Note the second parameter of json.dumps(). It specifies the indentation of the JSON output. If this value is not specified, the JSON response is printed on a single line, which is hard to read. To view the JSON response in a human-readable form, provide this parameter.
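Since the exact output depends on the news of the day, here is a hand-made sample with the same top-level shape as a News API response ('status', 'totalResults', 'articles'), pretty-printed the same way. The field values are purely illustrative; real articles carry more fields.

```python
import json

# Illustrative sample only -- real responses contain many more fields per article
sample_response = {
    'status': 'ok',
    'totalResults': 1,
    'articles': [
        {'title': 'Markets rally on earnings', 'url': 'https://example.com/rally'},
    ],
}
print(json.dumps(sample_response, indent=4))
```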
Save Response to CSV
More often than not, the JSON response needs to be saved in a CSV for further processing. Save all the meaningful information out of the JSON response in a CSV. To do so, follow these steps:
Convert the response to a pure JSON string format.
```python
response_json_string = json.dumps(response.json())
```
Load the JSON response in a Python dictionary for further processing. A JSON object is equivalent to a dictionary in Python.
```python
response_dict = json.loads(response_json_string)
```
The response contains different objects. Of these, only the article-related information is relevant to us. In the response, a JSON array called 'articles' contains this information. A JSON array is equivalent to a list in Python. Extract the 'articles' array from the response into a variable.
```python
articles_list = response_dict['articles']
```
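Each element of articles_list is a plain dictionary, so the usual dict operations apply. A quick sketch using a hand-made list (real articles carry more fields, such as 'author', 'description', and 'publishedAt'):

```python
# Hand-made sample; in the tutorial this list comes from response_dict['articles']
articles_list = [
    {'title': 'Markets rally', 'url': 'https://example.com/rally'},
    {'title': 'Tech earnings beat', 'url': 'https://example.com/tech'},
]
for article in articles_list:
    print(article['title'], '->', article['url'])
```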
For more information on the response values click here.
Next, convert articles_list to a JSON string, convert that JSON string to a data frame, and write the data frame to a CSV. The data frame is a data structure that is part of the Pandas library; it can hold almost any sort of tabular data.
```python
df = pandas.read_json(json.dumps(articles_list))
```
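The dumps/read_json round trip works, but since articles_list is already a list of dictionaries, Pandas can also build the data frame from it directly. A sketch with a hand-made list; the index=False argument keeps the numeric row index out of the CSV:

```python
import pandas

# Hand-made sample; in the tutorial this list comes from response_dict['articles']
articles_list = [
    {'title': 'Markets rally', 'url': 'https://example.com/rally'},
    {'title': 'Tech earnings beat', 'url': 'https://example.com/tech'},
]
df = pandas.DataFrame(articles_list)
df.to_csv('news.csv', index=False)  # index=False drops the row index column
```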
Next, write the dataframe to a csv.
```python
df.to_csv('/Users/appleapple/Desktop/news.csv')
```
Complete Project Code
```python
import requests
import json
import pandas

# The headers remain the same for all the requests
headers = {'Authorization': 'YOUR_API_KEY'}

# All the endpoints in this section
# To fetch the top headlines
top_headlines_url = 'https://newsapi.org/v2/top-headlines'
# To fetch news articles
everything_news_url = 'https://newsapi.org/v2/everything'
# To retrieve the sources
sources_url = 'https://newsapi.org/v2/sources'

# Add parameters to the request URL based on what type of news you want
# All the payloads in this section
headlines_payload = {'category': 'business', 'country': 'us'}
everything_payload = {'q': 'finance', 'language': 'en', 'sortBy': 'popularity'}
sources_payload = {'category': 'general', 'language': 'en', 'country': 'us'}

# Fire a request based on the requirement; just change the url and the params field
# Request to fetch the top headlines
# response = requests.get(url=top_headlines_url, headers=headers, params=headlines_payload)
# Request to fetch every news article
response = requests.get(url=everything_news_url, headers=headers, params=everything_payload)
# Request to fetch the sources
# response = requests.get(url=sources_url, headers=headers, params=sources_payload)

# If you just want to print
pretty_json_output = json.dumps(response.json(), indent=4)
print(pretty_json_output)
# print(response.json())

# To store the relevant json data to a csv
# Convert response to a pure json string
response_json_string = json.dumps(response.json())

# A json object is equivalent to a dictionary in Python
# Load the json string into a python dict
response_dict = json.loads(response_json_string)
print(response_dict)

# Info about articles is represented as an array in the json response
# A json array is equivalent to a list in python
# We want info only about articles
articles_list = response_dict['articles']
# We want info only about sources
# sources_list = response_dict['sources']
# And then you can specify one of these sources explicitly
# if you like while fetching the news

# Convert articles list to json string, convert json string to dataframe, write df to csv!
df = pandas.read_json(json.dumps(articles_list))
# Convert sources list to json string, convert json string to dataframe, write df to csv!
# df = pandas.read_json(json.dumps(sources_list))

# Using Pandas, write the json data to a csv
df.to_csv('/Users/appleapple/Desktop/news.csv')
```
Course: REST API: Data Extraction with Python
Working with APIs is a skill requested for many jobs. Why?
APIs are the official way to extract data from, and automate tasks on, big websites. If an API lets you extract the data you need from a website, you do not need regular web scraping.
Join our new course, REST APIs: Data Extraction and Automation with Python, for 90% OFF using this coupon:
Worked for Accenture as Software Engineer, and currently pursuing my Master’s degree in Data Science.