Beautiful Soup Tutorial #2: Extracting URLs

Now that the required libraries (BeautifulSoup, Requests, and lxml) are installed, let’s learn how to extract URLs.

I will start by talking informally, but you can find the formal terms in the comments of the code. The variable names can of course be anything you like; what matters is the workflow of the code.

So we have 5 variables:

url: The website or page you want to open.
response: Great! Your internet connection works, your URL is correct, and you are allowed to access this page. It is as if you could now see the web page in your browser.
data: It is like copying and pasting the text of the page, namely its source code, except that it goes into a variable in memory.
soup: You are asking BeautifulSoup to parse that text, that is, to build a data structure out of the page that makes it easy to navigate the HTML tags.
tags: You are now extracting specific tags, such as the tags for links, into a list so that you can loop over them later.

Extracting URLs is something you will be doing all the time in web scraping and crawling tasks. Why? Because you usually start from one page (e.g. a book list) and then open sub-pages (e.g. the page of each book) to scrape data from them.

Now, here is the code for this lesson. It extracts all the URLs from a web page.

from bs4 import BeautifulSoup
import requests

url = "http://www.htmlandcssbook.com/code-samples/chapter-04/example.html"

# Getting the webpage, creating a Response object.
response = requests.get(url)

# Extracting the source code of the page.
data = response.text

# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')

# Extracting all the <a> tags into a list.
tags = soup.find_all('a')

# Extracting URLs from the attribute href in the <a> tags.
for tag in tags:
    print(tag.get('href'))

Read the code carefully and try to run it. Also try changing the value of “url” to other web pages.
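
When you try other pages, you will probably notice two things: some <a> tags have no href at all (so tag.get('href') prints None), and many href values are relative, like /about. The following is a minimal sketch of one way to handle both; it is not part of the original lesson. It skips tags without an href and uses urljoin from Python's standard library to turn relative links into absolute ones. The function name absolute_links, the sample HTML, and the example.com base URL are made up for illustration, and 'html.parser' is used only so the sketch runs without lxml.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def absolute_links(html, base_url):
    """Return every href found in html as an absolute URL (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    # href=True keeps only the <a> tags that actually have an href attribute.
    for tag in soup.find_all('a', href=True):
        # urljoin leaves absolute URLs unchanged and resolves relative ones.
        links.append(urljoin(base_url, tag['href']))
    return links


# Made-up HTML standing in for response.text from the lesson code.
sample_html = '<a href="/about">About</a> <a>No link</a> <a href="http://example.com/x">X</a>'

for link in absolute_links(sample_html, 'http://example.com/dir/page.html'):
    print(link)
```

In a real script you would pass response.text as the first argument and the page's own url as the second.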

Then, move to Beautiful Soup Tutorial #3: Extracting URLs: Web Scraping Craigslist

Let me know if you have questions.

✅ ✅ ✅ If you want to learn more about web scraping, you can join this online video course:

Web Scraping with Python: BeautifulSoup, Requests & Selenium 👈

8 Replies to “Beautiful Soup Tutorial #2: Extracting URLs”

AS says:

August 20, 2018 at 7:40 pm

this is really useful, thank you.

How would you adapt this to also get the name of a link and output as a tab separated file?

e.g.

link1 name-of-link1
link2 name-of-link2

etc…

1. GoTrained says:
  
  December 27, 2018 at 1:06 pm
  
  Hello! We have created a course that covers the whole process from A to Z, including saving the scraped data to a CSV file. You can join it for FREE at: https://courses.gotrained.com/course/details/webscraping101
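
For the tab-separated output asked about in this thread, here is a minimal sketch. It is an illustration only and is not taken from the course mentioned above. It assumes the "name" of a link means the visible text inside the <a> tag. The function name links_to_tsv and the sample HTML are made up, and 'html.parser' is used only so the sketch runs without lxml.

```python
import csv
import io

from bs4 import BeautifulSoup


def links_to_tsv(html):
    """Return 'href<TAB>link text' lines for every <a> tag that has an href."""
    soup = BeautifulSoup(html, 'html.parser')
    buffer = io.StringIO()
    # The csv module handles the tab delimiter and any quoting that is needed.
    writer = csv.writer(buffer, delimiter='\t', lineterminator='\n')
    for tag in soup.find_all('a', href=True):
        writer.writerow([tag['href'], tag.get_text(strip=True)])
    return buffer.getvalue()


# Made-up HTML standing in for response.text from the lesson code.
sample_html = (
    '<p><a href="http://example.com/one">First link</a> '
    '<a href="http://example.com/two">Second link</a></p>'
)

print(links_to_tsv(sample_html), end='')
```

To save the result, you could write the returned string to a file opened with open('links.tsv', 'w', encoding='utf-8'); the file name is only an example.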
  
CS says:

March 14, 2019 at 6:31 pm

Hi! Thank you! What can we do if our URL points to a .json file containing a list of URLs?
Example .json file:
[
“http://…..”,
“http://……”,
“http://……”
]

1. GoTrained says:
  
  March 15, 2019 at 5:43 pm
  
  Do you have only one file, or will you receive such files again and again? If only one, you can copy its contents into a Python list and loop over it. If you must handle the JSON programmatically, use the json library to extract the list and then loop over it in the same way.
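
The json approach suggested in this reply could look roughly like the sketch below. It is an assumption about what was meant, not code from the author. It uses only the standard json module; the function name load_urls and the example.com URLs are made up, and the sample string stands in for the contents of the .json file.

```python
import json


def load_urls(json_text):
    """Parse a JSON array of URL strings into a Python list (hypothetical helper)."""
    urls = json.loads(json_text)
    if not isinstance(urls, list):
        raise ValueError('expected a JSON array of URLs')
    # Keep only the string entries, in case the file contains anything else.
    return [url for url in urls if isinstance(url, str)]


# Made-up content standing in for the .json file.
sample_json = '["http://example.com/a", "http://example.com/b"]'

for url in load_urls(sample_json):
    # Each url could now be passed to requests.get(url), as in the lesson code.
    print(url)
```

If the JSON lives at a URL rather than in a local file, response.text from requests.get could be passed to load_urls in the same way.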
  
saqib says:

May 15, 2019 at 3:11 pm

Can you please tell me how I can play an mp4 video in Python from a URL which is stored in a JSON file?

1. GoTrained says:
  
  August 2, 2019 at 6:44 am
  
  @saqib – do you mean on a web interface? You need to check a web development framework like Flask or Django, and find a video player.
  
Rukshan says:

September 25, 2019 at 3:40 pm

How could we code it so that we also get the link of the next page in the same list?

1. GoTrained says:
  
  October 21, 2019 at 7:58 am
  
  @Rukshan Our free course can give you a better idea how to extract next pages with Requests and BeautifulSoup (link). You can also check Scrapy tutorials on our blog.
  