After installing the required libraries (BeautifulSoup, Requests, and lxml), let's learn how to extract URLs.
I will start by explaining things informally; you can find the formal terms in the code comments. Needless to say, the variable names could be anything else; what matters is the code workflow.
So we have five variables:
- url: the address of the website/page you want to open.
- response: Great! Your internet connection works, your URL is correct, and you are allowed to access this page. It is as if you can now see the web page in your browser.
- data: It is as if you copy-pasted the text of the page, namely its source code, but into a variable rather than somewhere else.
- soup: You are asking BeautifulSoup to parse the text, building a data structure out of the page that makes it easy to navigate the HTML tags.
- tags: You are now extracting specific tags, such as the &lt;a&gt; tags for links, into a list so that you can loop over them later.
Extracting URLs is something you will be doing all the time in web scraping and crawling tasks. Why? Because you usually need to start with one page (e.g. a book list) and then open its sub-pages (e.g. the page of each book) to scrape data from them.
Now, here is the code of this lesson. It extracts all the URLs from a web page.
```python
from bs4 import BeautifulSoup
import requests

url = "http://www.htmlandcssbook.com/code-samples/chapter-04/example.html"

# Getting the webpage, creating a Response object.
response = requests.get(url)

# Extracting the source code of the page.
data = response.text

# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')

# Extracting all the <a> tags into a list.
tags = soup.find_all('a')

# Extracting URLs from the attribute href in the <a> tags.
for tag in tags:
    print(tag.get('href'))
```
Read the code carefully and try to run it. Then try changing the url variable to other web pages.
Then, move to Beautiful Soup Tutorial #3: Extracting URLs: Web Scraping Craigslist
Let me know if you have questions.
✅ ✅ ✅ If you want to learn more about web scraping, you can join this online video course:
Web Scraping with Python: BeautifulSoup, Requests & Selenium 👈
this is really useful, thank you.
How would you adapt this to also get the name of a link and output as a tab separated file?
e.g.
link1 name-of-link1
link2 name-of-link2
etc…
Hello! We have created a course that covers the process from A to Z, including saving the scraped data to a CSV file. You can join it for FREE at: https://courses.gotrained.com/course/details/webscraping101
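To sketch an answer to the question above: each &lt;a&gt; tag carries both the href attribute and the link text, so you can pair them and write one tab-separated line per link. This is a minimal sketch, not the course's code; the sample HTML string stands in for response.text, and with a live page you would fetch it via requests.get(url).text.

```python
from bs4 import BeautifulSoup

# Sample HTML standing in for response.text from the lesson's code.
html = '<a href="page1.html">Book One</a><a href="page2.html">Book Two</a>'
soup = BeautifulSoup(html, 'html.parser')  # html.parser avoids the lxml dependency

# Build "link<TAB>name-of-link" rows, skipping <a> tags without an href.
rows = []
for tag in soup.find_all('a'):
    href = tag.get('href')
    if href:
        rows.append(f"{href}\t{tag.get_text(strip=True)}")

# Write the rows to a tab-separated file (the filename is arbitrary).
with open('links.tsv', 'w', encoding='utf-8') as f:
    f.write("\n".join(rows))
```

Opening links.tsv in a spreadsheet program will then show one column of links and one column of names.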
Hi! Thank you! What can we do if our URL is a .json file with a list of URLs?
e.g. the .json file:
[
"http://…..",
"http://……",
"http://……"
]
Do you have only one file, or will you receive such files repeatedly? If it is only one, you can copy its contents into a Python list and loop over it. If you must handle the JSON programmatically, use the json library to extract the list and then loop over it in the same way.
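The programmatic route can be sketched like this. The raw string is a stand-in for the file's contents; in practice you would read it with open('urls.json') or fetch it with requests.get(url).text, and the example URLs are placeholders.

```python
import json

# Stand-in for the contents of the .json file described above.
raw = '["http://example.com/page1", "http://example.com/page2"]'

# Parse the JSON array into a Python list.
# (Use json.load(f) instead when reading directly from an open file object.)
urls = json.loads(raw)

for url in urls:
    print(url)  # each url could now be passed to requests.get() and parsed
```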
Can you please tell me how I can play an MP4 video, from a URL stored in a JSON file, in Python?
@saqib – do you mean on a web interface? You would need to check a web development framework like Flask or Django and find a video player.
How could we code it so that we also get the link of the next page in the same list?
@Rukshan Our free course can give you a better idea how to extract next pages with Requests and BeautifulSoup (link). You can also check Scrapy tutorials on our blog.
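As a rough sketch of the idea (not the course's code): follow the pagination link, here assumed to be an &lt;a&gt; tag whose text is literally "next", and keep appending each page's links to one list. The selector, the "next" text, and the page limit are all assumptions that depend on the site; get_page is a stand-in for fetching the HTML, e.g. lambda u: requests.get(u).text.

```python
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def collect_links(get_page, start_url, max_pages=5):
    """Collect all <a> hrefs across pages, following the 'next' link.

    get_page(url) should return the page's HTML as a string. The
    'next' anchor text is a hypothetical pagination marker; a real
    site's selector may differ.
    """
    links, url = [], start_url
    for _ in range(max_pages):  # safety cap against endless pagination
        soup = BeautifulSoup(get_page(url), 'html.parser')
        # Resolve relative hrefs against the current page's URL.
        links += [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]
        nxt = soup.find('a', string='next')
        if not nxt:
            break  # no next-page link: we are on the last page
        url = urljoin(url, nxt['href'])
    return links
```

With Requests, you would call it as collect_links(lambda u: requests.get(u).text, start_url).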