Beautiful Soup Tutorial #2: Extracting URLs

After installing the required libraries (BeautifulSoup, Requests, and lxml), let's learn how to extract URLs.

I will start by talking informally, but you can find the formal terms in the comments of the code. Needless to say, the variable names could be anything; we care more about the workflow of the code.

So we have 5 variables:

  1. url: It is the website/page you want to open.
  2. response: Great! Your internet connection works, your URL is correct, and you are allowed to access this page. It is just like seeing the web page in your browser.
  3. data: It is like using copy-paste to grab the text of the page, namely its source code, except it goes into a variable instead of the clipboard.
  4. soup: You are asking BeautifulSoup to parse that text, first by building a data structure out of the page that makes it easy to navigate the HTML tags.
  5. tags: You are now extracting specific tags, such as the <a> tags for links, into a list so that you can loop over them later.


Extracting URLs is something you will be doing all the time in web scraping and crawling tasks. Why? Because you usually need to start with one page (e.g. a book list) and then open its sub-pages (e.g. the page of each book) to scrape data from them.


Now, here is the code of this lesson. It extracts all the URLs from a web page.
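A minimal sketch of that workflow, using the five variables described above (the example url here is just a placeholder; swap in any page you like):

```python
import requests
from bs4 import BeautifulSoup

# url: the website/page you want to open
url = "https://www.example.com"

# response: the HTTP response; status code 200 means the page opened fine
response = requests.get(url)

# data: the source code of the page, copied into a variable
data = response.text

# soup: a parsed data structure that makes navigating HTML tags easy
soup = BeautifulSoup(data, "lxml")

# tags: all the <a> (link) tags, as a list you can loop over
tags = soup.find_all("a")

# print the URL inside each link tag
for tag in tags:
    print(tag.get("href"))
```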


Read the code carefully and try to run it. Then try changing the url variable to other web pages.


Then, move on to Beautiful Soup Tutorial #3: Extracting URLs: Web Scraping Craigslist.


Let me know if you have questions.


✅ ✅ ✅  If you want to learn more about web scraping, you can join this online video course:

Web Scraping with Python: BeautifulSoup, Requests & Selenium



8 Replies to “Beautiful Soup Tutorial #2: Extracting URLs”

  1. This is really useful, thank you.

    How would you adapt this to also get the name of a link and output the results as a tab-separated file?

    e.g.

    link1 name-of-link1
    link2 name-of-link2

    etc…

    1. Hello! We have created a course that covers the complete process from A to Z, including saving the scraped data to a CSV file. You can join it for FREE at: https://courses.gotrained.com/course/details/webscraping101

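A rough sketch of the tab-separated idea from the comment above; the HTML is inline here just for illustration (in practice it would come from response.text):

```python
from bs4 import BeautifulSoup

# stand-in HTML; in a real script this would be the fetched page source
html = """
<a href="https://example.com/page1">Page One</a>
<a href="https://example.com/page2">Page Two</a>
"""

soup = BeautifulSoup(html, "lxml")

# each line of the output file: the link, a tab, then the visible name of the link
with open("links.tsv", "w", encoding="utf-8") as f:
    for tag in soup.find_all("a"):
        f.write(f"{tag.get('href')}\t{tag.get_text(strip=True)}\n")
```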
  2. Hi! Thank you! How can we do it if our url is a .json file with a list of URLs?
    e.g. a .json file:
    [
    "http://…..",
    "http://……",
    "http://……"
    ]

    1. Do you have only one file, or will you get new ones again and again? If only one, you can copy its contents into a Python list and loop over it. If you must handle the JSON programmatically, then use the json library to extract the list and loop over it the same way.

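A minimal sketch of that json-library approach; the file name urls.json is an assumption:

```python
import json

import requests
from bs4 import BeautifulSoup

def extract_links(json_path):
    # the .json file holds a plain list of URL strings
    with open(json_path, encoding="utf-8") as f:
        urls = json.load(f)

    # loop over the list and pull the links out of each page
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.text, "lxml")
        for tag in soup.find_all("a"):
            print(tag.get("href"))

# usage: extract_links("urls.json")
```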
  3. Can you please tell me how I can play an MP4 video from a URL stored in a JSON file in Python?

    1. @saqib – do you mean on a web interface? You need to check a web development framework like Flask or Django, and find a video player.

  4. How could we code it so that we also get the link of the next page in the same list?

    1. @Rukshan Our free course can give you a better idea of how to extract next pages with Requests and BeautifulSoup (link). You can also check the Scrapy tutorials on our blog.

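One rough idea for the next-page question; the markup and the "next" class here are assumptions that depend on the actual site, so inspect its HTML first:

```python
from bs4 import BeautifulSoup

# sample pagination markup; real sites differ
html = '<a href="/book/1">Book 1</a> <a class="next" href="/page/2">Next</a>'
soup = BeautifulSoup(html, "lxml")

# the ordinary links on this page (here, the tags without a class attribute)
urls = [tag.get("href") for tag in soup.find_all("a", class_=False)]

# pagination links are often marked with a class such as "next";
# append that link to the same list
next_tag = soup.find("a", class_="next")
if next_tag is not None:
    urls.append(next_tag.get("href"))

print(urls)
```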
