Beautiful Soup Tutorial #2: Extracting URLs

After installing the required libraries (BeautifulSoup, Requests, and lxml), let’s learn how to extract URLs.

I will start by describing things informally, but you can find the formal terms in the comments of the code. Needless to say, the variable names could be anything else; we care more about the workflow of the code.

So we have five variables:

  1. url: It is the website/page you want to open.
  2. response: Great! Your internet connection works, your URL is correct, and you are allowed to access this page. It is as if you can now see the web page in your browser.
  3. data: It is like using copy-paste to get the text of the page, namely its source code, except that it goes into a variable instead of the clipboard.
  4. soup: You are asking BeautifulSoup to parse the text, first by building a data structure out of the page that makes it easy to navigate the HTML tags.
  5. tags: You are now extracting specific tags, like the tags for links, into a list so that you can loop over them later.


Extracting URLs is something you will be doing all the time in web scraping and crawling tasks. Why? Because you need to start with one page (e.g. a book list) and then open its sub-pages (e.g. the page of each book) to scrape data from them.


Now, here is the code of this lesson. It extracts all the URLs from a web page.
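
Here is a minimal sketch of that code, following the five variables described above (the example url, books.toscrape.com, is just a placeholder; any page works):

```python
# A minimal sketch of the lesson's workflow; the example url is a placeholder.
import requests
from bs4 import BeautifulSoup

# url: the website/page you want to open
url = "http://books.toscrape.com/"

# response: the HTTP response; a 200 status code means you are allowed in
response = requests.get(url)

# data: the source code (HTML text) of the page, copied into a variable
data = response.text

# soup: a parsed data structure that makes it easy to navigate the HTML tags
soup = BeautifulSoup(data, "lxml")

# tags: all the <a> (link) tags, collected into a list to loop over
tags = soup.find_all("a")

# print the URL stored in each link's href attribute
for tag in tags:
    print(tag.get("href"))
```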


Read the code carefully and try to run it. Even try changing the “url” to other web pages.


Then, move to Beautiful Soup Tutorial #3: Extracting URLs: Web Scraping Craigslist


Let me know if you have questions.


✅ ✅ ✅  If you want to learn more about web scraping, you can join this online video course:

Web Scraping with Python: BeautifulSoup, Requests & Selenium 👈




2 Replies to “Beautiful Soup Tutorial #2: Extracting URLs”

  1. This is really useful, thank you.

    How would you adapt this to also get the name of each link and output it as a tab-separated file?

    e.g.

    link1 name-of-link1
    link2 name-of-link2

    etc…

    1. Hi! Scrapy has something called the close function, in which you can do anything with the data you have just scraped; in this case, replacing the commas with tabs in the CSV file. Or you can do this simply on the CSV file in separate code without Scrapy, whichever is easier for you. But this is the general idea: you convert the CSV to a tab-delimited file.
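
      If you are using the BeautifulSoup code from this lesson rather than Scrapy, a minimal sketch of that idea could look like this (the url and the links.tsv filename are just examples):

      ```python
      # A sketch: write each link as "URL<TAB>link text" using Python's csv
      # module with a tab delimiter. "links.tsv" is an example filename.
      import csv
      import requests
      from bs4 import BeautifulSoup

      url = "http://books.toscrape.com/"  # placeholder page
      soup = BeautifulSoup(requests.get(url).text, "lxml")

      with open("links.tsv", "w", newline="", encoding="utf-8") as f:
          writer = csv.writer(f, delimiter="\t")  # tabs instead of commas
          for tag in soup.find_all("a"):
              # the link's URL first, then its visible name/text
              writer.writerow([tag.get("href"), tag.get_text(strip=True)])
      ```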

