Searching GitHub Using Python & GitHub API

GitHub is a web-based hosting service for version control using Git. It is mostly used for storing and sharing computer source code. It offers all of the distributed version control and source code management functionality of Git as well as adding its own features.

GitHub stores more than 3 million repositories with more than 1.7 million developers using it daily. With so much data, it can be quite daunting at first to find the information you need or to automate repetitive tasks, and that is where the GitHub API comes in handy.

In this tutorial, you are going to learn how to use the GitHub API to search for repositories and files that match particular keyword(s) and retrieve their URLs using Python. You will also learn how to download files or a specific folder from a GitHub repository.

Project Setup

Personal Access Token

In order to access the GitHub API, you will need an access token to authorize API calls. Head over to the token settings page of your GitHub account. If you do not have a GitHub account, you will have to create one.

Click Generate New Token.

Enter the token description and check public_repo.

Scroll to the bottom and click Generate token.

Once your token is created, copy and save it somewhere for later use. Note, once you leave this page you will not see that token again.

 

Client Setup

The only package you need to install for Python is PyGithub. Run pip install PyGithub from your terminal.

Note: PyGithub is a third-party library. GitHub only offers official client libraries for Ruby, Node.js and .NET.

 

Then, you need to import it.

 

GitHub API Test

With the access token obtained earlier, you need to test your connection to the API. First of all, create a constant to hold your token:
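The original listing is not reproduced here, so the following is only a minimal sketch of such a constant. Reading the value from an environment variable (GITHUB_TOKEN is a name assumed for this example) instead of hard-coding it is a suggestion on top of the tutorial, and the fallback string is a placeholder that you would replace with your own token.

```python
import os

# GITHUB_TOKEN is an assumed variable name; the fallback is a placeholder,
# not a working token.
ACCESS_TOKEN = os.environ.get("GITHUB_TOKEN", "put-your-token-here")
```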

 

Then initialize the GitHub client.

 

You can then try getting your list of repositories to test the connection.
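As a rough sketch of this connection test, the helper below collects the full names of the repositories visible to the authenticated user. It assumes PyGithub's get_user() and get_repos() calls; the helper name list_repo_names is made up for this example, and the commented lines show how a real client could be passed in.

```python
def list_repo_names(client):
    """Return the full names of the authenticated user's repositories."""
    return [repo.full_name for repo in client.get_user().get_repos()]


# With PyGithub installed and ACCESS_TOKEN defined earlier in your script:
#   from github import Github
#   g = Github(ACCESS_TOKEN)
#   print(list_repo_names(g))
```

Passing the client in as a parameter keeps the helper easy to try out with a stand-in object before you spend real API calls on it.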

 

The result should be something similar to the following.

Good. Now you are all set up.

This tutorial covers the following topics:

  1. Searching GitHub repos using the GitHub API
  2. Searching *.po files using the GitHub API
  3. Downloading a folder from GitHub using svn

Before you proceed, make a copy of the script that contains the access token, so that each of the first two sections has its own script.

 

Searching GitHub Repos

Capture Keywords

The first thing you need to do is capture keywords. Simply add the following snippet at the bottom of your script:
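The snippet itself is missing from this copy of the tutorial, so here is a minimal sketch of what such a prompt could look like. The exact prompt wording and the function name capture_keywords are assumptions; the read parameter exists only so the function can be exercised without a keyboard.

```python
def capture_keywords(read=input):
    """Prompt for comma-separated keywords and return the raw text."""
    # The hint in square brackets shows the user the expected format.
    return read("Enter keyword(s) [e.g. python, flask, postgres]: ")


# In the script you would simply call:
#   keywords = capture_keywords()
```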

Take note of the suggestions in between the square brackets. It is always good to guide the user on the kind of input you require so that you do not spend a lot of time trying to parse the input provided.

 

Once the user provides the input, you need to split it into a list:

Here, you are splitting the keywords provided and trimming them of any unnecessary white-space. Python’s list comprehensions enable you to perform all this in one line.
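The split-and-trim step described above can be sketched with a single list comprehension. The function name split_keywords and the sample input are chosen for illustration only.

```python
def split_keywords(raw):
    """Split a comma-separated string into a list of trimmed keywords."""
    return [keyword.strip() for keyword in raw.split(",")]


print(split_keywords("python, django , postgres"))
# ['python', 'django', 'postgres']
```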

Search Repositories

Now you need to add a function that will receive the keywords and search GitHub for repos that match.

There are a couple of things happening in this function. First of all, you are taking the keywords and forming a GitHub search query. GitHub search queries take the following format.
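GitHub's search documentation describes a query as one or more search keywords followed by qualifiers, roughly SEARCH_KEYWORD_1 SEARCH_KEYWORD_N QUALIFIER_1 QUALIFIER_N. The helper below is a sketch of how the query could be assembled in that shape, joining the parts with '+' in the same way as the qualifier string quoted in this section; build_repo_query and QUALIFIERS are names chosen for this example.

```python
# Restrict matches to the README and the repository description.
QUALIFIERS = "+in:readme+in:description"


def build_repo_query(keywords):
    """Join the keywords with '+' and append the search qualifiers."""
    return "+".join(keywords) + QUALIFIERS


print(build_repo_query(["python", "django", "postgres"]))
# python+django+postgres+in:readme+in:description
```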

In your function, '+in:readme+in:description' are the qualifiers. Once the query has been formed, you submit it to GitHub, ordering the results by the number of stars in descending order. When you get the results, you print the total number of repos found and then the clone URL of each one. At the bottom of your script, add the function call with keywords as the parameter and run the script.
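Putting the steps above together, a search function along these lines would match the description. This is a sketch rather than the original listing: it assumes PyGithub's search_repositories(query, sort, order) call and the totalCount and clone_url attributes, and it takes the client as a parameter (an assumption made here) so that it does not depend on a global variable.

```python
def search_github(client, keywords):
    """Search repositories matching the keywords and print their clone URLs."""
    query = "+".join(keywords) + "+in:readme+in:description"
    # Sort by star count, highest first.
    result = client.search_repositories(query, "stars", "desc")

    print(f"Found {result.totalCount} repo(s)")
    for repo in result:
        print(repo.clone_url)


# With a PyGithub client g and the keywords list built earlier:
#   search_github(g, keywords)
```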

When you submit python, django, postgres as the input to the script you should end up with the following output.

To make the output more usable, you need to add the number of stars next to each URL. Make the following modification.
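One way to sketch this modification is a small formatting helper that the loop can print instead of the bare URL. It assumes the stargazers_count attribute that PyGithub exposes on repositories; the name format_repo and the exact output layout are illustrative.

```python
def format_repo(repo):
    """Return one output line: the clone URL followed by the star count."""
    return f"{repo.clone_url}, {repo.stargazers_count} stars"


# Inside the results loop:
#   for repo in result:
#       print(format_repo(repo))
```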

Running the script with the same input as before will give the following output.

Searching GitHub Files

In this section, you will search for *.po files (translation files) that include the name of a specific language.

Capture Keyword

The first thing you need to do is capture the keyword. Simply add the following snippet at the bottom of your script:

Take note of the suggestions in between the square brackets. It is always good to guide the user on the kind of input you require so that you do not spend a lot of time trying to parse the input provided.

Search Files

Now you need to add a function that will receive the keyword and search GitHub for files that contain it.

There are a couple of things happening in this function. First of all, you are checking GitHub for the current API rate limit. In order to prevent future API calls from being blocked, it is always good to check the current status of your limits before making any call. If your rate checks out, you take the keyword and form a GitHub search query.
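The rate-limit check described above could be sketched as follows. It assumes PyGithub's get_rate_limit() call and the remaining and limit attributes of the search rate; where exactly the search rate lives on the returned object appears to differ between PyGithub releases, so the lookup below tries both layouts and should be treated as an assumption. The name search_calls_remaining is made up for this example.

```python
def search_calls_remaining(client):
    """Return (remaining, limit) for the search API rate limit."""
    overview = client.get_rate_limit()
    # Assumption: some PyGithub releases expose the search limits as
    # overview.search, others as overview.resources.search.
    rate = getattr(overview, "resources", overview).search
    return rate.remaining, rate.limit


# Before searching:
#   remaining, limit = search_calls_remaining(g)
#   if remaining == 0:
#       print(f"You have 0/{limit} search API calls remaining")
```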

In your function, 'in:file extension:po' are the qualifiers. You are only interested in *.po files which contain your keyword. Also note the max_size variable, which is used to limit the results returned to the first 100. Once the query has been formed, you submit it to GitHub, ordering the results in descending order. When you get the results, you print the total number of files found and then the download URL of each one. At the bottom of your script, add the function call with keyword as the parameter and run the script.
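A file search along the lines described above might look like the sketch below. It assumes PyGithub's search_code(query, order=...) call and the totalCount and download_url attributes; the rate-limit check is left out to keep the example short, the client is passed in as a parameter, and itertools.islice is used here as one possible way to stop after max_size results.

```python
from itertools import islice


def search_po_files(client, keyword, max_size=100):
    """Search *.po files containing keyword and print their download URLs."""
    query = f"{keyword} in:file extension:po"
    result = client.search_code(query, order="desc")

    print(f"Found {result.totalCount} file(s)")
    # Stop after max_size items even if more results are available.
    for file in islice(result, max_size):
        print(file.download_url)


# With a PyGithub client g:
#   search_po_files(g, keyword)
```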

When you submit dutch as the input to the script you should end up with the following output.

There is so much that can be achieved with the GitHub API. You just need to take note of one important thing. When generating a personal access token, only check what you need. This is just an extra precaution in case your script falls into the wrong hands.

 

Download Files

To download the files returned by the previous script, you can use the Requests library.

After importing requests, the first line is simply the file URL. The second line is sending a request to connect to the URL. Finally, the last line writes the file content to a new file on the local machine.

You can add this portion of code to the "for file in result" loop you created earlier. In this case, you need to give each file a distinct name, for example by using its index number in the loop or by using filename = url[url.rfind("/")+1:] to extract the filename from the URL.
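The download step and the filename extraction can be combined as in the sketch below. The getter is passed in as a parameter (an assumption made so the function is not tied to one HTTP library); with the Requests library you would pass requests.get, whose response exposes the body as content. The function names and the dest_dir parameter are illustrative.

```python
import os


def filename_from_url(url):
    """Extract the last path segment of a URL to use as a file name."""
    return url[url.rfind("/") + 1:]


def download_file(url, get, dest_dir="."):
    """Fetch url with the given getter and save the body under dest_dir."""
    response = get(url)
    path = os.path.join(dest_dir, filename_from_url(url))
    with open(path, "wb") as handle:
        handle.write(response.content)
    return path


# With Requests installed, inside the results loop:
#   import requests
#   download_file(file.download_url, requests.get)
```

Note that two files with the same name in different repositories would overwrite each other with this naming scheme, which is where the index number mentioned above can help.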

 

Downloading GitHub Folders

In the third section of this tutorial, you are going to learn how to download a single folder/directory from a GitHub repository. Please note that this section does not require the use of the GitHub API so just create a blank Python script.

Capture URL

The first thing you need to do is to capture the URL of the folder you want to download. In the new script, add the following.

When dealing with URLs, it’s always good to validate them before doing anything with them. There are several methods of doing it. For this tutorial, you are going to use a library which focuses on validation. Run:

Once you have installed the package, add the validation logic at the bottom of the script.
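The validation library is not named in this copy of the tutorial, so the sketch below uses the standard library's urllib.parse as a stand-in. It only checks that the input has an http or https scheme and a host name, which is a looser test than a dedicated validation package would apply; is_valid_url is a name chosen for this example.

```python
from urllib.parse import urlparse


def is_valid_url(url):
    """Return True if url has an http(s) scheme and a host name."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


# In the script:
#   if not is_valid_url(url):
#       print("Invalid URL")
```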

Before adding the function for downloading the folder, you need to add one more dependency.

SVN (Subversion) is a version control system like Git, although it is centralized whereas Git is distributed. Git does not have a native command for downloading just a sub-directory from a repo. The only other way to get all the files from a sub-directory is to download them individually, which can be really tedious, hence the use of svn.

Note: In order for the SVN Python package to work, you need to make sure svn is installed on your system and can be launched from Terminal/Command Prompt.

Download the Folder

Once you have verified that svn is installed, add the function for downloading the folder.

In order to make svn work with the provided URL, you need to replace tree/master with trunk. Git and svn share a lot of features, but there are also a lot of differences between the two, the URL pattern being one of them.
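The URL rewrite and the export step might be sketched as follows. The rewrite is plain string replacement; the export assumes the svn package's svn.remote.RemoteClient class and its export() method, plus a working svn binary, and it depends on GitHub still serving the repository over Subversion, which is worth verifying before you rely on it. The function names and the output directory name are taken from the description in this section or chosen for illustration.

```python
def to_svn_url(url):
    """Rewrite a GitHub folder URL into the form svn expects."""
    return url.replace("tree/master", "trunk", 1)


def download_folder(url, dest="output"):
    """Export the folder at url into dest using the svn package."""
    # Imported here so the URL helper works even without the svn package.
    import svn.remote

    client = svn.remote.RemoteClient(to_svn_url(url))
    client.export(dest)


print(to_svn_url("https://github.com/pallets/flask/tree/master/examples"))
# https://github.com/pallets/flask/trunk/examples
```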

Finally, add the function call at the bottom of the script.

 

Now, try running the script, providing https://github.com/pallets/flask/tree/master/examples as the URL. A folder called output should be created with the contents of the folder specified in the URL.

 

Full Project Code (Searching Repos)

 

Full Project Code (Searching Files)

 

Full Project Code (Downloading a Folder)

 

