GitHub is a web-based hosting service for version control using Git. It is mostly used for storing and sharing computer source code. It offers all of the distributed version control and source code management functionality of Git as well as adding its own features.
GitHub stores more than 3 million repositories, with more than 1.7 million developers using it daily. With so much data, it can be quite daunting at first to find the information you need or to do repetitive tasks, and that is when the GitHub API comes in handy.
In this tutorial, you are going to learn how to use the GitHub API to search for repositories and files that match particular keyword(s) and retrieve their URLs using Python. You will also learn how to download files or a specific folder from a GitHub repository.
Project Setup
Personal Access Token
In order to access the GitHub API, you will need an access token to authorize API calls. Head over to your token settings page on GitHub. If you do not have a GitHub account, you will have to create one.
Click Generate New Token.
Enter the token description and check public_repo.
Scroll to the bottom and click Generate token.
Once your token is created, copy and save it somewhere for later use. Note, once you leave this page you will not see that token again.
Client Setup
The only Python package you need to install is PyGithub. Run:
pip install PyGithub
Note: PyGithub is a third-party library. GitHub only offers official client libraries for Ruby, Node.js and .NET.
Then, you need to import it.
from github import Github
GitHub API Test
With the access token obtained earlier, you need to test your connection to the API. First of all, create a constant to hold your token:
ACCESS_TOKEN = 'put your token here'
Then initialize the GitHub client.
g = Github(ACCESS_TOKEN)
You can then try getting your list of repositories to test the connection.
print(g.get_user().get_repos())
The result should be something similar to the following.
<github.PaginatedList.PaginatedList object at ......>
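The PaginatedList object shown above does not display its contents directly. As a quick optional check, you can iterate over it and print each repository's name:

# Optional check: iterate the paginated result and print each repository name
for repo in g.get_user().get_repos():
    print(repo.full_name)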
Good. Now you are all set up.
This tutorial covers the following topics:
- Searching GitHub repos using the GitHub API
- Searching *.po files using the GitHub API
- Downloading a folder from GitHub using svn
Before you proceed, make a copy of the script with the access token setup so that you have a separate script for each section.
Searching GitHub Repos
Capture Keywords
The first thing you need to do is capture keywords. Simply add the following snippet at the bottom of your script:
if __name__ == '__main__':
    keywords = input('Enter keyword(s) [e.g. python, flask, postgres]: ')
Take note of the suggestions in between the square brackets. It is always good to guide the user on the kind of input you require so that you do not spend a lot of time trying to parse the input provided.
Once the user provides the input, you need to split it into a list:
keywords = [keyword.strip() for keyword in keywords.split(',')]
Here, you are splitting the keywords provided and trimming them of any unnecessary white-space. Python’s list comprehensions enable you to perform all this in one line.
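For example, a slightly messy input comes out as a clean list (an illustrative check, not part of the script):

keywords = ' python, flask , postgres '
print([keyword.strip() for keyword in keywords.split(',')])
# ['python', 'flask', 'postgres']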
Search Repositories
Now you need to add a function that will receive the keywords and search GitHub for repos that match.
def search_github(keywords):
    query = '+'.join(keywords) + '+in:readme+in:description'
    result = g.search_repositories(query, 'stars', 'desc')
    print(f'Found {result.totalCount} repo(s)')
    for repo in result:
        print(repo.clone_url)
There are a couple of things happening in this function. First of all, you are taking the keywords and forming a GitHub search query. GitHub search queries take the following format.
SEARCH_KEYWORD_1+SEARCH_KEYWORD_N+QUALIFIER_1+QUALIFIER_N
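For example, with the keywords python, django and postgres used later in this section, the query built inside search_github would look like this (shown here only as an illustration):

keywords = ['python', 'django', 'postgres']
query = '+'.join(keywords) + '+in:readme+in:description'
print(query)
# python+django+postgres+in:readme+in:description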
In your function, '+in:readme+in:description' are the qualifiers. Once the query has been formed, you submit it to GitHub, ordering the results by the number of stars in descending order. When you get the results, you print the total number of repos found and then the clone URL for each one. At the bottom of your script, add the function call with keywords as the parameter and run the script.
keywords = [keyword.strip() for keyword in keywords.split(',')]
search_github(keywords)
When you submit python, django, postgres as the input to the script you should end up with the following output.
Found 54 repo(s)
https://github.com/citusdata/django-multitenant.git
https://github.com/dheerajchand/ubuntu-django-nginx-ansible.git
https://github.com/chenjr0719/Docker-Django-Nginx-Postgres.git
https://github.com/nomadjourney/python-box.git
https://github.com/laitassou/etherkar.git
https://github.com/the-vampiire/medi_assessment.git
https://github.com/mapes911/django-vagrant-box.git
https://github.com/sathyaNarayanC/registration-form.git
https://github.com/joshimiloni/AAM-Book-Exchange.git
https://github.com/dxvxd/vagrant-py3-django-pgSQL.git
https://github.com/desarroll0/lostItems.git
https://github.com/cjroth/example-docker-cloud-project.git
.....
To make the output more usable, you can add the number of stars next to each URL. Make the following modification.
for repo in result:
    print(f'{repo.clone_url}, {repo.stargazers_count} stars')
Running the script with the same input as before will give the following output.
Found 54 repo(s)
https://github.com/citusdata/django-multitenant.git, 181 stars
https://github.com/dheerajchand/ubuntu-django-nginx-ansible.git, 15 stars
https://github.com/chenjr0719/Docker-Django-Nginx-Postgres.git, 6 stars
https://github.com/nomadjourney/python-box.git, 4 stars
https://github.com/laitassou/etherkar.git, 2 stars
https://github.com/the-vampiire/medi_assessment.git, 2 stars
https://github.com/mapes911/django-vagrant-box.git, 1 stars
https://github.com/sathyaNarayanC/registration-form.git, 1 stars
https://github.com/dxvxd/vagrant-py3-django-pgSQL.git, 1 stars
https://github.com/joshimiloni/AAM-Book-Exchange.git, 1 stars
https://github.com/desarroll0/lostItems.git, 1 stars
.....
Searching GitHub Files
In this section, you will search for *.po files (translation files) that include the name of a specific language.
Capture Keyword
The first thing you need to do is capture a keyword. Simply add the following snippet at the bottom of your script:
if __name__ == '__main__':
    keyword = input('Enter keyword [e.g. french, german, etc.]: ')
As before, the suggestions in between the square brackets guide the user on the kind of input you expect, which saves you from spending a lot of time trying to parse the input provided.
Search Files
Now you need to add a function that will receive the keyword and search GitHub for files that contain it.
def search_github(keyword):
    rate_limit = g.get_rate_limit()
    rate = rate_limit.search
    if rate.remaining == 0:
        print(f'You have 0/{rate.limit} API calls remaining. Reset time: {rate.reset}')
        return
    else:
        print(f'You have {rate.remaining}/{rate.limit} API calls remaining')

    query = f'"{keyword} english" in:file extension:po'
    result = g.search_code(query, order='desc')
    max_size = 100
    print(f'Found {result.totalCount} file(s)')
    if result.totalCount > max_size:
        result = result[:max_size]
    for file in result:
        print(f'{file.download_url}')
There are a couple of things happening in this function. First of all, you are checking GitHub for the current API rate limit; in order to avoid having future API calls blocked, it is always good to check the current status of your limits before making any call. If your rate checks out, you take the keyword and form a GitHub search query.
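The same rate check can also be run on its own as a small sketch, reusing the g client from the setup section:

# Standalone check of the search rate limit before running any query
rate = g.get_rate_limit().search
print(f'{rate.remaining}/{rate.limit} search calls remaining, resets at {rate.reset}')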
In your function, 'in:file extension:po' are the qualifiers; you are only interested in *.po files which contain your keyword. Also note the max_size variable, which is used to limit the results to the first 100. Once the query has been formed, you submit it to GitHub, ordering the results in descending order. When you get the results, you print the total number of files found and then the download URL for each one. At the bottom of your script, add the function call with keyword as the parameter and run the script.
....
search_github(keyword)
When you submit dutch as the input to the script you should end up with the following output.
You have 28/30 API calls remaining
Found 196 file(s)
https://raw.githubusercontent.com/iegor/kdesktop/d5dccbe01eeb7c0e82ac5647cf2bc2d4c7beda0b/kde-i18n/ar/messages/kdeedu/klettres.po
https://raw.githubusercontent.com/iegor/kdesktop/d5dccbe01eeb7c0e82ac5647cf2bc2d4c7beda0b/kde-i18n/en_GB/messages/kdeedu/klettres.po
https://raw.githubusercontent.com/iegor/kdei18n/d5c80ababe3d6a39dcde39605080ddf07856215e/en_GB/messages/kdeedu/klettres.po
https://raw.githubusercontent.com/iegor/kdei18n/d5c80ababe3d6a39dcde39605080ddf07856215e/ar/messages/kdeedu/klettres.po
https://raw.githubusercontent.com/iegor/kdesktop/d5dccbe01eeb7c0e82ac5647cf2bc2d4c7beda0b/kde-i18n/af/messages/kdeedu/klettres.po
https://raw.githubusercontent.com/iegor/kdesktop/d5dccbe01eeb7c0e82ac5647cf2bc2d4c7beda0b/kde-i18n/az/messages/kdeedu/klettres.po
https://raw.githubusercontent.com/iegor/kdesktop/d5dccbe01eeb7c0e82ac5647cf2bc2d4c7beda0b/kde-i18n/br/messages/kdeedu/klettres.po
https://raw.githubusercontent.com/iegor/kdesktop/d5dccbe01eeb7c0e82ac5647cf2bc2d4c7beda0b/kde-i18n/bs/messages/kdeedu/klettres.po
https://raw.githubusercontent.com/iegor/kdesktop/d5dccbe01eeb7c0e82ac5647cf2bc2d4c7beda0b/kde-i18n/cy/messages/kdeedu/klettres.po
..........
There is so much that can be achieved with the GitHub API. You just need to take note of one important thing: when generating a personal access token, only check the scopes you need. This is just an extra precaution in case your script falls into the wrong hands.
Download Files
To download the files returned by the previous script, you can use the Requests library.
import requests

url = "https://raw.githubusercontent.com/iegor/kdesktop/d5dccbe01eeb7c0e82ac5647cf2bc2d4c7beda0b/kde-i18n/ar/messages/kdeedu/klettres.po"

r = requests.get(url)

open("file.po", "wb").write(r.content)
After importing requests, the first line simply holds the file URL. The second line sends a GET request to that URL. Finally, the last line writes the response content to a new file on the local machine.
You can add this portion of code to the for file in result loop you created earlier. In that case, you need to distinguish the file names, for example by the index number in the loop or by extracting the name from the URL with filename = url[url.rfind("/")+1:].
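As one possible sketch (assuming requests is imported at the top of the script), the end of search_github could be changed to save every result, using the loop index as a prefix so that identically named .po files from different repos do not overwrite each other:

import requests

....

    # Inside search_github(): download each result instead of only printing it
    for index, file in enumerate(result):
        url = file.download_url
        print(url)
        filename = url[url.rfind('/') + 1:]   # e.g. klettres.po
        r = requests.get(url)
        open(f'{index}_{filename}', 'wb').write(r.content)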
Downloading GitHub Folders
In the third section of this tutorial, you are going to learn how to download a single folder/directory from a GitHub repository. Please note that this section does not require the use of the GitHub API so just create a blank Python script.
Capture URL
The first thing you need to do is capture the URL of the folder you want to download. In the blank script you just created, add the following.
url = input('Enter folder url: ')
When dealing with URLs, it’s always good to validate them before doing anything with them. There are several methods of doing it. For this tutorial, you are going to use a library which focuses on validation. Run:
pip install validators
Once you have installed the package, add the validation logic at the bottom of the script.
import validators

....

if not validators.url(url):
    print('Invalid url')
else:
    pass
Before adding the function for downloading the folder, you need to add one more dependency.
pip install svn
SVN (Subversion) is a version control system like Git, but centralized rather than distributed. Git does not have a native command for downloading a single sub-directory from a repo; the only way to get the files in a sub-directory is to download them individually, which can be really tedious. That is the reason for using svn here, since GitHub also allows repositories to be checked out with an svn client.
Note: In order for the svn Python package to work, you need to make sure the svn command-line client is installed on your system and can be launched from the Terminal/Command Prompt.
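If you want the script to fail early with a clear message, a small optional sketch like the following (standard library only) checks for the binary before anything else runs:

import shutil

# Abort early if the svn command-line client is not on the PATH
if shutil.which('svn') is None:
    raise SystemExit('svn executable not found - please install Subversion first')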
Download the Folder
Once you have verified that svn is installed, add the function for downloading the folder.
from svn.remote import RemoteClient

....

def download_folder(url):
    if 'tree/master' in url:
        url = url.replace('tree/master', 'trunk')
    r = RemoteClient(url)
    r.export('output')
In order to make svn work with the provided URL, you need to replace tree/master with trunk. Git and svn share a lot of features but there are also a lot of differences between the two, the URL pattern being one of them.
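For example, the Flask examples folder used later in this section would be rewritten like this (purely illustrative):

url = 'https://github.com/pallets/flask/tree/master/examples'
print(url.replace('tree/master', 'trunk'))
# https://github.com/pallets/flask/trunk/examples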
Finally, call the function at the bottom of the script, replacing the pass placeholder from earlier.
if not validators.url(url):
    print('Invalid url')
else:
    download_folder(url)
Now, try running the script, providing https://github.com/pallets/flask/tree/master/examples as the URL. A folder called output should be created with the contents of the folder specified in the URL.
Full Project Code (Searching Repos)
from github import Github

ACCESS_TOKEN = 'put your token here'

g = Github(ACCESS_TOKEN)

def search_github(keywords):
    query = '+'.join(keywords) + '+in:readme+in:description'
    result = g.search_repositories(query, 'stars', 'desc')
    print(f'Found {result.totalCount} repo(s)')
    for repo in result:
        print(f'{repo.clone_url}, {repo.stargazers_count} stars')

if __name__ == '__main__':
    keywords = input('Enter keyword(s) [e.g. python, flask, postgres]: ')
    keywords = [keyword.strip() for keyword in keywords.split(',')]
    search_github(keywords)
Full Project Code (Searching Files)
from github import Github

ACCESS_TOKEN = 'put your token here'

g = Github(ACCESS_TOKEN)

def search_github(keyword):
    rate_limit = g.get_rate_limit()
    rate = rate_limit.search
    if rate.remaining == 0:
        print(f'You have 0/{rate.limit} API calls remaining. Reset time: {rate.reset}')
        return
    else:
        print(f'You have {rate.remaining}/{rate.limit} API calls remaining')

    query = f'"{keyword} english" in:file extension:po'
    result = g.search_code(query, order='desc')
    max_size = 100
    print(f'Found {result.totalCount} file(s)')
    if result.totalCount > max_size:
        result = result[:max_size]
    for file in result:
        print(f'{file.download_url}')

if __name__ == '__main__':
    keyword = input('Enter keyword [e.g. french, german, etc.]: ')
    search_github(keyword)
Full Project Code (Downloading a Folder)
import validators
from svn.remote import RemoteClient

def download_folder(url):
    if 'tree/master' in url:
        url = url.replace('tree/master', 'trunk')
    r = RemoteClient(url)
    r.export('output')

if __name__ == '__main__':
    url = input('Enter folder url: ')

    if not validators.url(url):
        print('Invalid url')
    else:
        download_folder(url)