Amazon S3 (Simple Storage Service) is the object storage service provided by Amazon Web Services (AWS). With the rise of big data applications and cloud computing, it has become increasingly common to store data in the cloud so that cloud applications can process it easily.
In this tutorial, you will learn how to use the Amazon S3 service via the Python library Boto3. You will learn how to create S3 buckets and folders, and how to upload and access files to and from S3 buckets. By the end, you will have Python code that you can run on an EC2 instance to access your data while it is stored in the cloud.
Introduction
Amazon Simple Storage Service (Amazon S3) is the data storage service provided by Amazon Web Services (AWS) and is used by many companies across different domains. It supports use cases such as data lakes and analytics, disaster recovery, data archiving, cloud-native application data, and data backup.
Why use S3 over EC2 for data storage?
- S3 is highly scalable. EC2 needs manual scaling when data ingestion grows.
- S3 is durable. Data in S3 is replicated across multiple data centers to prevent data loss. With EC2, you need to take snapshots of the EBS volume to keep data durable; any data that has not been snapshotted is lost once the EC2 instance is terminated.
- S3 has security built in. EC2 requires installing various OS-dependent software to keep the data secure.
- S3 is pay-as-you-go: you pay only for the storage you actually consume and for how the data is stored and retrieved. EC2 is provisioned, meaning you must define the EBS volumes before launching an instance, and you pay for the entire volume even if only a fraction of it is used.
- S3 makes file sharing much easier by providing direct download links. EC2 needs VPN configuration to share data.
- For large amounts of data that are needed by multiple applications and require heavy replication, S3 is much cheaper than EC2, whose main purpose is computation.
AWS CLI Installation and Boto3 Configuration
In order to access S3 via Python, you will need to install and configure the AWS CLI and the Boto3 Python library. I have already explained that in my previous post.
Follow along on how to Install AWS CLI and How to Configure and Install Boto3 Library from that post.
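If you prefer to keep everything in Python, credentials can also be passed to Boto3 directly through a session object. The snippet below is only a sketch; the access key, secret key, and region are placeholders, not real values from this tutorial, and you can skip this entirely if you have already run aws configure.

import boto3

# A minimal sketch of explicit configuration. If the AWS CLI is already configured,
# Boto3 picks up your credentials automatically and this is unnecessary.
session = boto3.session.Session(
    aws_access_key_id='YOUR_ACCESS_KEY_ID',          # placeholder
    aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',  # placeholder
    region_name='us-east-1'                          # example region
)
s3 = session.client('s3')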
S3 Client
First, import the Boto3 library.
import boto3
Create the boto3 client.
s3 = boto3.client('s3')
Getting a Response
Create a response variable and print it.
response = s3.list_buckets()
print(response)
You get a JSON-like response:
{'Buckets': [{'CreationDate': datetime.datetime(2019, 3, 6, 16, 35, 37, tzinfo=tzutc()), 'Name': 'testbuckethp'},
             {'CreationDate': datetime.datetime(2019, 3, 6, 16, 35, 46, tzinfo=tzutc()), 'Name': 'testbuckethp2'}],
 'Owner': {'DisplayName': 'harshvypandya',
           'ID': '16150aa9e7d75fa3c4086a6c5bec7c36aaf1fcac0251a9557cc914148c8205e7'},
 'ResponseMetadata': {'HTTPHeaders': {'content-type': 'application/xml',
                                      'date': 'Wed, 06 Mar 2019 16:36:10 GMT',
                                      'server': 'AmazonS3',
                                      'transfer-encoding': 'chunked',
                                      'x-amz-id-2': 'ms7TdSsaLB94+Ro8ugEhRx2IbmwJocNLfqf0DhndH5nRquwRg5rwRuIbRY+Rlun2qBl12byu9rw=',
                                      'x-amz-request-id': '41AF05A6535EBB7D'},
                      'HTTPStatusCode': 200,
                      'HostId': 'ms7TdSsaLB94+Ro8ugEhRx2IbmwJocNLfqf0DhndH5nRquwRg5rwRuIbRY+Rlun2qBl12byu9rw=',
                      'RequestId': '41AF05A6535EBB7D',
                      'RetryAttempts': 0}}
Use the following function to extract the necessary information. You need to import Pandas first.
import pandas as pd
def status_df(response):
    return pd.DataFrame({
        'ts': [x['CreationDate'] for x in response['Buckets']],
        'bucket name': [x['Name'] for x in response['Buckets']]
    })
You can invoke the function as follows:
status_df(response)
As shown, I have two S3 buckets named testbuckethp and testbuckethp2. Bucket names must be globally unique, so you may need to put in some effort to come up with a name that is not already taken.
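If you want to generate a unique name programmatically, one common trick (not part of the original walkthrough) is to append a random suffix, for example with Python's uuid module:

import uuid

# append a short random suffix so the bucket name is very likely to be globally unique;
# 'testbuckethp' is just the base name used in this tutorial
bucket_name = 'testbuckethp-' + uuid.uuid4().hex[:8]
print(bucket_name)  # e.g. testbuckethp-3f9c1a2b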
Create an S3 Bucket
Let us create a bucket from the Python terminal.
s3.create_bucket(Bucket='testbuckethp3py')
Output:
{'Location': '/testbuckethp3py',
 'ResponseMetadata': {'HTTPHeaders': {'content-length': '0',
                                      'date': 'Wed, 06 Mar 2019 16:36:44 GMT',
                                      'location': '/testbuckethp3py',
                                      'server': 'AmazonS3',
                                      'x-amz-id-2': '0kVfDcfYvtb7D2/JFx6L+uJb12TlCng9OppijphFZ5cFxl6EajMfI+uTQhfqTFcpXH/+YXN/Hog=',
                                      'x-amz-request-id': 'D1164B16FA1F5EC9'},
                      'HTTPStatusCode': 200,
                      'HostId': '0kVfDcfYvtb7D2/JFx6L+uJb12TlCng9OppijphFZ5cFxl6EajMfI+uTQhfqTFcpXH/+YXN/Hog=',
                      'RequestId': 'D1164B16FA1F5EC9',
                      'RetryAttempts': 0}}
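One caveat: the call above relies on the default region (us-east-1). If your configured region is different, S3 expects an explicit LocationConstraint. A sketch, using 'us-west-2' purely as an example region:

# only needed when the bucket should live outside us-east-1
s3.create_bucket(
    Bucket='testbuckethp3py',
    CreateBucketConfiguration={'LocationConstraint': 'us-west-2'}
)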
Let us check the status dataframe that lists all the buckets and their creation time.
response = s3.list_buckets()
status_df(response)
As you can see, I now have three buckets: testbuckethp, testbuckethp2, and the newly created testbuckethp3py.
Upload a File into the Bucket
You need to specify the path of the file you want to upload, the bucket name, and the name you want the file to have inside the bucket.
s3.upload_file('testfile.txt', 'testbuckethp3py', 'testfile_s3.txt')
In this case, you have a file called testfile.txt in the same directory as your Python script, and it is uploaded to the newly created S3 bucket under the name testfile_s3.txt. The call does not return anything, so a successful upload simply passes without output.
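Because upload_file returns nothing, you may want to confirm that the object actually landed in the bucket. One way to do this (a sketch, not part of the original post) is to call head_object and catch a ClientError:

from botocore.exceptions import ClientError

# head_object succeeds only if the key exists; a missing object raises a ClientError (404)
try:
    s3.head_object(Bucket='testbuckethp3py', Key='testfile_s3.txt')
    print('Upload confirmed.')
except ClientError as e:
    print('Object not found:', e)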
Creating Folder Structure
S3 does NOT have a folder structure at all, even though the AWS web console offers a button to create a folder and the official documentation uses the term "Create Folder".
In fact, S3 is simply a key-value storage system. Each object is given a key that is unique within the bucket, which makes object access faster than directory-level file access. Within a key, / is interpreted as a directory separator, so you can specify as many nested directories and subdirectories as you like without actually creating them.
Putting an object is very similar to uploading a file, except that it takes the body of the file rather than the file path.
with open('testfile.txt', 'r') as f:
    content = f.read()

s3.put_object(Body=content, Bucket='testbuckethp3py', Key='testdir/testfile.txt')
The code first reads the body of the file. Next, it creates the directory-like structure in the bucket, as specified by the key 'testdir/testfile.txt'.
As you can see, the S3 bucket shows a folder, and inside that folder I can see the file testfile.txt.
This way, you can structure your data in the way you desire.
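You can verify this key-prefix behaviour by listing the objects under a prefix. The sketch below assumes the bucket and key created above; list_objects_v2 treats 'testdir/' as if it were a folder even though no folder object exists.

# list every object whose key starts with 'testdir/'
response = s3.list_objects_v2(Bucket='testbuckethp3py', Prefix='testdir/')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])

Passing Delimiter='/' in addition would group keys into CommonPrefixes, which is how the console renders folders.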
S3 Application in Data Science
In order to understand the application of S3 in data science, let us upload some data to S3. For this tutorial, I am using US city population data from data.gov, which can be found here.
I have extracted a small piece of the data, containing New York State data only. Just as with the text file, you can upload the CSV file as an object as well.
with open('data/New York/New York_population.csv', 'r') as f:
    content = f.read()

s3.put_object(Body=content, Bucket='testbuckethp3py', Key='NewYork/population.csv')
Output:
{'ETag': '"7d0c9fd9023234f3508633c1866dfb98"',
 'ResponseMetadata': {'HTTPHeaders': {'content-length': '0',
                                      'date': 'Wed, 06 Mar 2019 16:51:02 GMT',
                                      'etag': '"7d0c9fd9023234f3508633c1866dfb98"',
                                      'server': 'AmazonS3',
                                      'x-amz-id-2': 'ub//o0r9B+6mQD0bQFOcHGQKWSVrn4eol897bmijhbWxYWvi5lk/xscQxys6LM/k/KduVHHIsG4=',
                                      'x-amz-request-id': '1540534E1F5ED7A7'},
                      'HTTPStatusCode': 200,
                      'HostId': 'ub//o0r9B+6mQD0bQFOcHGQKWSVrn4eol897bmijhbWxYWvi5lk/xscQxys6LM/k/KduVHHIsG4=',
                      'RequestId': '1540534E1F5ED7A7',
                      'RetryAttempts': 0}}
As you can see, it creates a new folder-like structure (NewYork), and inside that folder I can see my CSV file.
File Access from S3
In order to access the file, you need the resource object rather than the client object.
Create the resource object.
s3_resource = boto3.resource('s3')
In order to access the object, you need the right bucket name and the right key.
s3_object = s3_resource.Object(bucket_name='testbuckethp3py', key='NewYork/population.csv')
From this object, you can then read the body.
from io import StringIO
import pandas as pd

s3_data = StringIO(s3_object.get()['Body'].read().decode('utf-8'))
data = pd.read_csv(s3_data)
print(data.head())
Output:
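To close the loop on the data science workflow, you can also write a processed DataFrame back to S3. The sketch below reuses the same bucket with a hypothetical result key, writing the DataFrame to an in-memory buffer and putting it as an object:

csv_buffer = StringIO()
data.to_csv(csv_buffer, index=False)

# 'NewYork/population_processed.csv' is a hypothetical key chosen for this example
s3.put_object(Body=csv_buffer.getvalue(),
              Bucket='testbuckethp3py',
              Key='NewYork/population_processed.csv')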
Conclusion
S3 provides a secure, durable, and highly available solution for data storage in the cloud. Through the Boto3 Python library, you can access the data programmatically and build seamless applications with fast data retrieval.
Complete Project Code
import boto3
import pandas as pd
from io import StringIO


# helper function to print s3 buckets
def status_df(response):
    return pd.DataFrame({
        'ts': [x['CreationDate'] for x in response['Buckets']],
        'bucket name': [x['Name'] for x in response['Buckets']]
    })


# create client object
s3 = boto3.client('s3')

# list all s3 buckets
response = s3.list_buckets()
print(response)
print(status_df(response))

# create S3 bucket
s3.create_bucket(Bucket='testbuckethp3py')
response = s3.list_buckets()
status_df(response)

# upload file
with open('testfile.txt', 'r') as f:
    content = f.read()
    print(content)

s3.upload_file('testfile.txt', 'testbuckethp3py', 'testfile_s3.txt')

# create directory structure
s3.put_object(Body=content, Bucket='testbuckethp3py', Key='testdir/testfile.txt')

# upload a csv file
with open('data/New York/New York_population.csv', 'r') as f:
    content = f.read()

s3.put_object(Body=content, Bucket='testbuckethp3py', Key='NewYork/population.csv')

# access file from s3
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object(bucket_name='testbuckethp3py', key='NewYork/population.csv')

s3_data = StringIO(s3_object.get()['Body'].read().decode('utf-8'))
data = pd.read_csv(s3_data)
print(data.head())
I am a Data Scientist in the Manufacturing/IoT domain and an ML enthusiast. I also like to work on data analysis and data infrastructure projects. Looking forward to connecting.