AWS Simple Storage Service (S3)
At least one AWS EC2 node (or you can use your laptop to test the set-up described here)
Log in to the AWS Management Console and click on the S3 icon.
Click the “Create Bucket” button to create a storage location for your data. You have to name your bucket with a globally unique name and make sure you choose a region near the rest of your cluster (e.g., N. Virginia if you are in New York and N. California if you are in Silicon Valley).
You can transfer data between regions, but doing so can be costly and slow. Continue by clicking “Create”.
You can now create folders by clicking on your newly created bucket and clicking the “Create Folder” button. You can upload files from your local system by using the GUI, which is fairly straightforward. See the documentation for more details.
In order to utilize S3 to its fullest, you should use the AWS Command Line Interface (CLI).
To use it, you’ll need your AWS Access Key ID and Secret Access Key handy (you should have received these from your Program Director). Your credentials should look something like the examples below:
- Access key ID example: AKIAIOSFODNN7EXAMPLE
- Secret access key example: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Note: Your account will be shut down temporarily if you post your AWS credentials online (e.g., by accidentally pushing them to GitHub). Anything you push to a public GitHub repository is visible immediately, and Amazon regularly checks for credentials. It is therefore important to store your credentials in a file that Git ignores (using the .gitignore file) or in your .profile file.
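One way to catch an accidental leak before pushing is to scan text for strings shaped like an access key ID. The sketch below is a minimal example, assuming access key IDs follow the shape of the example above (`AKIA` followed by 16 uppercase letters or digits). The function name `find_access_key_ids` is hypothetical, and the pattern does not detect secret access keys.

```python
import re

# Assumption: access key IDs look like the example above,
# i.e. "AKIA" followed by 16 uppercase letters or digits.
ACCESS_KEY_ID_PATTERN = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_access_key_ids(text):
    """Return every substring of text that looks like an AWS access key ID."""
    return ACCESS_KEY_ID_PATTERN.findall(text)


if __name__ == "__main__":
    sample = "export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE"
    print(find_access_key_ids(sample))
```

Running a check like this over the files you are about to commit is a safety net, not a substitute for keeping credentials out of the repository in the first place.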
SSH into any one of your EC2 instances with:
Create a Downloads directory (if it doesn’t already exist).
any-node:~$ mkdir -p ~/Downloads
Then download, unzip, and install the AWS CLI:
any-node:~$ curl -L https://s3.amazonaws.com/aws-cli/awscli-bundle.zip -o ~/Downloads/awscli-bundle.zip
any-node:~$ unzip ~/Downloads/awscli-bundle.zip
any-node:~$ sudo ./awscli-bundle/install -i /usr/local/aws -b /usr/local/bin/aws
You can verify that the CLI is installed correctly, and get help, with the aws help command.
Configure your AWS credentials as environment variables by exporting the credentials described above in your .profile file.
Edit your .profile file
any-node:~$ nano ~/.profile
Add the following for the AWS environment variables (but with your own credentials)
export AWS_ACCESS_KEY_ID=<your-access-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
export AWS_DEFAULT_REGION=<your-cluster-region>
Then source the profile to load the environment variables:
any-node:~$ . ~/.profile
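Before running any AWS commands, it can help to confirm that the three variables were actually loaded. The following is a minimal sketch using only the Python standard library; the helper name `missing_aws_variables` is hypothetical, and it only checks that each variable is set and non-empty, not that the credentials are valid.

```python
import os

# The three variables exported in ~/.profile above.
REQUIRED_VARIABLES = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_DEFAULT_REGION",
)


def missing_aws_variables(environ=None):
    """Return the names of required variables that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_aws_variables()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All AWS environment variables are set.")
```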
Set the cluster region to the code that matches your cluster's location, according to the table below:
Location | Region code |
N. California | us-west-1 |
Oregon | us-west-2 |
N. Virginia | us-east-1 |
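If a script needs to turn a location name into a region code, the table above can be expressed as a small lookup. This is only a sketch covering the three locations listed here; the names `REGION_CODES` and `region_code` are hypothetical.

```python
# Region codes copied from the table above; other regions are not covered.
REGION_CODES = {
    "N. California": "us-west-1",
    "Oregon": "us-west-2",
    "N. Virginia": "us-east-1",
}


def region_code(location):
    """Return the region code for a location listed in the table."""
    try:
        return REGION_CODES[location]
    except KeyError:
        raise ValueError("Unknown location: %r" % (location,))


if __name__ == "__main__":
    print(region_code("N. Virginia"))
```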
Next, create an example directory and a file named Who.txt using an editor:
any-node:~$ mkdir ~/s3-examples
any-node:~$ nano ~/s3-examples/Who.txt
and copy the following text into it:
So call a big meeting
Get everyone out out
Make every Who holler
Make every Who shout shout
At this point, the file is only on the local Linux File System of the EC2 node (or your laptop if that is where you are testing this).
Make a new bucket on S3 (with your own unique name) using the mb command, then copy the file to it:
any-node:~$ aws s3 mb s3://<your-unique-bucket-name>
any-node:~$ aws s3 cp ~/s3-examples/Who.txt s3://<your-unique-bucket-name>/examples/
Note that the examples folder on S3 didn’t already exist; it was created by the cp command itself.
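The folder appears automatically because S3 stores objects under flat keys, and a "folder" is just a shared key prefix such as `examples/`. The sketch below illustrates how the destination in the cp command above maps to a bucket and a key. It is an approximation of the behaviour described here, not the CLI's actual implementation, and the function name `cp_destination` and bucket name `my-example-bucket` are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def cp_destination(local_path, s3_uri):
    """Return (bucket, key) for copying local_path to s3_uri.

    Assumption: when the destination is the bucket root or ends with "/",
    the local file name is appended to the prefix; otherwise the
    destination path is used as the key unchanged.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("Expected an s3://bucket/... URI: %r" % (s3_uri,))
    prefix = parsed.path.lstrip("/")
    if prefix == "" or prefix.endswith("/"):
        key = prefix + posixpath.basename(local_path)
    else:
        key = prefix
    return parsed.netloc, key


if __name__ == "__main__":
    print(cp_destination("~/s3-examples/Who.txt", "s3://my-example-bucket/examples/"))
```

No separate "create folder" step is needed: writing an object whose key starts with `examples/` is what makes the folder show up.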
You can view the copied file in the S3 web UI, or you can use the API as demonstrated below.
## Using the AWS SDK
The best way to use S3 (and many of the AWS tools) is through one of the Software Development Kits (SDKs). AWS provides SDKs for Java, Python, and many other languages. The Python SDK Boto is very popular among Fellows, and you can find good examples in its documentation. Getting started is usually as simple as installing boto with pip or from GitHub, then reading the access keys from the environment variables with something like:
import os

import boto

# Read the credentials from the environment variables exported in ~/.profile.
aws_access_key = os.getenv('AWS_ACCESS_KEY_ID', 'default')
aws_secret_access_key = os.getenv('AWS_SECRET_ACCESS_KEY', 'default')

bucket_name = "your-bucket-name"
folder_name = "your-folder-name/"
file_name = "your-file-name"

# Connect to S3, open the bucket, and look up the object by its full key.
conn = boto.connect_s3(aws_access_key, aws_secret_access_key)
bucket = conn.get_bucket(bucket_name)
key = bucket.get_key(folder_name + file_name)

# Fetch the object's contents and print them.
data = key.get_contents_as_string()
print(data)
Warning: DO NOT commit your access key or secret key to GitHub; use environment variables instead.