Tutorial
- AWS - Amazon Web Services - a cloud computing services provider
- S3 - an AWS file storage service; Amazon S3 stores data (files) as objects within resources called "buckets"
- EMR - Elastic Map Reduce - an AWS cluster computing framework
This tutorial guides the user through the steps necessary to perform an analysis. It is assumed that any shell commands are executed using a bash shell.
🔴 Note that this tutorial requires you to either create a new AWS S3 bucket or use an existing bucket. At various points in the instructions, you will need to replace occurrences of the word yourbucket (or [YOUR BUCKET]) with the name of the bucket you have created or chosen to use.
This tutorial is estimated to take up to an hour to complete. One step launches the necessary computing resources on AWS and takes approximately 20 minutes to complete. Once you have verified that this step has started, you can return in about 20 minutes to proceed with the remainder of the tutorial.
🔴 It is estimated that the cost of this tutorial, charged to your AWS account, will be less than $5 USD. Make sure you follow the instructions to terminate the session once you have completed the tutorial.
It is assumed the user has downloaded the Falco framework and has extracted the files to a location of their choice onto a Linux operating system environment. This location will be referred to as the local resource, and the directory on the local resource that contains the LICENSE file from the Falco framework will be referred to as the home directory. Files related to this tutorial can be found in the tutorial directory.
If your system does not have Python 3 installed, refer to https://python.org for the relevant documentation.
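To check whether a Python 3 interpreter is already available, you can run something like:
# report the installed Python 3 version, if any
python3 --version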
The Python library boto3 is required; it provides a programming interface to AWS services. To install boto3 for Python 3, issue the following command:
sudo python3 -m pip install boto3
If this command fails due to pip not being installed, the following commands work on Debian/Ubuntu related Linux systems:
# update system software
sudo apt-get update
# install pip for python 2
sudo apt-get install python-pip
# install pip for python 3
sudo apt-get install python3-pip
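Once pip is available and boto3 has been installed, an optional quick check (not part of the original instructions) is to confirm that the library can be imported:
# confirm that boto3 imports correctly and report its version
python3 -c "import boto3; print(boto3.__version__)"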
If you are not part of an organisation that can provide you with IAM credentials, as described in Section 1.3 below, you will need to create an AWS account. Note that for this and subsequent AWS setup steps, this Getting Started with AWS documentation will be helpful.
If you are not the AWS administrator, ask your AWS administrator for AWS Access Key ID and Secret Access Key.
The AWS CLI is a command line tool that provides an interface to AWS resources such as EC2 instances, EMR clusters, and S3 storage buckets. Install it with:
sudo pip install awscli
Once installed, an initial configuration is required:
aws configure
When prompted, enter your Access Key ID, Secret Access Key, and default AWS region. Press Enter when prompted for the output type if you are not sure about this setting.
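For reference, the interactive prompts look roughly like the following (the key values shown are placeholders only):
$ aws configure
AWS Access Key ID [None]: AKIAEXAMPLEKEYID
AWS Secret Access Key [None]: wJalrEXAMPLESECRETKEY
Default region name [None]: us-west-2
Default output format [None]: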
Obtain the AWS EC2 key name that will be used to control access to the instances that comprise the EMR cluster. A key file will have the extension .pem. If you are familiar with AWS, you may already have an AWS EC2 key. Otherwise, you may create a key with these instructions.
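If you prefer the command line to the console, a key pair can also be created with the AWS CLI; the sketch below uses my-key-name as a placeholder key name:
# create a new EC2 key pair and save the private key locally
aws ec2 create-key-pair --key-name my-key-name \
    --query 'KeyMaterial' --output text > my-key-name.pem
# restrict permissions on the private key file
chmod 400 my-key-name.pem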
Create an S3 bucket for use with Falco.
aws s3api create-bucket --bucket [YOUR BUCKET] --region us-west-2 --create-bucket-configuration LocationConstraint=us-west-2
Replace [YOUR BUCKET] with your own bucket name. Also replace us-west-2 (in both places) if you would prefer to use a different region; note that regions other than us-east-1 require the LocationConstraint setting shown above. Once created, use the AWS CLI high-level S3 commands to work with your bucket - e.g. copy files to or from your bucket, list the contents of your bucket, etc.
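For example, typical high-level S3 commands look like the following (myfile.txt is just a placeholder file name):
# list the contents of your bucket
aws s3 ls s3://[YOUR BUCKET]/
# copy a local file to the bucket
aws s3 cp myfile.txt s3://[YOUR BUCKET]/myfile.txt
# copy a file from the bucket back to the current directory
aws s3 cp s3://[YOUR BUCKET]/myfile.txt .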
Additionally, AWS provides a web interface to S3 (and other services).
If you did not set up access to AWS yourself, you may have been provided with a user name and password to access the AWS management console. You will also require your account ID. The console can be accessed at the following address (replace [My_AWS_Account_ID] with your account ID):
https://[My_AWS_Account_ID].signin.aws.amazon.com/console/
The sign-on screen will prompt for your user name and password. Once signed in, the AWS management console provides links to the various AWS services.
The following link shows the necessary information on how to set up Roles for AWS EMR: Create Default IAM Roles for Amazon EMR
🔴 Note that this is a one-time only step for your organisation - check that it has not already been completed.
aws emr create-default-roles
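One way to check whether the default roles already exist, before running the command above, is to query IAM directly (a suggested check, not part of the original instructions):
# if these return role details, the default EMR roles already exist
aws iam get-role --role-name EMR_DefaultRole
aws iam get-role --role-name EMR_EC2_DefaultRole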
Make sure your user has the necessary permissions for both AWS EMR and AWS S3. For the sake of completing this tutorial, if you are not in the AWS Administrator group, you could configure or request that your user have the following IAM policies:
- AmazonS3FullAccess
- AmazonElasticMapReduceFullAccess
These policies can be configured from the AWS IAM Management console.
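If you administer the account yourself, the policies can also be attached from the command line; the sketch below assumes a user name of [YOUR USER NAME] and requires IAM administrative permissions:
# attach the AWS managed policies to your IAM user
aws iam attach-user-policy --user-name [YOUR USER NAME] \
    --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-user-policy --user-name [YOUR USER NAME] \
    --policy-arn arn:aws:iam::aws:policy/AmazonElasticMapReduceFullAccess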
🔴 At this stage, if time or user resources are limited (e.g. your local computing resource has less than 32G of RAM, or you do not have a fast internet connection), you may prefer to skip the preparatory steps and go straight to Run the Tutorial. Otherwise, continue on to the next step.
In this tutorial, a human genome will be used. The file is ~800M in size.
In a work directory of your choice, create a genome directory, and download the files:
mkdir genome_ref
cd genome_ref
wget http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refFlat.txt.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/GRCh38.genome.fa.gz
wget ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.chr_patch_hapl_scaff.annotation.gtf.gz
# unzip the files
gunzip *.gz
Since Falco does not require the genome .fa file, you only need to copy the two other reference files to the AWS S3 bucket. At a bash prompt in the genome_ref directory, execute the following command, first changing [YOUR BUCKET] to your own bucket name:
aws s3 sync . s3://[YOUR BUCKET]/falco-tutorial/genomes/hg38/genome_ref --exclude "*.fa" --exclude "*.fa.gz"
# change back to work directory
cd ..
In this step, the indexes for both STAR and HISAT2 are created.
Before proceeding with this step, ensure that your Linux system has the following dependencies installed: make, gcc, g++, and glibc-static.
In the work directory:
wget -O STAR-2.5.2a.tar.gz https://github.com/alexdobin/STAR/archive/2.5.2a.tar.gz
tar -xzf STAR*.tar.gz
star_path=$( find . -name "STAR"|grep -E "/Linux_x86_64/" )
ln -s ${star_path%STAR} STAR
# create directory for star index
mkdir hg38_star_sparse_ref
STAR/STAR --runMode genomeGenerate --genomeDir hg38_star_sparse_ref/ --genomeFastaFiles genome_ref/GRCh38.genome.fa \
    --genomeSAsparseD 2 --runThreadN 8 --sjdbGTFfile genome_ref/gencode.v21.chr_patch_hapl_scaff.annotation.gtf
You can adjust the number of threads used by modifying the number after --runThreadN.
The hg38_star_sparse_ref/ directory should look something like:
| File name |
| --- |
| chrLength.txt |
| chrNameLength.txt |
| chrName.txt |
| chrStart.txt |
| exonGeTrInfo.tab |
| exonInfo.tab |
| geneInfo.tab |
| Genome |
| genomeParameters.txt |
| SA |
| SAindex |
| sjdbInfo.txt |
| sjdbList.fromGTF.out.tab |
| sjdbList.out.tab |
| transcriptInfo.tab |
aws s3 sync hg38_star_sparse_ref s3://[YOUR BUCKET]/falco-tutorial/genomes/hg38/star_index
Replace [YOUR BUCKET] with the name of your bucket.
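To confirm the index files reached S3, an optional check is to list the destination prefix:
# list the uploaded STAR index files
aws s3 ls s3://[YOUR BUCKET]/falco-tutorial/genomes/hg38/star_index/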
🔴 This step is not necessary for the tutorial, as the tutorial uses STAR as the aligner. However, if you plan on using HISAT2 during the tutorial, you will need to complete the following step.
In the work directory:
wget -O hisat2-2.0.4.zip ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/downloads/hisat2-2.0.4-Linux_x86_64.zip
unzip hisat2*.zip
hisat_dir=$( find . -maxdepth 1 -type d -name "hisat2*")
ln -s $hisat_dir hisat
# create directory for hisat index
mkdir hg38_hisat_index
hisat/hisat2-build -p 8 genome_ref/GRCh38.genome.fa hg38_hisat_index/hisat2.index
You can adjust the number of threads used by modifying the number after -p.
The hg38_hisat_index/ directory should look something like:
| File name |
| --- |
| hisat2.index.1.ht2 |
| hisat2.index.2.ht2 |
| hisat2.index.3.ht2 |
| hisat2.index.4.ht2 |
| hisat2.index.5.ht2 |
| hisat2.index.6.ht2 |
| hisat2.index.7.ht2 |
| hisat2.index.8.ht2 |
aws s3 sync hg38_hisat_index s3://[YOUR BUCKET]/falco-tutorial/genomes/hg38/hisat_index
Replace [YOUR BUCKET] with the name of your bucket.
For this tutorial, a script - get_data.sh - is provided in the tutorial directory that will download a small number of relatively small FASTQ files (the total download size is approximately 380M, across 10 individual files). The files are from the freely accessible Sequence Read Archive (SRA) database. To download and copy the files to a specified AWS S3 location, use the following command (issued from your work directory):
tutorial/get_data.sh s3://[YOUR BUCKET]/falco-tutorial/data
Replace [YOUR BUCKET] with the name of your bucket. If your work directory is not the Falco home directory, also replace tutorial/ with the actual path to the tutorial directory.
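Once the script finishes, an optional check is to list the uploaded FASTQ files:
# the listing should show the downloaded FASTQ files under the data prefix
aws s3 ls s3://[YOUR BUCKET]/falco-tutorial/data/ --recursive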
Falco is a framework that enables users to run software in a distributed cloud environment. The analysis software components that are currently utilised by Falco include STAR, HISAT2, featureCount, HTSeq and Picard Tools. This tutorial also makes use of pre-processing software such as Trimmomatic. This software needs to be uploaded to your AWS S3 bucket - to enable Falco to install the software on the nodes of the EMR cluster.
There is a script located in the source/cluster_creator directory that downloads the required software to your local resource (in a temporary directory) and then uploads the files to your S3 bucket. From the Falco home directory:
source/cluster_creator/prepare_install_files.sh s3://[YOUR BUCKET]/falco-tutorial/software_install
In this section the following icons will be used to flag tasks that may need completing:
- ❗ - this is a task that needs to be completed
- ⁉️ - this task may or may not be compulsory, and will depend on the circumstances as explained at the location of the icon
The tutorial files contain a number of locations with the placeholders username and yourbucket. Change these placeholders to your own values. In a bash shell, from the Falco home directory, edit and submit the following commands: ❗
# replace the bracketed sections with your details
sed -i.bak 's/username/[YOUR USER NAME]/g ; s/yourbucket/[YOUR BUCKET NAME]/g' tutorial/*.config
# EXAMPLE ONLY: if your user name is "fred" and your AWS bucket is "falco-test", then you would use the following command
#sed -i.bak 's/username/fred/g ; s/yourbucket/falco-test/g' tutorial/*.config
sed -i.bak.reg 's/us-west-2/[YOUR REGION]/g' tutorial/*.config
# change [YOUR REGION] to the name of your region
# EXAMPLE (if your region is us-west-1): sed -i.bak.reg 's/us-west-2/us-west-1/g' tutorial/*.config
The original .config files in the tutorial directory will now have the file extension .config.bak, should you wish to restore the original files.
❗ First open the file tutorial/emr_cluster.config and check that the [EMR_nodes] section is similar to the following:
[EMR_nodes]
key_name = yourkey
service_role = EMR_DefaultRole
instance_profile = EMR_EC2_DefaultRole
master_instance_type = r3.4xlarge
master_instance_count = 1
core_instance_type = r3.4xlarge
core_instance_count = 2
core_instance_spot = True
core_instance_bid_price = 1
The [EMR_nodes] section contains a line that starts with key_name =. The computing resources created by the AWS EMR framework use public-key cryptography to encrypt and decrypt login information. You need to supply your key name here. For example, if your encryption key file is my-key-name.pem, the corresponding line in the configuration file should read key_name = my-key-name. ❗ Go ahead and edit this entry - enter your key name.
🔴 Also be aware of the charges that you may incur from AWS for the creation of this cluster. The master instance will be an on-demand type, whilst the two core instances are spot instance types. It is estimated the cost of the cluster for this tutorial, if terminated within 1 hour, will be less than $5 USD. This cost is based on the AWS region us-west-2 - US West (Oregon).
When ready, in the home directory of the Falco code, issue the following command to start the cluster: ❗
python3 launch_cluster.py --config tutorial/emr_cluster.config
When the command is processed, the user will receive a response of the form:
Cluster has been launched with ID j-1FDPU9CHN79W9
Make a note of your cluster ID for future reference.
❗ Monitor the status of the EMR cluster via the AWS EMR console. When the status of your cluster is Ready, you may proceed with the steps required for completing the analysis.
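If you prefer the command line, the cluster state can also be queried with the AWS CLI, using the cluster ID returned earlier (the ID below is just the example from above):
# report the current cluster state, e.g. STARTING, BOOTSTRAPPING, WAITING, TERMINATED
aws emr describe-cluster --cluster-id j-1FDPU9CHN79W9 \
    --query 'Cluster.Status.State' --output text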
Click on your cluster to obtain more information about your cluster.
🔴 Since the nodes that are launched as part of this cluster use AWS spot instance types, it is possible that the market price for the instances exceeds the bid price. If this is the case, the cluster will not start until the market price falls below the bid price. You can monitor the market price for the spot instances via the AWS EC2 Management Console. You can then decide whether to terminate your EMR cluster and either try again later or modify the bid price in the file tutorial/emr_cluster.config.
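A quick way to check the current spot market price for the instance type used in this tutorial (a suggested check, not part of the original instructions):
# show recent spot prices for r3.4xlarge Linux instances in your configured region
aws ec2 describe-spot-price-history --instance-types r3.4xlarge \
    --product-descriptions "Linux/UNIX" --max-items 5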
Falco requires a manifest file to list the FASTQ filenames representing the data input. The required format is a tab-delimited text file. This file has been provided for this tutorial and is located at tutorial/data.manifest. Upload the manifest file to your AWS S3 bucket: ❗
# issue this command from the Falco home directory
aws s3 cp tutorial/data.manifest s3://[YOUR BUCKET]/falco-tutorial/data.manifest
# replace [YOUR BUCKET] with the name of your bucket
🔴 The following three instructions that launch jobs can be issued one after the other - without waiting for the previous job to finish. The EMR framework will run the jobs in order, starting each job only after the previous one has completed.
The split job takes the original data and splits it into smaller files for more efficient processing by Falco. The original input data stored on AWS S3 will not be removed. The modified data will be stored in a new AWS S3 location as specified in the configuration file tutorial/split_job.config. Type the following command at a command prompt in the Falco home directory: ❗
python3 submit_split_job.py --config tutorial/split_job.config
In Falco, the pre-processing step is optional. However, for this tutorial, an example pre-processing script is provided, and this step is compulsory to complete the tutorial as configured.
First examine the configuration file tutorial/preprocessing_job.config. The config file specifies which bash scripts are used for the pre-processing. You may wish to also examine these scripts to see how the pre-processing works in this case. Type the following at a command prompt from the Falco home directory: ❗
python3 submit_preprocessing_job.py --config tutorial/preprocessing_job.config
This is the main analysis job that processes the pre-processed data to determine the counts of features. The output will be two .csv files: the actual counts of features, and a separate file detailing quality assurance statistics relating to these counts.
The configuration file tutorial/analysis_job.config
contains the settings for the analysis job with STAR as the aligner
and featureCount for quantification - including any extra parameters for the tools used. The configuration file
also specifies the AWS S3 location for the final output .csv files. Enter the following at a command prompt from the Falco home directory: ❗
python3 submit_analysis_job.py --config tutorial/analysis_job.config
🔴 To change the alignment and/or quantification tool used in the analysis job, simply modify the aligner_tool and counter_tool options in the analysis configuration file (analysis_job.config).
Note that, as long as the above three steps (split, preprocess, and analysis) are entered in that order, the steps may be entered one after the other - without having to wait until the previous step finishes. The EMR framework ensures that a step does not start until the previously queued step has completed.
❗ Monitor the status of each step using the AWS EMR console - in the Step section.
Use the AWS S3 Console to monitor files in your S3 bucket.
To download a file from AWS S3 to your current directory, edit the following command with your details and execute at a shell command line: ❗
aws s3 cp s3://[YOUR BUCKET]/falco-tutorial/[YOUR USER NAME]/analysis/samples_expression.csv .
# you can also list all the AWS S3 files that begin with a particular prefix:
#aws s3 ls s3://[YOUR BUCKET]/falco-tutorial/[YOUR USER NAME]/analysis/
Alternatively, you could see a listing of the results via the AWS S3 Console by navigating to your bucket and output location.
🔴 As AWS charges by the hour for usage of its services, the user should terminate their cluster when finished. This is done by selecting your cluster on the AWS EMR console Cluster List, and clicking the Terminate button.
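The cluster can also be terminated from the command line, using the cluster ID noted earlier (the ID below is the example from above):
# terminate the tutorial cluster - replace the ID with your own cluster ID
aws emr terminate-clusters --cluster-ids j-1FDPU9CHN79W9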