Sharing GMS results using an Amazon AWS Instance
Suppose you want to share the results of an entire whole genome, exome and/or transcriptome analysis. This might be released publicly along with a manuscript to allow readers to scrutinize the entire analysis process and explore the results in more detail than is typically possible in a printed article. Or you might want to release the results to specific collaborators prior to publication or share the results for some other purpose.
The simplest way to release a complete GMS result is to perform the analysis on a cloud computing resource such as Amazon AWS EC2 and then make that instance available.
The following tutorial describes how to share a complete GMS result including the raw input data, the processing profiles you used, the GMS software code used to run the pipelines, all bioinformatic tools used, the reference genome sequences, gene annotations, builds containing the final results, etc.
This assumes that you have already completed the installation as described in the Beginner's guide to installation on Amazon AWS and completed the analysis you wish to share as described in the Beginner's guide to the demonstration analysis. As you will see below, in addition to sharing the results on an Amazon AWS instance, you can also share the login credentials, which means that you could allow your collaborators to log in and perform these analyses as well.
When you install the GMS, a custom Genome Web Viewer is automatically installed and configured to allow you to browse entities of the GMS. These include models, processing profiles, instrument data, subjects, and builds. This service is automatically running on your instance and providing access to it is as simple as configuring the security of your instance and providing a URL.
In order for outside users to log into the instance you will have to explicitly allow this in the 'Security Group' settings for the instance you wish to share. To do this, follow these steps.
First log into the AWS EC2 console and view your running instances:
Next, find the instance you wish to share and determine which Security Group is applied to it. You will need to edit this security group to make sure incoming web access is permitted. You can get to these settings either by clicking the name of the security group in the description section of the instance you wish to share, or by noting the name and following the 'Security Groups' link under 'Network and Security' in the list of options in the left sidebar of the console. If you do that you will see something like this:
In this example I have already created a security group called 'SGMS_HTTP-and-SSH'. Make sure your security group is selected by clicking on that row in the console. Next, to allow incoming web access you will need to select the 'Inbound' tab, then 'Edit', then 'Add Rule' and select 'HTTP' from the drop down menu.
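For readers who prefer the command line, the same inbound rule can be added with the AWS CLI instead of the console. The following is a minimal sketch that only prints the command so it can be reviewed before it is run. It assumes the AWS CLI is installed and configured for your account; the helper name ingress_cmd is made up for this example and the security group ID is a placeholder, not a value from this tutorial.

```shell
# Print (not run) the AWS CLI call that opens one inbound TCP port on a security group.
# ingress_cmd is a hypothetical helper; the group ID used below is a placeholder.
# $1 = security group ID, $2 = TCP port, $3 = allowed source range in CIDR notation
ingress_cmd() {
    printf 'aws ec2 authorize-security-group-ingress --group-id %s --protocol tcp --port %s --cidr %s\n' \
        "$1" "$2" "$3"
}

# HTTP (port 80) from any address, matching the 'HTTP' rule chosen in the console.
ingress_cmd sg-0123456789abcdef0 80 0.0.0.0/0
```

Running the printed command should have the same effect as the console steps. Passing a narrower source range, such as a single collaborator address followed by /32, in place of 0.0.0.0/0 limits who can connect.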
Note that changes to the rules of a security group are normally applied to running instances right away, so a reboot should not be needed. If the new rule does not seem to take effect, you can reboot the instance from within a terminal session by typing
sudo reboot
or from the EC2 console by going back to the 'Instances' page, right-clicking the row for your instance and selecting 'Reboot' under 'Actions'.
Once the instance is running and reachable, find the 'Public DNS' or 'Public IP' for your instance. These can be found in the description section for the instance after selecting 'Instances' and clicking on the row for the instance you are sharing.
Enter either the 'Public DNS' or 'Public IP' in a browser window. You will land on the home page for the GMS Web Viewer running on your GMS instance. You can share this link with your collaborators and they should be able to see everything you see. For example: http://ec2-52-10-204-88.us-west-2.compute.amazonaws.com/
In the top right-hand corner of this view is the unique GMS system ID created when you installed the GMS. To view a particular processing profile, select the processing profiles tab and then click on the profile you want, for example the default exome somatic-variation processing profile.
To view results, in this case for the example HCC1395 analysis, follow these steps. Go back to the GMS home page by clicking on your GMS ID or 'Genome Modeling System' at the top of the page. Then select the 'Builds' tab and enter 'hcc1395' in the 'Filter results' box. You may also want to increase the number of records shown per page. You can also click on any column header to sort the table by those values. For example, in the following screenshot I have sorted by 'Model'.
To see some results, try clicking on the build ID for the model 'hcc1395-clinseq'. This page will show you a detailed summary of this build, its inputs, workflow stages, etc. To browse results follow the 'data' link and enter the 'TST1' directory.
All of the clin-seq (aka 'med-seq') results are available here. Note that in this example we used the downsampled data instead of the complete data set. So the results will be conceptually identical to those described in the GMS manuscript but will not match exactly. We could have easily shared results from the complete analysis but it would cost more to maintain the Amazon instance persistently.
Individual results can now be browsed and shared by URL. For example the circos plot for HCC1395 is here: http://ec2-52-10-204-88.us-west-2.compute.amazonaws.com/opt/gms/U6GNV74/fs/U6GNV74/info/model_data/18177dd5eca44514a47f367d9804e17a/build110d56825c354097a11b97d52f96aef5/TST1/circos/circos.png
A complete report of annotated SNVs and Indels with supporting read counts in Excel format can be found here: http://ec2-52-10-204-88.us-west-2.compute.amazonaws.com/opt/gms/U6GNV74/fs/U6GNV74/info/model_data/18177dd5eca44514a47f367d9804e17a/build110d56825c354097a11b97d52f96aef5/TST1/snv_indel_report/TST1_final_filtered_coding_clean.xls
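The two example URLs above share the same prefix and differ only in the file path inside the build directory. As a rough sketch, a link to any file in a build could be assembled from its parts as shown below. The layout is inferred from these two examples only, so other installations or build types may differ, and gms_result_url is a hypothetical helper rather than a GMS command.

```shell
# gms_result_url is a hypothetical helper, not part of the GMS.
# The URL layout is an assumption inferred from the two example links in this tutorial.
# $1 = public DNS name, $2 = GMS system ID, $3 = model directory name,
# $4 = build ID, $5 = path of the file inside the build directory
gms_result_url() {
    printf 'http://%s/opt/gms/%s/fs/%s/info/model_data/%s/build%s/%s\n' \
        "$1" "$2" "$2" "$3" "$4" "$5"
}

# Rebuild the circos plot link from the example instance.
gms_result_url ec2-52-10-204-88.us-west-2.compute.amazonaws.com U6GNV74 \
    18177dd5eca44514a47f367d9804e17a 110d56825c354097a11b97d52f96aef5 \
    TST1/circos/circos.png
```

Changing only the last argument, for example to TST1/snv_indel_report/TST1_final_filtered_coding_clean.xls, gives the link to another file in the same build.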
A detailed description of result files in this build and the other build types can be found here: Location and description of results files in GMS pipelines
Some GMS build results contain links to data hosted through the GMS web viewer. This allows (for example) the creation of IGV session files with resource paths to BAM files hosted through the web server. This can be convenient for sharing or accessing data remotely. To configure the GMS to link to web-served files, edit the following environment variables in /etc/genome.conf, changing each value from localhost to the appropriate public domain name.
WARNING: BE VERY CAREFUL TO CONSIDER THE SECURITY/SENSITIVITY OF YOUR DATA BEFORE EXPOSING IT TO THE WWW IN THIS MANNER.
export GENOME_SYS_SERVICES_SEARCH_URL='http://ec2-52-10-204-88.us-west-2.compute.amazonaws.com'
export GENOME_SYS_SERVICES_WEB_VIEW_URL='http://ec2-52-10-204-88.us-west-2.compute.amazonaws.com'
export GENOME_SYS_SERVICES_FILES_URL='http://ec2-52-10-204-88.us-west-2.compute.amazonaws.com'
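If you would rather script this change than edit the file by hand, something like the following sketch could be used. It rewrites only the three variables shown above and leaves every other line alone. The helper name set_gms_urls is made up for this example, and the sample file (including its 'http://localhost' values and the extra SOME_OTHER_SETTING line) is an assumption used for demonstration, not a copy of a real /etc/genome.conf.

```shell
# set_gms_urls is a hypothetical helper, not part of the GMS.
# $1 = path to a genome.conf-style file, $2 = public DNS name or IP of the instance
set_gms_urls() {
    for name in SEARCH WEB_VIEW FILES; do
        # Replace the whole export line for this variable, whatever its current value is.
        sed "s|^export GENOME_SYS_SERVICES_${name}_URL=.*|export GENOME_SYS_SERVICES_${name}_URL='http://$2'|" \
            "$1" > "$1.tmp" && mv "$1.tmp" "$1"
    done
}

# Demo on a throwaway sample file; its contents are assumed, not taken from a real install.
sample=$(mktemp)
cat > "$sample" <<'EOF'
export GENOME_SYS_SERVICES_SEARCH_URL='http://localhost'
export GENOME_SYS_SERVICES_WEB_VIEW_URL='http://localhost'
export GENOME_SYS_SERVICES_FILES_URL='http://localhost'
export SOME_OTHER_SETTING='unchanged'
EOF
set_gms_urls "$sample" ec2-52-10-204-88.us-west-2.compute.amazonaws.com
cat "$sample"
```

To apply this to a real instance, one cautious approach is to run the function on a copy of /etc/genome.conf, inspect the result, and only then copy it back into place with sudo.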
In order to allow users to log into your GMS instance and use command line GMS commands to perform analyses or explore existing results, you will need to do two things.
First, any user that wishes to log in will need the SSH 'key pair' specified when you originally created the instance. You will have saved this '.pem' file somewhere and you will need to share it with others. It is up to you to decide how to safely share these keys with collaborators. For demonstration purposes only, we will allow you to download an example key pair here and log into the running instance described above. When you create key pairs you may decide to create a new key pair for each project, set of collaborators, etc. The key pair associated with any running instance can be found by going to the instances section of the EC2 console, selecting the desired instance, and viewing the 'Key pair name' in the description pane of the console. To view all of your existing key pairs you can follow the 'Key Pairs' link under 'Network & Security' in the navigation bar at the left of the console. This will generate a view like the following:
Second, as we did with the security group above for HTTP (Web) access, you will need to modify the 'Security Group' associated with the instance to ensure that incoming SSH access is permitted. To do this, select the security group for your instance under 'Security Groups' in the EC2 console, select the 'Inbound' tab, select 'Edit', then 'Add Rule', and select 'SSH' from the drop-down menu.
As before, changes to the security group are normally applied to a running instance right away, so you should only need to reboot the instance if the new rule does not seem to take effect. Once the security settings are applied, any user with the '.pem' key pair file and the public domain name (or public IP) will be able to log in. For example, to log into the example instance we have left running you can do the following in a terminal session.
#Download the key pair file using wget, modify permissions of this file and log in as 'ubuntu'
wget https://raw.githubusercontent.com/genome/gms/ubuntu-12.04/setup/aws/sgms-public-key1.pem
chmod 600 sgms-public-key1.pem
ssh -i sgms-public-key1.pem ubuntu@52.10.204.88
#Now perform a few simple queries to test the GMS
genome model list
genome model build list --show id,model_id,model_name,status
genome model build view model.name=hcc1395-clinseq
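Collaborators who log in often may find it convenient to store the connection details from the ssh command above in an SSH client configuration entry. The sketch below writes such an entry to a scratch file. The helper name add_ssh_host and the alias gms-demo are made up for this example, and the key file location under your home directory is an assumption about where you saved it.

```shell
# add_ssh_host is a hypothetical helper; 'gms-demo' is an alias chosen for this example.
# $1 = alias, $2 = host name or IP, $3 = login user,
# $4 = path to the .pem file, $5 = SSH client config file to append to
add_ssh_host() {
    printf 'Host %s\n    HostName %s\n    User %s\n    IdentityFile %s\n    IdentitiesOnly yes\n' \
        "$1" "$2" "$3" "$4" >> "$5"
}

# Write to a scratch file here; in real use the target would usually be ~/.ssh/config.
cfg=$(mktemp)
add_ssh_host gms-demo 52.10.204.88 ubuntu "$HOME/sgms-public-key1.pem" "$cfg"
cat "$cfg"
```

With an entry like this in place, ssh -F followed by the config file path and the alias gms-demo should open the same session as the longer command, and the -F option can be dropped once the entry lives in ~/.ssh/config.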
To further explore results and GMS commands in this demonstration instance, please refer to additional tutorials such as:
- Beginner's-Guide-to-the-Demonstration-Analysis
- Commonly-used-GMS-Commands
- Location-and-description-of-results-files-in-GMS-pipelines
Note that all of the above security settings can be easily configured when you first create the EC2 instance. You can create a 'Key Pair' with sharing in mind and name it in a descriptive way for the project. You can also create a custom 'Security Group' and configure it for the sharing strategy you would like.
The Amazon EC2 console allows considerable flexibility in the configuration of firewall rules for incoming and outbound network traffic. You may wish to further customize the rules for SSH and HTTP or add rules for additional services. For example, you might expose the PostgreSQL service and allow users to query the SQL database directly. For more details on advanced configuration of EC2 security groups you can refer to the Amazon EC2 Security Groups Tutorial.
There are serious security implications for several of the actions described above. If you store sensitive data regarding human subjects (including but not limited to raw genome sequence data) on an instance and that instance is not secure, you may have created a privacy concern. Before hosting human data in this fashion you should consult the policies of your institute.
In general (and despite the example above) you should be very careful about sharing the key pair files for an instance. You should also limit the incoming services you allow. Since the security of an Amazon EC2 instance is highly configurable, it should in theory be possible to securely share data with targeted collaborators.
Issues of human genome sequencing, privacy, informed consent, return of results to patients, etc. are in a rapidly evolving state. Amazon has documentation relating to these areas that is available upon request or via their website. These areas are of great interest to us. For background reading in the area you might refer to documentation on: HIPAA, GINA, CLIA, the ASHG Policy and Position Statement Archive, the NCHPEG, the NHGRI genetic discrimination overview, and so on.
Testing of the GMS on Amazon AWS EC2 and development of this documentation was generously supported by Amazon AWS Education grants.