Install
## Introduction
This document provides a general overview of the system requirements and installation procedures for the GMS. A beginner's guide to installation is also available.
To install the GMS you will need root/sudo access and a fast internet connection to download all packages and demonstration data sets (150 to 500 GB total, depending on which data set you select).
Minimum System requirements for processing the full example data through all pipelines:
- 100 GB for reference-related data used by pipelines
- 284 GB for test data
- 1 TB for the results (40x WGS tumor/normal, 1 lane of exome, 1 lane of tumor RNA, processing through MedSeq)
- 1 TB of /tmp space
- 48+ GB of RAM
- 12+ cores
- 2 weeks of processing time for full analysis (varies)
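The CPU, RAM, and /tmp minimums above can be checked before installing. The following is a minimal sketch, assuming a Linux host with `nproc`, `/proc/meminfo`, and `df` available. The helper names `meets_minimum` and `report` are illustrative and not part of the GMS, and the 1 TB /tmp figure is treated as 1000 GB.

```shell
#!/bin/sh
# Preflight sketch: compare this host against the minimums listed above.
# Helper names are hypothetical; thresholds are taken from the list.

# Succeeds when the measured value ($1) is at least the required value ($2).
meets_minimum() {
    [ "$1" -ge "$2" ]
}

# Prints one PASS/FAIL line for a resource: label, measured value, required value.
report() {
    label=$1; have=$2; need=$3
    if meets_minimum "$have" "$need"; then
        echo "PASS $label: $have (need $need)"
    else
        echo "FAIL $label: $have (need $need)"
    fi
}

# Linux-specific probes; each falls back to 0 when the probe is unavailable.
cores=$(nproc 2>/dev/null || echo 0)
ram_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo 2>/dev/null || true)
ram_gb=${ram_gb:-0}
tmp_gb=$(df -Pk /tmp 2>/dev/null | awk 'NR == 2 {print int($4 / 1024 / 1024)}')
tmp_gb=${tmp_gb:-0}

report "cores" "$cores" 12
report "RAM (GB)" "$ram_gb" 48
report "/tmp free (GB)" "$tmp_gb" 1000
```

Note that `MemTotal` is slightly lower than the installed RAM because the kernel reserves some memory, so a borderline RAM failure is worth checking by hand.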
For running on full-sized data we recommend a single high-performance blade with:
- ~80 processors
- ~1 TB RAM
- High performance storage system
Or, a high performance cluster.
For testing or review purposes using the down-sampled demonstration data sets we recommend (at minimum):
- Dual quad core @ 2.4GHz (64-bit)
- 16 GB RAM
- 2 TB 7200rpm 6Gb/s storage
Refer to the Supplementary Materials of the GMS manuscript for detailed examples of test systems we used for benchmarking.
Be sure to log out and back in after install. You will need the system to recognize that you are in the "genome" group.
For a standard, standalone, configuration on Ubuntu 12.04 run:
sudo apt-get install git ssh make
git clone https://github.com/genome/gms.git
cd gms
make
Once the installation completes make sure to log out and log in again to ensure your user permissions are set properly.
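To confirm that the new login session has picked up the "genome" group, something like the following can be used. It is a small sketch built on the standard `id -nG` command; the `has_group` helper is an illustrative name, not a GMS tool.

```shell
#!/bin/sh
# Succeeds when the space-separated group list in $1 contains the group named in $2.
has_group() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# id -nG prints the group names active in the current login session.
if has_group "$(id -nG)" genome; then
    echo "genome group is active in this session"
else
    echo "genome group is not active yet: log out and back in"
fi
```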
Installation on another platform requires a virtual machine (VM). On a POSIX system that supports Vagrant, a Vagrant/VirtualBox install can be performed automatically. This provides advantages over the third option below (manual VM installation, which is also usable on Windows), because the Vagrant configuration can be extended to manage a whole cluster on Amazon, OpenStack, VMware, etc.
NOTE: You must have sudo access on the host machine to use this option.
NOTE: You must install git, make, and ssh on your system before doing the following.
This is the recommended approach for running on Mac OS X. Be sure to install Xcode first.
git clone https://github.com/genome/gms.git
cd gms
make vminit # install virtualbox and vagrant, and an Ubuntu 12.04 VM
vagrant ssh # log into the VM
make # install the gms
Once the virtual machine has been created successfully, if you want to reboot the host system, first log out of the VM and run `vagrant suspend` to suspend it, then run `vagrant resume` to bring it back once the host is up again. Note that instead of logging in with `vagrant ssh` you could have used `ssh -p 2200 vagrant@127.0.0.1` and entered `vagrant` when prompted for a password. However, `vagrant ssh` is always the most convenient way to log in to the VM, as it automatically determines the port, credentials, etc.
For performance reasons, you should customize the resources available within your virtual machine guest instance. Refer to Working with GMS virtual machines (vagrant virtualbox) for detailed instructions and examples.
Installation on Windows, on Mac OS X without Xcode, or any other system that supports virtual machines:
On all other systems, including Windows, VirtualBox (or another VM provider) can be installed manually.
VirtualBox can be downloaded here:
https://www.virtualbox.org/wiki/Downloads
Download the correct ISO image for Ubuntu 12.04 (Precise). Either the Desktop or the Server version will work.
http://releases.ubuntu.com/precise/
Follow these instructions to install the image into VirtualBox:
http://www.wikihow.com/Install-Ubuntu-on-VirtualBox
On your VM, follow the standard Ubuntu 12.04 directions above.
sudo apt-get install git ssh make
git clone https://github.com/genome/gms.git
cd gms
make
Once the installation completes make sure to log out and log in again to ensure your user permissions are set properly.
For more complex configurations, such as installation on a cluster or on cloud servers, edit the file "Vagrantfile" and use the Amazon EC2 or OpenStack Vagrant plugins. Management of the cloud services can be done from any host that supports Vagrant.
An upcoming release will offer more support for managing the cluster.
For now, Linux administration expertise and Vagrant expertise are required to build a cluster. This system runs daily on a 4000 node cluster with 15 PB of network attached storage at The Genome Institute. Scalability beyond this point has not been measured.
The following checks can be made after logging into the GMS:
lsid # You should see the openlava cluster identification
lsload # You should see a report of available resources
bjobs # You should not have any unfinished jobs yet
bsub 'sleep 60' # You should be able to submit a job to openlava (run bjobs again to see it)
bhosts # You should see one host
bqueues # You should see four queues
genome disk group list # You should see four disk groups
genome disk volume list # You should see at least one volume for your local drive
genome sys gateway list # You should see two gateways, one for your new home system and one for the test data "GMS1"
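Before running the checks above one by one, it can help to confirm that every command is on the PATH of the current session. The sketch below only tests for presence with the shell built-in `command -v`; it does not interpret any output. The function name `missing_commands` is illustrative.

```shell
#!/bin/sh
# Prints each name from the argument list that cannot be found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Command names are the ones used in the checks above.
missing=$(missing_commands lsid lsload bjobs bsub bhosts bqueues genome)
if [ -z "$missing" ]; then
    echo "all GMS commands found"
else
    echo "not on PATH (log out and back in, or re-check the install):"
    echo "$missing"
fi
```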
Each GMS has a unique ID:
cat /etc/genome/sysid
echo $GENOME_SYS_ID
The entire installation lives in a directory with the ID embedded:
echo $GENOME_HOME # /opt/gms/$GENOME_SYS_ID
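The three values above should agree with one another. The following sketch compares them, assuming the `/opt/gms/$GENOME_SYS_ID` layout shown in the comment above; `expected_home` is a hypothetical helper name.

```shell
#!/bin/sh
# Builds the installation path described above for a given system ID.
expected_home() {
    printf '/opt/gms/%s\n' "$1"
}

sysid_file=/etc/genome/sysid
if [ -r "$sysid_file" ]; then
    file_id=$(cat "$sysid_file")
    # The environment should carry the same ID as the file on disk.
    [ "$file_id" = "${GENOME_SYS_ID:-}" ] ||
        echo "WARNING: GENOME_SYS_ID does not match $sysid_file"
    # GENOME_HOME should embed that ID.
    [ "$(expected_home "$file_id")" = "${GENOME_HOME:-}" ] ||
        echo "WARNING: GENOME_HOME is not $(expected_home "$file_id")"
else
    echo "$sysid_file not found: is the GMS installed on this host?"
fi
```

A mismatch most often means the environment was set before installation finished, which logging out and back in should resolve.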
The initial system has one node, and that node has only its local disk on which to perform analysis.
To expand the system to multiple nodes, add disks, or use network-attached storage, see the System Expansion Guide.
To install an example set of human cancer data, including reference sequences and annotation data sets, first log into the system, then move into the gms repo directory (e.g. `cd /vagrant`, `cd /vagrant/gms`, `cd ~/gms`, or `cd /opt/src/gms`), and finally run the command below. Replace 'X' with the number of gigabytes of physical memory available. Make sure that the metadata file is in the `./setup/metadata/` directory; the name of this file may be updated in the future depending on the model the metadata was generated from.
./setup/prime-system.pl --data=hcc1395_1tenth_percent --sync=tarball --low_resources --memory=Xgb --metadata=./setup/metadata/18177dd5eca44514a47f367d9804e17a.dat
After running `prime-system.pl`, log out and log back in. Some environment settings may be changed during this process that will not take effect until you log in again.
The example above will download a down-sampled demonstration data set. Valid values of `--data` are `hcc1395`, `hcc1395_1percent`, `hcc1395_1tenth_percent`, and `none`. If you choose the full `hcc1395` data set, which consists of whole genome, exome, and transcriptome data for tumor and normal, the download is 313 GB. It may consume considerable bandwidth and be very slow to install.
The example above assumes you are testing on a system with limited resources (`--low_resources`) and X GB of memory (`--memory=Xgb`). If you are on a large server you should drop these two options. Otherwise, specify the memory available on your system, or within your VM if applicable. See `./setup/prime-system.pl --help` for details.
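The value of X can be read from the machine rather than guessed. This is a minimal sketch, assuming a Linux system (or VM guest) that exposes `/proc/meminfo`; the helpers `kb_to_gb` and `memory_flag` are illustrative names. It rounds down, so the printed value never exceeds the physical memory.

```shell
#!/bin/sh
# Converts a kB figure, as reported in /proc/meminfo, to whole gigabytes (rounded down).
kb_to_gb() {
    echo $(( $1 / 1024 / 1024 ))
}

# Formats a gigabyte count in the --memory=Xgb form used by the command above.
memory_flag() {
    printf '%s\n' "--memory=${1}gb"
}

total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null || true)
if [ -n "$total_kb" ]; then
    memory_flag "$(kb_to_gb "$total_kb")"
else
    echo "could not read /proc/meminfo; set --memory by hand"
fi
```

Run it inside the VM when using Vagrant or VirtualBox, since the guest's memory, not the host's, is what the pipelines will see.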
You can now test some basic `genome` commands and perform some queries of the database.
# list the data you just imported
genome taxon list
genome individual list
genome sample list
genome library list
genome instrument-data list solexa
# list the pre-defined models (no results yet ... you will launch these and generate results)
genome model list
# list the processing profiles associated with those models
genome processing-profile list reference-alignment
genome processing-profile list somatic-variation
genome processing-profile list rna-seq
genome processing-profile list differential-expression
genome processing-profile list clin-seq
To build the genotype microarray models:
genome model build start "name='hcc1395-normal-snparray'"
genome model build start "name='hcc1395-tumor-snparray'"
When you start a build, an entry is added to the database and a unique build id is printed to the screen. The actual data processing is then scheduled to run in the background, allowing you to continue working at your terminal prompt. Using the `$build_id` that is printed to the screen, you can monitor the status of any build as follows:
genome model build view '$build_id'
Or using the model name to retrieve the same status update as follows:
genome model build view model.name='hcc1395-tumor-snparray'
To build the WGS tumor, WGS normal, exome tumor, and exome normal data, wait until the above finish, then run:
genome model build start "name='hcc1395-normal-refalign-exome'"
genome model build start "name='hcc1395-tumor-refalign-exome'"
genome model build start "name='hcc1395-normal-refalign-wgs'"
genome model build start "name='hcc1395-tumor-refalign-wgs'"
While those are building, you can run the RNA-Seq models:
genome model build start "name='hcc1395-normal-rnaseq'"
genome model build start "name='hcc1395-tumor-rnaseq'"
To build the WGS somatic and exome somatic models, wait until the ref-align models above complete, and then run:
genome model build start "name='hcc1395-somatic-exome'"
genome model build start "name='hcc1395-somatic-wgs'"
To build the differential expression models, wait until the rna-seq models above complete, and then run:
genome model build start "name='hcc1395-differential-expression'"
When all of the above complete, the MedSeq pipeline can be run:
genome model build start "name='hcc1395-clinseq'"
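The ordering described above can be summarized as four stages. The sketch below is a dry run: it only prints the start commands grouped by stage and leaves the waiting between stages to you (for example by watching `genome model build list`). The stage grouping is a simplification of the text, since the differential expression model is placed with the somatic models even though it only depends on the RNA-Seq models. The function names are illustrative.

```shell
#!/bin/sh
# Model names per stage, taken from the commands above.
stage_models() {
    case "$1" in
        1) echo "hcc1395-normal-snparray hcc1395-tumor-snparray" ;;
        2) echo "hcc1395-normal-refalign-exome hcc1395-tumor-refalign-exome" \
                "hcc1395-normal-refalign-wgs hcc1395-tumor-refalign-wgs" \
                "hcc1395-normal-rnaseq hcc1395-tumor-rnaseq" ;;
        3) echo "hcc1395-somatic-exome hcc1395-somatic-wgs" \
                "hcc1395-differential-expression" ;;
        4) echo "hcc1395-clinseq" ;;
        *) return 1 ;;
    esac
}

# Prints (does not run) the start command for every model in the given stage.
print_stage() {
    models=$(stage_models "$1") || return 1
    for model in $models; do
        printf "genome model build start \"name='%s'\"\n" "$model"
    done
}

for stage in 1 2 3 4; do
    echo "# stage $stage: start only after all earlier stages have finished"
    print_stage "$stage"
done
```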
To view the inputs to any model, you can do something like the following:
genome model input show --model="hcc1395-clinseq"
To view the status of all builds, run:
genome model build list
To monitor progress of any particular build, run:
genome model build view "id='$BUILD_ID'"
To examine results, go to the build directory listed above, or list it specifically:
genome model build list --filter "id='$BUILD_ID'" --show id,data_directory
To import new data:
*sample importer coming soon*
To make a new set of models for that data once imported, this tool will walk you through the process interactively:
# use the common name you used during import ("individual.common_name")
genome model clin-seq advise --individual "common_name = 'TST1'"
The GMS presumes that other GMS installations are untrusted by default, and that users on the same GMS are trusted by default. This allows each installation to make decisions about the balance of security and convenience as suits its needs, and to change those decisions over time.
Independent GMS installations lean entirely on standard Unix/Linux permissions and sharing facilities (SSH, NFS, etc.), and are as secure as those facilities. Another GMS cannot access your data any more than a random user on the internet could, but the system is configured to allow sharing to be as convenient and granular as the administrators prefer later.
Within a GMS instance, all users are in a single Unix group, and all directories are writable by that group. If a given group of users cannot be trusted to this level, it is best to install independent systems and use the "federation" facilities to control sharing. In the native environment of the GMS at Washington University, The Genome Institute uses one system for a staff of several hundred (programmers, analysts, and laboratory staff combined); isolated instances are in preparation only for medical diagnostics.
In a hierarchical organization, a group of individual GMS installations can export metadata to a larger GMS, without copying it, providing centralization of metadata, while distributing load, and keeping data in cost centers.
At its most extreme, in an environment that requires per-user security, each user could install a GMS independently, and use the "federation" capabilities to attach the systems of co-workers, share data, and perform larger scale analysis entirely within that framework.