Merge pull request #27 from ambrosejcarr/develop

Develop
ambrosejcarr · Nov 10, 2015 · eeade49
2 parents 1757271 + 7e1e234
commit eeade49
Showing 19 changed files with 1,618 additions and 408 deletions.
diff --git a/README.md b/README.md
@@ -1,2 +1,266 @@
-# seqc
-Single-Cell Sequencing Quality Control and Processing Software
+## SEquence Quality Control (SEQC -- /sek-si:/)
+
+### Overview:
+
+SEQC is a package designed to process sequencing data on the cloud. It also contains
+tools to analyze the processed data, which has a much smaller memory footprint, so it
+should be installed both locally and on your remote server. Note that some portions of
+the data pre-processing software require 30GB of RAM and are unlikely to run on most
+laptops and workstations.
+
+To facilitate easy installation and use, we have made available Amazon Machine Images
+(AMIs) that come with all of SEQC's dependencies pre-installed. In addition, we have
+uploaded the indices (-i/--index parameter, see "Running SEQC") and barcode data
+(-b/--barcodes) to public Amazon S3 repositories. These links can be provided to SEQC and
+it will automatically fetch them prior to initiating an analysis run.
+
+Amazon Web Services (AWS) is only accessible through a programmatic interface. To
+simplify the starting and stopping of amazon compute servers, we have written several
+plug-ins for a popular AWS interface, StarCluster. After installing StarCluster, you can
+simply call `starcluster start <cluster name>; starcluster sm <cluster name>` for instant
+access to a compute server that is immediately ready to run SEQC on data of your choosing.
+
+### Installation \& Dependencies:
+
+#### Dependencies For Remotely Running on AWS:
+1. Amazon Web Services (optional): If you wish to run SEQC on Amazon (AWS),
+you must set up an AWS account (see below). SEQC will function on any machine running
+a Unix-like operating system, but we only provide AMIs for AWS.
+2. StarCluster (optional; install locally): If you run SEQC on AWS, you create AWS
+compute instances with the Python 2.7
+<a href=https://github.com/jtriley/StarCluster>starcluster</a>
+package, which streamlines creation and shutdown of servers that are ready to run SEQC.
+Requires <a href=https://www.python.org/downloads/release/python-2710/>Python 2.7</a>
+with <a href=http://pip.readthedocs.org/en/stable/installing/>pip</a>.
+
+        $> git clone git://github.com/jtriley/StarCluster.git
+        $> cd StarCluster
+        $> sudo python distribute_setup.py
+        $> sudo python setup.py install
+
+3. Install Starcluster plugins and config template from SEQC:
+
+        $> git clone https://github.com/ambrosejcarr/seqc.git
+        $> cp seqc/src/plugins/*.py ~/.starcluster/plugins/
+        $> cp seqc/src/plugins/starcluster.config ~/.starcluster/config
+
+#### Dependencies for Local Installation or Other Cloud Computing Platforms:
+1. <a href=https://www.python.org/downloads/>Python 3</a>
+2. <a href=https://www.hdfgroup.org/HDF5>libhdf5</a>, a highly efficient database used to
+store SEQC output.
+3. SEQC depends on several python3 packages that are automatically installed and updated.
+To view these packages, please see the `setup.py` file packaged with SEQC.
+
+### Setting up HDF5 on your local computer:
+#### Installing from Source:
+1. After downloading the libhdf5 source, install it by typing:
+
+        $> ./configure --prefix=/usr/local/
+        $> make
+        $> make install
+
+2. Install pytables by typing: `pip3 install tables`
+
+#### Installing without previous configuration
+1. If you installed libhdf5 without giving arguments in the "configure" step, make sure
+that you have the necessary prereqs already installed:
+    * numpy
+    * numexpr
+    * cython
+2. Then set the `HDF5_DIR` environment variable by typing:
+
+        $> export HDF5_DIR=/your/installation/directory/for/hdf5
+
+3. You should now be able to install pytables: `pip3 install tables`
+
+### Setting up AWS, SEQC, and starcluster
+
+Once all dependencies have been installed, SEQC can be installed on any machine by typing:
+
+    $> git clone https://github.com/ambrosejcarr/seqc.git
+    $> pip3 install -e seqc/
+
+#### Setting up an AWS Account: 
+1. Navigate <a href=http://aws.amazon.com>here</a> and click “Create an AWS Account.”
+2. Enter your desired login e-mail and click the “I am a new user” radio button.
+3. Fill out the Login Credentials with your name and password.
+4. Fill out your contact information, read the AWS Customer Agreement, and accept if you
+wish to create an account.
+5. Save your AWS Access ID and Secret Key -- this information is very important!
+
+#### Create an RSA key to allow you to launch a cluster
+1. Sign into your AWS account and go to the EC2 Dashboard.
+2. Click “Key Pairs” in the NETWORK & SECURITY tab.
+3. Click “Create Key Pair” and give it a new name.
+4. This will download a new key called `<keyname>.pem` to your local machine.
+5. Rename the key so that it has an `.rsa` extension (e.g. `<keyname>.rsa`).
+6. Move the renamed key to a secure directory (e.g. `~/.ssh/`).
+7. Change the permission settings with `$> chmod 600 /path/to/key/keyname.rsa`
+
+#### Personalize the Dummy StarCluster Config File Provided by SEQC.
+1. Open the `~/.starcluster/config` file
+2. Under `[aws info]` enter the following information: 
+    1. `AWS_ACCESS_KEY_ID = #access_id` (your AWS Access ID, saved when you set up your
+    AWS account)
+    2. `AWS_SECRET_ACCESS_KEY = #secret_key` (your AWS Secret Key, saved at the same time)
+    3. `AWS_USER_ID = #user_id` (a numerical ID from AWS, found under IAM users). To find
+    it, click on your username in the top right corner of the AWS dashboard and click
+    “My Account”; your Account Id (a 12-digit number) should appear at the top of the
+    page.
+3. Under Defining EC2 Keypairs:
+    1. rename `[key <your_key_name>]` to the name of the key you generated above.
+    2. change key location to the location of your `<keyname>.rsa` file:
+    `KEY_LOCATION=~/.ssh/<keyname>.rsa`
+4. Under templates, find `[cluster c3.large]`
+    1. change key to `<your_key_name>`
+
+#### Install and Configure AWS CLI (AWS Command Line Interface).
+1. You can install by typing `pip install awscli`
+2. Then, configure it by typing `aws configure`:
+    * AWS Access Key ID [*******]: `access_id`
+    * AWS Secret Access Key [*******]: `secret_key`
+    * Default region name [us-west-2]: `us-east-1` (Adjust accordingly)
+    * Default output format [None]: `text`
+
+#### Start a cluster:
+1. `$> starcluster start -c <template_name> <cluster_name>`
+2. Wait until the cluster is finished setting up. Then, the cluster can be accessed
+using:
+3. `$> starcluster sshmaster -X <cluster_name>`  (-X gives you x-window plotting capability)
+4. To exit the cluster, simply type “exit”.
+5. Other StarCluster commands, such as `starcluster stop <cluster_name>` and
+`starcluster terminate <cluster_name>`, follow the same pattern.
+6. You can also copy files to/from the cluster using the put and get commands. 
+To copy a file or entire directory from your local computer to the cluster:
+`$> starcluster put mycluster /path/to/local/file/or/dir /remote/path/`
+7. To copy a file or an entire directory from the cluster to your local computer:
+`$> starcluster get mycluster /path/to/remote/file/or/dir /local/path/`
+
+
+### Running SEQC:
+
+After SEQC is installed, help can be listed:
+
+    $> SEQC -h
+    usage: SEQC [-h]
+                {in-drop,drop-seq,mars-seq,cel-seq,avo-seq,strt-seq,index} ...
+
+    positional arguments:
+      {in-drop,drop-seq,mars-seq,cel-seq,avo-seq,strt-seq,index}
+                            library construction method types
+        in-drop             in-drop help
+        drop-seq            drop-seq help
+        mars-seq            mars-seq help
+        cel-seq             cel-seq help
+        avo-seq             avo-seq help
+        strt-seq            strt-seq help
+        index               SEQC index functions
+
+    optional arguments:
+      -h, --help            show this help message and exit
+
+
+Help on parsing individual data types can be obtained by typing:
+
+    $> SEQC in-drop -h
+    usage: SEQC in-drop [-h] [-i I] [-n N] [-o O] [-b B] [-f [F [F ...]]]
+                        [-r [R [R ...]]] [-s [S]] [-m M] [-l L]
+                        [--star-args SA [SA ...]] [--list-default-star-args]
+
+    optional arguments:
+      -h, --help            show this help message and exit
+
+    Required Arguments:
+      -i I, --index I       star alignment index folder. This folder will be
+                            created if it does not exist
+      -n N, --n-threads N   number of threads to run
+      -o O, --output-file O
+                            stem of filename in which to store output
+      -b B, --barcodes B    location of serialized barcode object.
+
+    Input Files:
+      pass one input file type: sam (-s), raw fastq (-f, [-r]), or processed
+      fastq (-m)
+
+      -f [F [F ...]], --forward [F [F ...]]
+                            forward fastq file(s)
+      -r [R [R ...]], --reverse [R [R ...]]
+                            reverse fastq file(s)
+      -s [S], --sam [S]     sam file(s) containing aligned, pre-processed reads
+      -m M, --merged-fastq M
+                            fastq file containing merged, pre-processed records
+
+    Optional arguments for disambiguation:
+      -l L, --frag-len L    the number of bases from the 3 prime end to consider
+                            when determining trancript overlaps
+
+    Optional arguments for STAR aligner:
+      --star-args SA [SA ...]
+                            additional arguments for STAR. Pass as arg=value
+                            without leading "--". e.g. runMode=alignReads
+      --list-default-star-args
+                            list SEQDB default args for the STAR aligner
+
+
+All SEQC runs require that you pass a SEQC index (`-i/--index`). These are STAR indices,
+augmented by SEQC-specific files:
+
+1. `annotations.gtf`: a modified GTF file, containing truncated sequence sizes that
+reflect that we expect all data to fall within ~ 1kb of the transcriptional termination
+sites. In addition, transcripts are tagged with "SCIDs", identifiers that merge
+transcripts and genes which cannot be distinguished in the ~ 1kb of sequence that we
+expect to observe.
+2. `p_coalignments_array.p`: a binary file containing, for each SCID, the probability of
+observing a co-alignment to other genes.
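Review note: the README describes `annotations.gtf` as truncated to roughly 1 kb upstream of the transcriptional termination site. The sketch below only illustrates that idea on a single interval; `three_prime_window` is a hypothetical helper, not a function from SEQC, and it ignores exon structure, which a real GTF rewrite would have to handle.

```python
def three_prime_window(start, end, strand, window=1000):
    """Return (start, end) of the last `window` bases of a feature.

    Coordinates are 1-based and inclusive, as in GTF. Illustrative only:
    exon boundaries and splicing are ignored.
    """
    if strand not in ('+', '-'):
        raise ValueError("strand must be '+' or '-'")
    if strand == '+':
        # plus strand: the 3' end is at `end`, so keep the rightmost bases
        return max(start, end - window + 1), end
    # minus strand: the 3' end is at `start`, so keep the leftmost bases
    return start, min(end, start + window - 1)
```

Features shorter than `window` are returned unchanged, so the helper never extends an annotation beyond its original bounds.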
+
+Mouse and human indices can be found in our AWS S3 bucket at
+`s3://dplab-data/genomes/mm38/` and `s3://dplab-data/genomes/hg38`. These indices
+are built from recent releases of ENSEMBL genomes. These links can be passed directly to
+SEQC, which will download them before beginning the analysis.
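Review note: the README says `s3://` links can be passed directly to SEQC, but this diff does not show how they are resolved. As a minimal sketch of the first step, assuming the link is simply split into a bucket and a key prefix before any download call, a hypothetical `split_s3_url` helper could look like this (standard library only; it is not SEQC's code):

```python
from urllib.parse import urlparse


def split_s3_url(url):
    """Split an s3:// link into (bucket, key_prefix)."""
    parsed = urlparse(url)
    if parsed.scheme != 's3' or not parsed.netloc:
        raise ValueError('not an s3:// link: %r' % url)
    # the bucket is the network location; the key prefix is the path
    # without its leading slash
    return parsed.netloc, parsed.path.lstrip('/')
```

The returned pair is the shape that S3 clients such as `boto3` (already listed in `install_requires`) generally expect for bucket and key arguments.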
+
+If new indices must be generated, these can be produced by the SEQC index method:
+
+    $> SEQC index -h
+    usage: SEQC index [-h] [-b] [-t] -o O [O ...] -i I [-n N] [--phix]
+
+    optional arguments:
+      -h, --help            show this help message and exit
+      -b, --build           build a SEQC index
+      -t, --test            test a SEQC index
+      -o O [O ...], --organism O [O ...]
+                            build index for these organism(s)
+      -i I, --index I       name of folder where index should be built or
+                            containing the index to be verified
+      -n N, --n-threads N   number of threads to use when building index
+      --phix                add phiX to the genome index and GTF file.
+     
+     $> # for example, to build a mouse index with phiX features added to mm38, in a
+     $> # folder called 'mouse', using 7 threads
+     $> SEQC index -b -o mm38 -i mouse -n 7 --phix
+
+Some data types require serialized barcode objects (`-b/--barcodes`). These objects
+contain all of the barcodes for an experiment, in the form in which you expect to
+observe them in the sequencing data. For example, if you expect to observe the reverse
+complement of the barcodes you used to construct the library, then this object should be
+built from reverse complements.
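Review note: for readers unsure what "built from reverse complements" means in practice, here is a small sketch. `reverse_complement` is a hypothetical helper written for this note; it is not taken from `PROCESS_BARCODES`, whose implementation is not part of this diff.

```python
# translation table for the four bases plus the ambiguity code N
_COMPLEMENT = str.maketrans('ACGTN', 'TGCAN')


def reverse_complement(barcode):
    """Return the reverse complement of a DNA barcode string."""
    # complement each base, then reverse the order of the bases
    return barcode.upper().translate(_COMPLEMENT)[::-1]
```

For example, a library built with barcode `AAGT` would be matched against `ACTT` if the sequencer reads the opposite strand.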
+
+These barcode files can be found at `s3://dplab-data/barcodes/`. If you need to generate
+a new barcode object, this can be accomplished with the built-in `PROCESS_BARCODES`
+utility:
+
+    $> PROCESS_BARCODES -h
+    usage: PROCESS_BARCODES [-h] [-o O] [-b B [B ...]] [-p P]
+                            [--reverse-complement]
+
+    optional arguments:
+      -h, --help            show this help message and exit
+      -o O, --output_stem O
+                            name and path for output file
+      -b B [B ...], --barcode_files B [B ...]
+                            barcode files
+      -p P, --processor P   type of experiment barcodes will be used in
+      --reverse-complement  indicates that barcodes in fastq files are reverse
+                            complements of the barcodes found in barcode files
+
+Example usage:
+
+`$> PROCESS_BARCODES -o ./in_drop_barcodes -b <barcode_file> -p in-drop --reverse-complement`
+would save a new, reverse-complemented barcode object built from `<barcode_file>` at
+`./in_drop_barcodes.p`
diff --git a/setup.py b/setup.py
@@ -1,6 +1,25 @@
 __author__ = 'Ambrose J. Carr'
 
 from setuptools import setup
+from warnings import warn
+import os
+import shutil
+
+# pip3 cannot install external dependencies for python; warn user if external dependencies
+# are missing; do this at the end so that the users are more likely to see it.
+
+# look in /usr/lib/, /usr/local/lib/, /usr/hdf5/lib/, and /usr/local/hdf5/lib/ for the
+# hdf5 library; if found under an hdf5/ prefix, set HDF5_DIR to help pip3 install tables.
+h5fail = True
+if os.path.isfile('/usr/lib/libhdf5.so'):
+    h5fail = False
+elif os.path.isfile('/usr/local/lib/libhdf5.so'):
+    h5fail = False
+elif os.path.isfile('/usr/hdf5/lib/libhdf5.so'):
+    os.environ['HDF5_DIR'] = '/usr/hdf5/'
+    h5fail = False
+elif os.path.isfile('/usr/local/hdf5/lib/libhdf5.so'):
+    os.environ['HDF5_DIR'] = '/usr/local/hdf5/'
+    h5fail = False
 
 setup(name='seqc',
       version='0.1',
@@ -12,17 +31,34 @@
       packages=['seqc', 'seqc.sa_postprocess', 'seqc.sa_preprocess', 'seqc.sa_process'],
       install_requires=[
           'numpy>=1.10.0',
+          'cython>0.14',  # tables requirement
+          'numexpr>=2.4',  # tables requirement
           'pandas>=0.16.0',
           'matplotlib>=1.4.3',
           'seaborn',
           'scipy>=0.14.0',
           'boto3',
           'pyftpdlib',
-          'intervaltree'],
+          'intervaltree',
+          'tables'],
       scripts=['src/scripts/SEQC',
                'src/scripts/PROCESS_BARCODES',
                'src/scripts/TEST_BARCODES',
                'src/scripts/process_multi_file_scseq_experiment.py',
                'src/scripts/process_single_file_scseq_experiment.py'],
       )
 
+# print any warnings
+if h5fail:
+    warn("""
+SEQC: libhdf5 shared library "libhdf5.so" not found in /usr/local/lib/,
+/usr/lib/, /usr/hdf5/lib/, or /usr/local/hdf5/lib/.
+tables will not find libhdf5 and installation will likely fail unless the
+HDF5_DIR environment variable has been set to the location that HDF5 was
+installed into. If HDF5 is not installed, please install it prior to
+installing SEQC.
+""")
+
+# look for star
+if not shutil.which('STAR'):
+    warn('SEQC: STAR is not installed. SEQC will not be able to align files.')
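Review note: the libhdf5 lookup near the top of `setup.py` repeats the same check for each location, which makes it easy for one branch to drift out of step with the others. One possible alternative, offered as a sketch only, is a table-driven lookup. The name `find_hdf5` and the `CANDIDATES` table are assumptions made for this note; the paths are the ones the hunk already checks.

```python
import os

# (directory expected to hold libhdf5.so, value to export as HDF5_DIR or None)
CANDIDATES = [
    ('/usr/lib', None),
    ('/usr/local/lib', None),
    ('/usr/hdf5/lib', '/usr/hdf5/'),
    ('/usr/local/hdf5/lib', '/usr/local/hdf5/'),
]


def find_hdf5(candidates=CANDIDATES, libname='libhdf5.so'):
    """Return (found, hdf5_dir) for the first candidate containing libname."""
    for lib_dir, hdf5_dir in candidates:
        if os.path.isfile(os.path.join(lib_dir, libname)):
            return True, hdf5_dir
    return False, None
```

A caller would set `os.environ['HDF5_DIR']` only when the second element is not `None`, and warn when the first element is `False`.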
diff --git a/src/plugins/gitpull.py b/src/plugins/gitpull.py
@@ -0,0 +1,22 @@
+from starcluster.clustersetup import ClusterSetup
+from starcluster.logger import log
+from subprocess import check_output, call
+
+
+class DownloadRepo(ClusterSetup):
+    def __init__(self, dir_name='/data/software/'):
+        if not dir_name.endswith('/'):
+            dir_name += '/'
+        self.dir_name = dir_name
+
+    def run(self, nodes, master, user, user_shell, volumes):
+        folder = self.dir_name
+        log.info('installing seqc repo onto %s' % folder)
+
+        master.ssh.execute("mkdir -p %s" % folder)
+        location = folder + "seqc.tar.gz"
+        master.ssh.execute(
+            'curl -H "Authorization: token a22b2dc21f902a9a97883bcd136d9e1047d6d076" -L '
+            'https://api.github.com/repos/ambrosejcarr/seqc/tarball > %s' % location)
+        log.info("seqc.tar.gz has been downloaded to %s" % folder)
+        master.ssh.execute('pip3 install %s' % location)
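Review note: the `curl` call in this plugin embeds an access token directly in the source. A common alternative is to read the token from the environment when the plugin runs. The sketch below assumes a hypothetical `GITHUB_TOKEN` variable and a hypothetical `build_download_command` helper; neither exists in this PR. It is written for Python 3, so under the Python 2.7 interpreter that StarCluster requires, `shlex.quote` would need a counterpart such as `pipes.quote`.

```python
import os
import shlex

TARBALL_URL = 'https://api.github.com/repos/ambrosejcarr/seqc/tarball'


def build_download_command(location, token_var='GITHUB_TOKEN', environ=os.environ):
    """Build the curl command line, taking the token from the environment."""
    token = environ.get(token_var)
    if not token:
        raise RuntimeError('set %s before starting the cluster' % token_var)
    header = 'Authorization: token %s' % token
    # quote the header and output path so they survive the remote shell
    return 'curl -H %s -L %s > %s' % (
        shlex.quote(header), TARBALL_URL, shlex.quote(location))
```

The resulting string could then be passed to `master.ssh.execute` in place of the literal command.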