Merge pull request #27 from ambrosejcarr/develop
Develop
ambrosejcarr committed Nov 10, 2015
2 parents 1757271 + 7e1e234 commit eeade49
Showing 19 changed files with 1,618 additions and 408 deletions.
268 changes: 266 additions & 2 deletions README.md
@@ -1,2 +1,266 @@
# seqc
Single-Cell Sequencing Quality Control and Processing Software
## SEquence Quality Control (SEQC -- /sek-si:/)

### Overview:

SEQC is a package that is designed to process sequencing data on the cloud. It also
contains tools to analyze processed data, which has a much smaller memory footprint.
Thus, it should be installed both locally and on your remote server. However, it should
be noted that some portions of the data pre-processing software require 30GB of RAM to
run, and are unlikely to work on most laptops and workstations.

To facilitate easy installation and use, we have made available Amazon Machine Images
(AMIs) that come with all of SEQC's dependencies pre-installed. In addition, we have
uploaded the indices (`-i/--index` parameter, see "Running SEQC") and barcode data
(`-b/--barcodes`) to public Amazon S3 repositories. These links can be provided to SEQC, and
it will automatically fetch them prior to initiating an analysis run.

Amazon Web Services (AWS) is only accessible through a programmatic interface. To
simplify the starting and stopping of Amazon compute servers, we have written several
plug-ins for a popular AWS interface, StarCluster. After installing StarCluster, you can
call `starcluster start <cluster name>; starcluster sm <cluster name>` for instant
access to a compute server that is immediately ready to run SEQC on data of your choosing.

### Installation & Dependencies:

#### Dependencies For Remotely Running on AWS:
1. Amazon Web Services (optional): If you wish to run SEQC on Amazon (AWS),
you must set up an AWS account (see below). SEQC will function on any machine running
a Unix-like operating system, but we only provide AMIs for AWS.
2. StarCluster (optional; install locally): If you run SEQC on AWS, you create AWS
compute instances with the Python 2.7
<a href=https://github.com/jtriley/StarCluster>starcluster</a>
package, which streamlines creation and shutdown of servers that are ready to run SEQC.
Requires <a href=https://www.python.org/downloads/release/python-2710/>Python 2.7</a>
with <a href=http://pip.readthedocs.org/en/stable/installing/>pip</a>.

    $> git clone git://github.com/jtriley/StarCluster.git
    $> cd StarCluster
    $> sudo python distribute_setup.py
    $> sudo python setup.py install

3. Install Starcluster plugins and config template from SEQC:

    $> git clone https://github.com/ambrosejcarr/seqc.git
    $> cp seqc/src/plugins/*.py ~/.starcluster/plugins/
    $> cp seqc/src/plugins/starcluster.config ~/.starcluster/config

#### Dependencies for Local Installation or Other Cloud Computing Platforms:
1. <a href=https://www.python.org/downloads/>Python 3</a>
2. <a href=https://www.hdfgroup.org/HDF5>libhdf5</a>, the library for the HDF5 file format,
which SEQC uses to store its output.
3. SEQC depends on several Python 3 packages that are automatically installed and updated.
To view these packages, see the `setup.py` file packaged with SEQC.

### Setting up HDF5 on your local computer:
#### Installing from Source:
1. After downloading the libhdf5 source, it can be installed by typing:

    $> ./configure --prefix=/usr/local/
    $> make
    $> make install

2. Install pytables by typing: `pip3 install tables`

#### Installing without previous configuration
1. If you installed libhdf5 without giving arguments in the "configure" step, make sure
that you have the necessary prereqs already installed:
* numpy
* numexpr
* cython
2. Then set the `HDF5_DIR` environment variable by typing:

    $> export HDF5_DIR=/your/installation/directory/for/hdf5

3. You should now be able to install pytables: `pip3 install tables`
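
Before running `pip3 install tables`, it can help to confirm where `libhdf5.so` actually lives. The following is a minimal sketch, not part of SEQC: the function name `find_hdf5_prefix` and the default candidate list are assumptions (the candidates mirror the directories probed by SEQC's `setup.py`), and it only looks for a file named `libhdf5.so` under `<prefix>/lib/`.

```python
import os


def find_hdf5_prefix(candidates=("/usr", "/usr/local", "/usr/hdf5", "/usr/local/hdf5")):
    """Return the first prefix whose lib/ directory contains libhdf5.so, else None.

    The candidate prefixes are illustrative assumptions, not an exhaustive list.
    """
    for prefix in candidates:
        if os.path.isfile(os.path.join(prefix, "lib", "libhdf5.so")):
            return prefix
    return None


if __name__ == "__main__":
    prefix = find_hdf5_prefix()
    if prefix is None:
        print("libhdf5.so not found; install HDF5 or set HDF5_DIR manually")
    else:
        # Print the export line so it can be pasted into the shell.
        print("export HDF5_DIR=%s" % prefix)
```

If the library is installed somewhere else, pass that directory in `candidates` or simply export `HDF5_DIR` by hand as described above.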

### Setting up AWS, SEQC, and starcluster

Once all dependencies have been installed, SEQC can be installed on any machine by typing:

    $> git clone https://github.com/ambrosejcarr/seqc.git
    $> pip3 install -e seqc/

#### Setting up an AWS Account:
1. Navigate <a href=http://aws.amazon.com>here</a> and click “Create an AWS Account.”
2. Enter your desired login e-mail and click the “I am a new user” radio button.
3. Fill out the Login Credentials with your name and password.
4. Fill out your contact information, read the AWS Customer Agreement, and accept if you
wish to create an account.
5. Save your AWS Access ID and Secret Key -- this information is very important!

#### Create an RSA key to allow you to launch a cluster
1. Sign into your AWS account and go to the EC2 Dashboard.
2. Click “Key Pairs” in the NETWORK & SECURITY tab.
3. Click “Create Key Pair” and give it a new name.
4. This will download a new key called `<keyname>.pem` to your local machine.
5. Rename the key to use an `.rsa` extension (e.g. `<keyname>.rsa`) and move it to a secure
directory (e.g. `~/.ssh/`).
6. Change the permission settings with `$> chmod 600 /path/to/key/keyname.rsa`
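
The rename, move, and `chmod 600` steps can also be scripted. This is a hedged sketch only: `install_key` is a hypothetical helper name, and the paths passed to it are whatever your download and key directories happen to be.

```python
import os
import shutil
import stat


def install_key(pem_path, ssh_dir):
    """Move <keyname>.pem to <ssh_dir>/<keyname>.rsa and restrict it to mode 600."""
    keyname = os.path.splitext(os.path.basename(pem_path))[0]
    os.makedirs(ssh_dir, exist_ok=True)
    rsa_path = os.path.join(ssh_dir, keyname + ".rsa")
    shutil.move(pem_path, rsa_path)
    # Owner read/write only; equivalent to `chmod 600`.
    os.chmod(rsa_path, stat.S_IRUSR | stat.S_IWUSR)
    return rsa_path
```

For example, `install_key(os.path.expanduser("~/Downloads/mykey.pem"), os.path.expanduser("~/.ssh"))` would leave the key at `~/.ssh/mykey.rsa`, assuming the browser saved it under `~/Downloads`.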

#### Personalize the Dummy StarCluster Config File Provided by SEQC.
1. Open the `~/.starcluster/config` file
2. Under `[aws info]` enter the following information:
    1. `AWS_ACCESS_KEY_ID = #access_id` (This is the AWS Access ID you saved when setting up
    your AWS account)
    2. `AWS_SECRET_ACCESS_KEY = #secret_key` (This is the AWS Secret Key you saved when
    setting up your AWS account)
    3. `AWS_USER_ID = #user_id` (This is a numerical ID from AWS, found under IAM users).
    To find it, click on your username on the top right corner of the AWS dashboard and click
    “My Account” -- your Account Id should pop up at the top of the page (a 12-digit
    number)
3. Under Defining EC2 Keypairs:
    1. rename `[key <your_key_name>]` to the name of the key you generated above.
    2. change key location to the location of your `<keyname>.rsa` file:
    `KEY_LOCATION=~/.ssh/<keyname>.rsa`
4. Under templates, find `[cluster c3.large]`
    1. change key to `<your_key_name>`
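
Because the StarCluster config is an INI-style file, the edits above can be sketched with Python's `configparser`. This is an illustration under assumptions, not SEQC tooling: the `TEMPLATE` text is a cut-down stand-in for the real template, the `KEYNAME` option name in the cluster section is assumed from StarCluster's usual config layout, and `personalize` is a hypothetical helper. Note that `configparser` drops comments when it rewrites a file, so editing by hand may be preferable for the real config.

```python
import configparser
import io

# Cut-down stand-in for ~/.starcluster/config (assumed layout).
TEMPLATE = """\
[aws info]
AWS_ACCESS_KEY_ID = #access_id
AWS_SECRET_ACCESS_KEY = #secret_key
AWS_USER_ID = #user_id

[key <your_key_name>]
KEY_LOCATION = ~/.ssh/<keyname>.rsa

[cluster c3.large]
KEYNAME = <your_key_name>
"""


def personalize(template_text, access_id, secret_key, user_id, keyname, key_location):
    """Return the config text with credentials and key name filled in."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names upper-case
    parser.read_string(template_text)

    aws = parser["aws info"]
    aws["AWS_ACCESS_KEY_ID"] = access_id
    aws["AWS_SECRET_ACCESS_KEY"] = secret_key
    aws["AWS_USER_ID"] = user_id

    # Replace the placeholder key section with one named after the real key.
    for section in parser.sections():
        if section.startswith("key "):
            parser.remove_section(section)
    parser["key " + keyname] = {"KEY_LOCATION": key_location}
    parser["cluster c3.large"]["KEYNAME"] = keyname

    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```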

#### Install and Configure AWS CLI (AWS Command Line Interface).
1. You can install by typing `pip install awscli`
2. Then, configure it by typing `aws configure`:
* AWS Access Key ID [*******]: `access_id`
* AWS Secret Access Key [*******]: `secret_key`
* Default region name [us-west-2]: `us-east-1` (Adjust accordingly)
* Default output format [None]: `text`

#### Start a cluster:
1. `$> starcluster start -c <template_name> <cluster_name>`
2. Wait until the cluster is finished setting up. Then, the cluster can be accessed
using `$> starcluster sshmaster -X <cluster_name>` (-X gives you x-window plotting capability).
3. To exit the cluster, simply type “exit”.
4. Other useful commands include `starcluster stop <cluster_name>` and
`starcluster terminate <cluster_name>`.
5. You can also copy files to/from the cluster using the put and get commands.
To copy a file or entire directory from your local computer to the cluster:
`$> starcluster put mycluster /path/to/local/file/or/dir /remote/path/`
6. To copy a file or an entire directory from the cluster to your local computer:
`$> starcluster get mycluster /path/to/remote/file/or/dir /local/path/`
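
If you drive these commands from a script, they can be assembled as argument lists and handed to `subprocess`. The sketch below is illustrative only: the helper names are hypothetical, and it uses just the subcommands shown above (`start -c`, `sshmaster`, `put`).

```python
import subprocess


def start_cmd(cluster_name, template):
    """Argument list for `starcluster start -c <template_name> <cluster_name>`."""
    return ["starcluster", "start", "-c", template, cluster_name]


def sshmaster_cmd(cluster_name, x_forwarding=True):
    """Argument list for `starcluster sshmaster [-X] <cluster_name>`."""
    cmd = ["starcluster", "sshmaster"]
    if x_forwarding:
        cmd.append("-X")
    return cmd + [cluster_name]


def put_cmd(cluster_name, local_path, remote_path):
    """Argument list for `starcluster put <cluster_name> <local> <remote>`."""
    return ["starcluster", "put", cluster_name, local_path, remote_path]


def run(cmd):
    """Run a command; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.check_call(cmd)
```

Calling `run(start_cmd("mycluster", "c3.large"))` would then launch a cluster from the `c3.large` template, provided `starcluster` is on your `PATH`.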


### Running SEQC:

After SEQC is installed, help can be listed:

    $> SEQC -h
    usage: SEQC [-h]
                {in-drop,drop-seq,mars-seq,cel-seq,avo-seq,strt-seq,index} ...

    positional arguments:
      {in-drop,drop-seq,mars-seq,cel-seq,avo-seq,strt-seq,index}
                            library construction method types
        in-drop             in-drop help
        drop-seq            drop-seq help
        mars-seq            mars-seq help
        cel-seq             cel-seq help
        avo-seq             avo-seq help
        strt-seq            strt-seq help
        index               SEQC index functions

    optional arguments:
      -h, --help            show this help message and exit


Help on parsing individual data types can be obtained by typing:

    $> SEQC in-drop -h
    usage: SEQC in-drop [-h] [-i I] [-n N] [-o O] [-b B] [-f [F [F ...]]]
                        [-r [R [R ...]]] [-s [S]] [-m M] [-l L]
                        [--star-args SA [SA ...]] [--list-default-star-args]

    optional arguments:
      -h, --help            show this help message and exit

    Required Arguments:
      -i I, --index I       star alignment index folder. This folder will be
                            created if it does not exist
      -n N, --n-threads N   number of threads to run
      -o O, --output-file O
                            stem of filename in which to store output
      -b B, --barcodes B    location of serialized barcode object.

    Input Files:
      pass one input file type: sam (-s), raw fastq (-f, [-r]), or processed
      fastq (-m)

      -f [F [F ...]], --forward [F [F ...]]
                            forward fastq file(s)
      -r [R [R ...]], --reverse [R [R ...]]
                            reverse fastq file(s)
      -s [S], --sam [S]     sam file(s) containing aligned, pre-processed reads
      -m M, --merged-fastq M
                            fastq file containing merged, pre-processed records

    Optional arguments for disambiguation:
      -l L, --frag-len L    the number of bases from the 3 prime end to consider
                            when determining trancript overlaps

    Optional arguments for STAR aligner:
      --star-args SA [SA ...]
                            additional arguments for STAR. Pass as arg=value
                            without leading "--". e.g. runMode=alignReads
      --list-default-star-args
                            list SEQDB default args for the STAR aligner


All SEQC runs require that you pass a SEQC index (`-i/--index`). These are STAR indices,
augmented by SEQC-specific files:

1. `annotations.gtf`: a modified GTF file, containing truncated sequence sizes that
reflect that we expect all data to fall within ~ 1kb of the transcriptional termination
sites. In addition, transcripts are tagged with "SCIDs", identifiers that merge
transcripts and genes which cannot be distinguished in the ~ 1kb of sequence that we
expect to observe.
2. `p_coalignments_array.p`: a binary file containing, for each SCID, the probability of
observing a co-alignment to other genes.

Human and mouse indices can be found on our AWS S3 bucket at
`s3://dplab-data/genomes/mm38/` and `s3://dplab-data/genomes/hg38`. These indices
are built from recent releases of ENSEMBL genomes. These links can be passed directly to
SEQC, which will download them before beginning the analysis.
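
To make the flag usage concrete, here is a small sketch that assembles an `in-drop` command line from the options listed in the help text above. It is only an illustration: `build_in_drop_command` is a hypothetical helper, and the barcode object name and fastq file names in the usage note are placeholders rather than real files.

```python
def build_in_drop_command(index, barcodes, output_stem, forward, reverse, n_threads=7):
    """Assemble `SEQC in-drop` arguments from the documented -i/-n/-o/-b/-f/-r flags."""
    cmd = ["SEQC", "in-drop",
           "-i", index,
           "-n", str(n_threads),
           "-o", output_stem,
           "-b", barcodes]
    cmd += ["-f"] + list(forward)   # forward fastq file(s)
    cmd += ["-r"] + list(reverse)   # reverse fastq file(s)
    return cmd
```

For example, `build_in_drop_command("s3://dplab-data/genomes/mm38/", "s3://dplab-data/barcodes/<barcode_object>", "my_run", ["r1.fastq"], ["r2.fastq"])` yields a list that can be passed to `subprocess.check_call`; substitute a real barcode object for the placeholder first.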

If new indices must be generated, these can be produced by the SEQC index method:

    $> SEQC index -h
    usage: SEQC index [-h] [-b] [-t] -o O [O ...] -i I [-n N] [--phix]

    optional arguments:
      -h, --help            show this help message and exit
      -b, --build           build a SEQC index
      -t, --test            test a SEQC index
      -o O [O ...], --organism O [O ...]
                            build index for these organism(s)
      -i I, --index I       name of folder where index should be built or
                            containing the index to be verified
      -n N, --n-threads N   number of threads to use when building index
      --phix                add phiX to the genome index and GTF file.
    $> # for example, to build a mouse index with phiX features added to mm38, in a
    $> # folder called 'mouse', using 7 threads
    $> SEQC index -b -o mm38 -i mouse -n 7 --phix

Some data types require serialized barcode objects (`-b/--barcodes`). These objects contain
all of the barcodes for an experiment, as they would be expected to be observed.
For example, if you expect to observe the reverse complement of the barcodes you used to
construct the library, then this object should be built from reverse complements.
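
As a reminder of what "reverse complement" means for a barcode, here is a minimal sketch. It is not SEQC's implementation; `reverse_complement` is a hypothetical helper that handles only the letters A, C, G, T, and N.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("AACG"))  # prints CGTT
```

If your reads carry `CGTT` where the library was built with `AACG`, the barcode object should be built from the reverse-complemented sequences.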

These barcode files can be found at `s3://dplab-data/barcodes/`. If you need to generate
a new barcode object, this can be accomplished with the built-in `PROCESS_BARCODES`
utility:

    $> PROCESS_BARCODES -h
    usage: PROCESS_BARCODES [-h] [-o O] [-b B [B ...]] [-p P]
                            [--reverse-complement]

    optional arguments:
      -h, --help            show this help message and exit
      -o O, --output_stem O
                            name and path for output file
      -b B [B ...], --barcode_files B [B ...]
                            barcode files
      -p P, --processor P   type of experiment barcodes will be used in
      --reverse-complement  indicates that barcodes in fastq files are reverse
                            complements of the barcodes found in barcode files

Example usage:

`$> PROCESS_BARCODES -o ./in_drop_barcodes -b <barcode_file> -p in-drop --reverse-complement`
would save a new, reverse-complemented barcode object built from `<barcode_file>` at
`./in_drop_barcodes.p`
38 changes: 37 additions & 1 deletion setup.py
@@ -1,6 +1,25 @@
__author__ = 'Ambrose J. Carr'

from setuptools import setup
from warnings import warn
import os
import shutil

# pip3 cannot install external dependencies for python; warn user if external dependencies
# are missing; do this at the end so that the users are more likely to see it.

# look in /usr/lib/, /usr/local/lib/, /usr/hdf5/lib/, and /usr/local/hdf5/lib/ for hdf5
# libraries; if found under an hdf5/ prefix, set an environment variable to help pip3
# install tables.
h5fail = True
if os.path.isfile('/usr/lib/libhdf5.so'):
    h5fail = False
elif os.path.isfile('/usr/local/lib/libhdf5.so'):
    h5fail = False
elif os.path.isfile('/usr/hdf5/lib/libhdf5.so'):
    os.environ['HDF5_DIR'] = '/usr/hdf5/'
    h5fail = False
elif os.path.isfile('/usr/local/hdf5/lib/libhdf5.so'):
    os.environ['HDF5_DIR'] = '/usr/local/hdf5/'
    h5fail = False

setup(name='seqc',
      version='0.1',
@@ -12,17 +31,34 @@
      packages=['seqc', 'seqc.sa_postprocess', 'seqc.sa_preprocess', 'seqc.sa_process'],
      install_requires=[
          'numpy>=1.10.0',
          'cython>0.14',  # tables requirement
          'numexpr>=2.4',  # tables requirement
          'pandas>=0.16.0',
          'matplotlib>=1.4.3',
          'seaborn',
          'scipy>=0.14.0',
          'boto3',
          'pyftpdlib',
          'intervaltree',
          'tables'],
      scripts=['src/scripts/SEQC',
               'src/scripts/PROCESS_BARCODES',
               'src/scripts/TEST_BARCODES',
               'src/scripts/process_multi_file_scseq_experiment.py',
               'src/scripts/process_single_file_scseq_experiment.py'],
      )

# print any warnings
if h5fail:
    warn("""
SEQC: libhdf5 shared library "libhdf5.so" not found in /usr/local/lib/,
/usr/lib/, /usr/hdf5/lib/, or /usr/local/hdf5/lib/.
tables will not find h5lib and installation will likely fail unless the
HDF5_DIR environment variable has been set to the location that HDF5 was
installed into. If HDF5 is not installed, please install it prior to
installing SEQC.
""")

# look for star
if not shutil.which('STAR'):
    warn('SEQC: STAR is not installed. SEQC will not be able to align files.')
22 changes: 22 additions & 0 deletions src/plugins/gitpull.py
@@ -0,0 +1,22 @@
from starcluster.clustersetup import ClusterSetup
from starcluster.logger import log
from subprocess import check_output, call


class DownloadRepo(ClusterSetup):
    def __init__(self, dir_name='/data/software/'):
        if not dir_name.endswith('/'):
            dir_name += '/'
        self.dir_name = dir_name

    def run(self, nodes, master, user, user_shell, volumes):
        folder = self.dir_name
        log.info('installing seqc repo onto %s' % folder)

        master.ssh.execute("mkdir %s" % folder)
        location = folder + "seqc.tar.gz"
        master.ssh.execute(
'curl -H "Authorization: token a22b2dc21f902a9a97883bcd136d9e1047d6d076" -L '
'https://api.github.com/repos/ambrosejcarr/seqc/tarball > %s' % location)
        log.info("seqc.tar.gz has been downloaded to %s" % folder)
        master.ssh.execute('pip3 install %s' % location)
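
The plugin above embeds a GitHub token directly in the source. One possible alternative, sketched here under assumptions, is to read the token from the environment when building the `curl` command. The helper name `build_download_command` and the `GITHUB_TOKEN` variable name are hypothetical, and `shlex.quote` requires Python 3 (StarCluster plugins run under Python 2.7, where `pipes.quote` plays the same role).

```python
import os
import shlex

TARBALL_URL = "https://api.github.com/repos/ambrosejcarr/seqc/tarball"


def build_download_command(location, token=None, url=TARBALL_URL):
    """Build the curl command that saves the seqc tarball to `location`."""
    if token is None:
        # Hypothetical variable name; raises KeyError if it is not set.
        token = os.environ["GITHUB_TOKEN"]
    header = "Authorization: token %s" % token
    return "curl -H %s -L %s > %s" % (shlex.quote(header), url, shlex.quote(location))
```

The resulting string could then be passed to `master.ssh.execute` in place of the literal command, keeping the credential out of version control.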