Merge pull request #272 from hall-lab/develop
Pull in changes for svtools 0.4.0
ernfrid authored Oct 2, 2018
2 parents 19ff895 + 295cb29 commit 2dc7f91
Showing 102 changed files with 67,026 additions and 8,725 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -2,4 +2,5 @@
*~
.travis.yml.old
.travis.yml.new
*.old
*.old
geno_refine_scripts/*
5 changes: 3 additions & 2 deletions .travis.yml
@@ -18,12 +18,13 @@ install:
# Useful for debugging any issues with conda
- conda info -a

- deps='libgfortran pip nose coverage statsmodels numpy pandas scipy'
- deps='libgfortran pip nose coverage statsmodels numpy pandas=0.19.2 scipy'
- conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION $deps
- source activate test-environment
- pip install svtyper

# command to run tests
script:
script:
# FIXME If these were modules, this shouldn't be necessary
nosetests --all-modules --traverse-namespace --with-coverage --cover-inclusive --with-id -v
after_success:
50 changes: 26 additions & 24 deletions DEVELOPER.md
@@ -109,11 +109,11 @@ These instructions assume you have committed no additional changes after tagging
3. Create the conda recipe skeleton.
1. Run conda skeleton.

```
conda skeleton pypi svtools
```
```
conda skeleton pypi svtools
```
2. Edit the tests section of the resulting `svtools/meta.yml` file to look like the following section:
```YAML
```YAML
test:
# Python imports
imports:
@@ -125,32 +125,34 @@ These instructions assume you have committed no additional changes after tagging
# entry points work.

- svtools --help
- svtools --version
- create_coordinates --help
```
```
3. Remove setuptools from the `run` subsection of `requirements` section. Failing to do so will result in a linting error in bioconda.
4. Build the conda recipe.

```
conda build -c bioconda svtools
```
```
conda build -c bioconda svtools
```
5. Test your recipe by installing it into a new conda environment. The bioconda channel is needed to pull in pysam.
1. Create a new conda environment to install into.

```
conda create --name svtools_install_test python=2.7 pip
```
```
conda create --name svtools_install_test python=2.7 pip
```
2. Install svtools from your local recipe.

```
conda install -c bioconda -n svtools_install_test --use-local svtools
```
```
conda install -c bioconda -n svtools_install_test --use-local svtools
```

6. Verify the install was successful.

```
source activate svtools_install_test
svtools --version
create_coordinates --version
```
```
source activate svtools_install_test
svtools --version
create_coordinates --version
```
**Note:** pyenv and conda versions of activate can conflict. If this is the case for you, simply source the full path of the conda activate script to activate the environment.

7. Ensure you have a clone/fork of https://github.com/bioconda/bioconda-recipes
@@ -159,11 +161,11 @@ These instructions assume you have committed no additional changes after tagging

We are currently preserving older versions in subdirectories. Create one with the name of the old version and copy the old recipe files there.

```
mkdir $REPO_PATH/bioconda-recipes/recipes/svtools/$LAST_SVTOOLS_VERSION
git mv $REPO_PATH/bioconda-recipes/recipes/svtools/build.sh bioconda-recipes/recipes/svtools/$LAST_SVTOOLS_VERSION
git mv $REPO_PATH/bioconda-recipes/recipes/svtools/meta.yaml bioconda-recipes/recipes/svtools/$LAST_SVTOOLS_VERSION
```
```
mkdir $REPO_PATH/bioconda-recipes/recipes/svtools/$LAST_SVTOOLS_VERSION
git mv $REPO_PATH/bioconda-recipes/recipes/svtools/build.sh bioconda-recipes/recipes/svtools/$LAST_SVTOOLS_VERSION
git mv $REPO_PATH/bioconda-recipes/recipes/svtools/meta.yaml bioconda-recipes/recipes/svtools/$LAST_SVTOOLS_VERSION
```

9. Copy over the build.sh and meta.yaml to the recipe folder

35 changes: 7 additions & 28 deletions INSTALL.md
@@ -10,7 +10,7 @@
### <a name="python-env"></a> Preparing your Python environment
`svtools` requires Python 2.7. Before proceeding you need to prepare your Python environment, we highly recommend that you manage your installation of `svtools` with the [conda][1] package manager. Using conda helps avoid installation difficulties with scientific python packages (Numpy, SciPy).

For those not using conda, we recommend that you manage your installation with the [pip][2] package manager as shown below. Using pip will allow you to uninstall `svtools` easily. You might want to use [pyenv virtualenv][4] to create a virtual environment. There are instructions for setting up a virtualenv on the [pyenv github site](https://github.com/yyuu/pyenv/blob/master/README.md). pyenv installs pip by default.
For those not using conda, we recommend that you manage your installation with the [pip][2] package manager as shown below. Using pip will allow you to uninstall `svtools` easily. You might want to use [pyenv virtualenv][4] to create a virtual environment. There are instructions for setting up a virtualenv on the [pyenv github site](https://github.com/yyuu/pyenv/blob/master/README.md). pyenv installs pip by default.

The creation of the pyenv virtual environment and activation looks like this:

@@ -27,16 +27,12 @@ Once you have your Python environment set up you should be able to install the `

pip install svtools

_note:_ On older systems you may encounter an [error during pysam installation](https://github.com/pysam-developers/pysam/issues/262). This can be solved by specifying a version of [pysam][10] greater than 0.8.1 and less than 0.9.0

pip install 'pysam>=0.8.1,<0.9.0'

You can spot check your `svtools` install by running:

svtools --version

### <a name="git-install"></a> Installing directly from the git repo
Once you have your Python environment set up you will want to clone `svtools` from the [hall-lab github repository][5].
Once you have your Python environment set up you will want to clone `svtools` from the [hall-lab github repository][5].

git clone https://github.com/hall-lab/svtools.git svtools_test
cd svtools_test
@@ -53,16 +49,7 @@ _note:_ you can ignore the warning about "You are in 'detached HEAD' state."

OR, you can just proceed to install from master.

Now install the dependencies suggested in the requirements file

pip install statsmodels

Installing statsmodels can take a few minutes, but it satisfies the requirement for numpy, pandas, and scipy.
_note:_ On older systems you may encounter an [error during pysam installation](https://github.com/pysam-developers/pysam/issues/262). This can be solved by specifying a version of [pysam][10] greater than 0.8.1 and less than 0.9.0

pip install 'pysam>=0.8.1,<0.9.0'

Now we can use pip to install `svtools` from within the repo. If you are not already in the directory
We can use pip to install `svtools` from within the repo. If you are not already in the directory

cd svtools_test

@@ -81,24 +68,16 @@ Once you have your python environment set up, visit the [svtools releases github
Navigate to the download location on your filesystem and use the tar command:

tar -xvzf svtools-0.2.0b1.tar.gz
to expand the archive. [While you wait, enjoy this cartoon from xkcd][7].

to expand the archive. [While you wait, enjoy this cartoon from xkcd][7].

Now enter the directory that has been created.

cd svtools-0.2.0b1

Now install the dependencies suggested in the requiremnts files:

pip install statsmodels

Installing statsmodel can take a few minutes, but it satisfies the requirement for numpy, pandas, and scipy.

In our environment we need to specify [pysam][10] versions greater than 0.8.1 and less than 0.9.0:

pip install 'pysam>=0.8.1,<0.9.0'
pip install .

Installing the dependencies can take a few minutes, please be patient!

Finally we can spot check our `svtools` installation and observe the version number with the following command:

svtools --version
1 change: 0 additions & 1 deletion MANIFEST.in
@@ -1,6 +1,5 @@
#include external scripts
include svtools/bin/bedpesort
include svtools/bin/vcfsort
include svtools/bin/svtyper/svtyper
include versioneer.py
include svtools/_version.py
6 changes: 3 additions & 3 deletions README.md
@@ -1,7 +1,7 @@
# svtools - Comprehensive utilities to explore structural variations in genomes

[![License](https://img.shields.io/github/license/hall-lab/svtools.svg)](LICENSE.txt)
[![Build Status](https://travis-ci.org/hall-lab/svtools.svg?branch=master)](https://travis-ci.org/hall-lab/svtools)
[![Build Status](https://travis-ci.org/hall-lab/svtools.svg?branch=master)](https://travis-ci.org/hall-lab/svtools)
[![Coverage Status](https://coveralls.io/repos/github/hall-lab/svtools/badge.svg?branch=master)](https://coveralls.io/github/hall-lab/svtools?branch=master)

[![bioconda-badge](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io)
@@ -27,8 +27,8 @@
* [Scipy](https://www.scipy.org/)
* [Pandas](http://pandas.pydata.org/)
* [Statsmodels](http://statsmodels.sourceforge.net/)
* [Pysam](https://github.com/pysam-developers/pysam) (≥0.8.1)
* [SVTyper](https://github.com/hall-lab/svtyper) (0.7.0)

## Installation
We recommend you install using `conda`, but you may also install via `pip`. For more detailed instructions, see our [Installation guide](INSTALL.md).

17 changes: 17 additions & 0 deletions SUPPORT.md
@@ -0,0 +1,17 @@
# Support Guidelines
Support is requested by submitting an issue to our github repository.

## Before you submit
Please ensure you've read the existing [README.md](README.md) and
[Tutorial.md](Tutorial.md) as this is our best documentation of
how to use the software.

## Information to include in your issue
The following information is extremely helpful:
* The version of svtools you're using.
* The precise command line you used when you encountered the issue.
* Any error output from svtools.

**NOTE** — While we understand that this is not feasible in all
cases, please provide a small file that reproduces the error. This gives us
a quick way to reproduce the error and start debugging.
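
One possible way to gather the items listed above is sketched here. The `run_and_log` helper and the report file are hypothetical and not part of svtools; the failing `sh -c` call is only a stand-in so the sketch runs without svtools installed.

```shell
# Hypothetical helper (not part of svtools): append a command line, its
# stderr, and its exit status to a report file for pasting into an issue.
report=$(mktemp)

run_and_log() {
    status=0
    printf 'command: %s\n' "$*" >> "$report"
    # Run the command as given; keep stdout on the terminal, log stderr.
    "$@" 2>> "$report" || status=$?
    printf 'exit status: %s\n' "$status" >> "$report"
}

# Stand-in for a failing svtools invocation.
run_and_log sh -c 'echo "example error" >&2; exit 3'
cat "$report"
```

In real use, the stand-in would be replaced by calls such as `run_and_log svtools --version` followed by the exact command that failed.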
31 changes: 31 additions & 0 deletions TieredMerging.md
@@ -0,0 +1,31 @@
# Tiered merging for large cohorts
For very large SV callsets, we recommend a tiered approach to merging the individual Lumpy VCFs. This is useful to keep compute requirements modest and also to help smooth out batch effects between cohorts. For our own large cohorts, we've adopted a merging strategy whereby we merge groups of up to 1000 samples per cohort (larger cohorts will have multiple batches of 1000 or less) and then sort and merge the subsequent merged files again.

## Initial Per-batch sorting and merging

### Construct files containing the paths to each input VCF in a batch
`svtools lsort` can accept a file where each line is a path to an input VCF. For example,

```
/path/to/sample1.vcf
/path/to/sample2.vcf
```
Since there are a large number of samples (up to 1000!) in each batch, using these files can make your command line smaller.
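
As a rough sketch of how such batch files might be produced, the snippet below splits one master list of paths into files of at most 1000 lines using the standard `split` utility. The file names (`all_lumpy_vcfs.txt`, the `batch_` prefix) and the 2500 placeholder paths are assumptions for illustration only.

```shell
# Assumed layout: one master list with a VCF path per line.
batchdir=$(mktemp -d)

# Placeholder for a real listing; here we fabricate 2500 example paths.
seq 1 2500 | sed 's|.*|/path/to/sample&.vcf|' > "$batchdir/all_lumpy_vcfs.txt"

# Split into files of at most 1000 lines: batch_aa, batch_ab, batch_ac.
split -l 1000 "$batchdir/all_lumpy_vcfs.txt" "$batchdir/batch_"

# Show how many paths ended up in each batch file.
wc -l "$batchdir"/batch_*
```

Each resulting `batch_*` file can then be passed to `svtools lsort -f`.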

### Sort and merge each batch
For each input file you constructed in the previous step, sort and merge the SV VCFs as in the Tutorial.

```
svtools lsort -f batch_of_lumpy_vcfs.txt \
| svtools lmerge -i /dev/stdin -f 20 \
| bgzip -c > batch.merged.vcf.gz
```

## Final sorting and merging
After this step you will have one output file per batch. However, these files will _not_ contain genotypes so you'll need to specify additional options to ensure that they are properly combined. In the example below, we assume the input is a file containing the paths to each merged batch. **NOTE:** This step _requires_ that the SNAME field be present in your input files in order to weight the merging correctly.

```
svtools lsort -r -f file_of_merged_batches --batch-size 1 \
| svtools lmerge -i /dev/stdin -f 20 -w carrier_wt \
| bgzip -c > final_output.merged.vcf.gz
```
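
To connect the per-batch step with the final step, the following dry-run sketch prints the per-batch pipeline for each batch list and records the expected outputs in `file_of_merged_batches`. The `batch_aa`-style file names are an assumption (they match the default suffixes of `split`), and the commands are echoed rather than executed so they can be reviewed first.

```shell
# Assumption: batch list files are named batch_aa, batch_ab, ...
mergedir=$(mktemp -d)
touch "$mergedir/batch_aa" "$mergedir/batch_ab"   # empty stand-ins
: > "$mergedir/file_of_merged_batches"

for batch in "$mergedir"/batch_??; do
    out="$batch.merged.vcf.gz"
    # Print the per-batch pipeline instead of running it.
    echo "svtools lsort -f $batch | svtools lmerge -i /dev/stdin -f 20 | bgzip -c > $out"
    # Record the output path for the final lsort/lmerge step.
    echo "$out" >> "$mergedir/file_of_merged_batches"
done

cat "$mergedir/file_of_merged_batches"
```

Once the printed commands have actually been run, the recorded list is the input expected by `svtools lsort -r -f` in the final step.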
27 changes: 21 additions & 6 deletions Tutorial.md
@@ -1,8 +1,8 @@
#Example analysis using `svtools`
# Example analysis using `svtools`
This tutorial will help you begin to explore the use of `svtools` to analyze an SV VCF. It will help you to satisfy the computing environment requirements, gather the required genomic data, and walk through basic analysis using `svtools`.
This tutorial includes example commands that you can alter to refer to your sample names.

##Table of Contents
## Table of Contents
1. Satisfy computing environment requirements
2. Gather genomic data and generate needed helper files
3. Use `svtools` to create a callset
@@ -19,6 +19,8 @@ This tutorial includes example commands that you can alter to refer to your samp
7. Use `svtools prune` to filter out additional variant calls likely representing the same variant
4. Use `svtools classify` to refine genotypes and SV types
1. Generate a repeat elements BED file
1. MEI file generation for hg19
2. MEI file generation for GRCh38
2. Generate a file specifying the number of X chromosome copies in each person
3. Download a file of high-quality, simple deletions and duplications
4. Generate a VCF of training variants
@@ -54,7 +56,7 @@ Follow the documentation on the [SpeedSeq Github page](https://github.com/hall-l
## Use `svtools` to create a callset
### Use `svtools lsort` to combine and sort variants from multiple samples
`svtools lsort` takes a space separated list of all of the LUMPY VCF files generated in the previous step as arguments or a file containing a single column with the paths to the LUMPY VCF files.
The example below shows us combining three samples. The output of this step is one sorted and compressed VCF file containing all variants detected in the three input files.
The example below shows us combining three samples. The output of this step is one sorted and compressed VCF file containing all variants detected in the three input files. This works well, even for thousands of samples, but for very large callsets (> 10,000 samples), we recommend a tiered merging strategy as described [here](TieredMerging.md).
```
svtools lsort NA12877.sv.vcf.gz NA12878.sv.vcf.gz NA12879.sv.vcf.gz \
| bgzip -c > sorted.vcf.gz
@@ -63,7 +65,9 @@ svtools lsort NA12877.sv.vcf.gz NA12878.sv.vcf.gz NA12879.sv.vcf.gz \
**Note:** `svtools lsort` will remove variants with the SECONDARY tag in the INFO field.
This will cause the sorted VCF to have fewer variant lines than the input.

###Use `svtools lmerge` to merge variant calls likely representing the same variant in the sorted VCF
### Use `svtools lmerge` to merge variant calls likely representing the same variant in the sorted VCF
This works well, even for thousands of samples, but for very large callsets (> 10,000 samples), we recommend a tiered merging strategy as described [here](TieredMerging.md).

```
zcat sorted.vcf.gz \
| svtools lmerge -i /dev/stdin -f 20 \
@@ -122,15 +126,15 @@ mkdir -p cn
Then run `svtools copynumber` to add in copynumber values to non-BND variants.
```
svtools copynumber \
--cnvnator cnvnator-multi \
--cnvnator cnvnator \
-s NA12877 \
-w 100 \
-r /temp/cnvnator-temp/NA12877.bam.hist.root \
-c coordinates \
-i gt/NA12877.vcf \
> cn/NA12877.vcf
```
**Note:** The argument to the `--cnvnator` option of `svtools copynumber` may need to be the full path to the cnvnator-multi executable included as part of SpeedSeq. This example assumes cnvnator-multi is installed system-wide.
**Note:** The argument to the `--cnvnator` option of `svtools copynumber` may need to be the full path to the cnvnator executable included as part of SpeedSeq. This example assumes that you used cnvnator and it is installed system-wide. Older versions of SpeedSeq used cnvnator-multi. You should use whichever version of cnvnator that was used to generate your root files.

### Use `svtools vcfpaste` to construct a VCF that pastes together the individual genotyped and copynumber annotated vcfs
`svtools vcfpaste` takes the list of the VCFs generated that contain the additional information for every sample that we have been building up step by step. In this tutorial we call that file cn.list and it contains one column that holds the path to the VCF files generated in the previous step.
@@ -168,6 +172,7 @@ The classifier can be run in several modes depending on the sample size. For thi
### Generate a repeat elements BED file
All `svtools classify` commands require a BED file of repeats for classifying Mobile Element Insertions (MEI). This can be created from the UCSC genome browser.

#### MEI file generation for hg19
```
curl -s http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz \
| gzip -cdfq \
@@ -177,6 +182,16 @@ curl -s http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz \
| bgzip -c > repeatMasker.recent.lt200millidiv.LINE_SINE_SVA.b37.sorted.bed.gz
```

#### MEI file generation for GRCh38
```
curl -s http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz \
| gzip -cdfq \
| awk '{ if ($3<200) print $6,$7,$8,$12"|"$13"|"$11,$3,$10 }' OFS="\t" \
| sort -k1,1V -k2,2n -k3,3n \
| awk '$4~"LINE" || $4~"SINE" || $4~"SVA"' \
| bgzip -c > repeatMasker.recent.lt200millidiv.LINE_SINE_SVA.GRCh38.sorted.bed.gz
```
### Generate a file specifying the number of X chromosome copies in each person
All `svtools classify` commands require a tab-delimited file with two columns. The first column is the sample id and the second column is a number indicating the number of X chromosomes in the sample. Thus there should be a 1 for males and a 2 for females.
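
A minimal sketch of producing such a file is shown below, assuming you start from a hypothetical tab-separated table of sample id and sex (`M` or `F`). The file names and the sex assignments are illustrative assumptions, not values taken from the tutorial.

```shell
# Hypothetical input: sample id and sex (M or F), tab-separated.
sexdir=$(mktemp -d)
printf 'NA12877\tM\nNA12878\tF\nNA12879\tF\n' > "$sexdir/sample_sex.txt"

# Emit sample id and X chromosome count: 1 for males, 2 for females.
awk 'BEGIN { FS = OFS = "\t" } { print $1, ($2 == "M" ? 1 : 2) }' \
    "$sexdir/sample_sex.txt" > "$sexdir/x_copies.txt"

cat "$sexdir/x_copies.txt"
```

The resulting two-column file has the layout described above: sample id first, X chromosome count second.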