Merge pull request #272 from hall-lab/develop

Pull in changes for svtools 0.4.0
hall-lab · Oct 2, 2018 · 2dc7f91
2 parents 19ff895 + 295cb29
commit 2dc7f91

Showing 102 changed files with 67,026 additions and 8,725 deletions.
diff --git a/.gitignore b/.gitignore
@@ -2,4 +2,5 @@
 *~
 .travis.yml.old
 .travis.yml.new
-*.old
+*.old
+geno_refine_scripts/*
diff --git a/.travis.yml b/.travis.yml
@@ -18,12 +18,13 @@ install:
   # Useful for debugging any issues with conda
   - conda info -a
 
-  - deps='libgfortran pip nose coverage statsmodels numpy pandas scipy'
+  - deps='libgfortran pip nose coverage statsmodels numpy pandas=0.19.2 scipy'
   - conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION $deps
   - source activate test-environment
+  - pip install svtyper
 
 # command to run tests
-script: 
+script:
 # FIXME If these were modules, this shouldn't be necessary
     nosetests --all-modules --traverse-namespace --with-coverage --cover-inclusive --with-id -v
 after_success:

diff --git a/DEVELOPER.md b/DEVELOPER.md
@@ -109,11 +109,11 @@ These instructions assume you have committed no additional changes after tagging
 3. Create the conda recipe skeleton.
   1. Run conda skeleton.
 
-    ```
-    conda skeleton pypi svtools
-    ```
+   ```
+   conda skeleton pypi svtools
+   ```
   2. Edit the tests section of the resulting `svtools/meta.yml` file to look like the following section:
-    ```YAML
+   ```YAML
     test:
       # Python imports
       imports:
@@ -125,32 +125,34 @@ These instructions assume you have committed no additional changes after tagging
       # entry points work.
 
       - svtools --help
+      - svtools --version
       - create_coordinates --help
-    ```
+   ```
+   3. Remove setuptools from the `run` subsection of the `requirements` section. Failing to do so will result in a linting error in bioconda.
 4. Build the conda recipe.
 
-  ```
-  conda build -c bioconda svtools
-  ```
+ ```
+ conda build -c bioconda svtools
+ ```
 5. Test your recipe by installing it into a new conda environment. The bioconda channel is needed to pull in pysam.
   1. Create a new conda environment to install into.
 
-    ```
-    conda create --name svtools_install_test python=2.7 pip
-    ```
+   ```
+   conda create --name svtools_install_test python=2.7 pip
+   ```
   2. Install svtools from your local recipe.
 
-    ```
-    conda install -c bioconda -n svtools_install_test --use-local svtools
-    ```
+   ```
+   conda install -c bioconda -n svtools_install_test --use-local svtools
+   ```
 
 6. Verify the install was successful.
 
-  ```
-  source activate svtools_install_test
-  svtools --version
-  create_coordinates --version
-  ```
+ ```
+ source activate svtools_install_test
+ svtools --version
+ create_coordinates --version
+ ```
   **Note:** pyenv and conda versions of activate can conflict. If this is the case for you, simply source the full path of the conda activate script to activate the environment.
 
 7. Ensure you have a clone/fork of https://github.com/bioconda/bioconda-recipes
@@ -159,11 +161,11 @@ These instructions assume you have committed no additional changes after tagging
 
     We are currently preserving older versions in subdirectories. Create one with the name of the old version and copy the old recipe files there.
 
-    ```
-    mkdir $REPO_PATH/bioconda-recipes/recipes/svtools/$LAST_SVTOOLS_VERSION
-    git mv $REPO_PATH/bioconda-recipes/recipes/svtools/build.sh bioconda-recipes/recipes/svtools/$LAST_SVTOOLS_VERSION
-    git mv $REPO_PATH/bioconda-recipes/recipes/svtools/meta.yaml bioconda-recipes/recipes/svtools/$LAST_SVTOOLS_VERSION
-    ```
+   ```
+   mkdir $REPO_PATH/bioconda-recipes/recipes/svtools/$LAST_SVTOOLS_VERSION
+   git mv $REPO_PATH/bioconda-recipes/recipes/svtools/build.sh bioconda-recipes/recipes/svtools/$LAST_SVTOOLS_VERSION
+   git mv $REPO_PATH/bioconda-recipes/recipes/svtools/meta.yaml bioconda-recipes/recipes/svtools/$LAST_SVTOOLS_VERSION
+   ```
 
 9. Copy over the build.sh and meta.yaml to the recipe folder
 

diff --git a/INSTALL.md b/INSTALL.md
@@ -10,7 +10,7 @@
 ### <a name="python-env"></a> Preparing your Python environment
 `svtools` requires Python 2.7.  Before proceeding you need to prepare your Python environment, we highly recommend that you manage your installation of `svtools` with the [conda][1] package manager. Using conda helps avoid installation difficulties with scientific python packages (Numpy, SciPy).
 
-For those not using conda, we recommend that you manage your installation with the [pip][2] package manager as shown below.  Using pip will allow you to uninstall `svtools` easily.  You might want to use [pyenv virtualenv][4] to create a virtual environment. There are instructions for setting up a virtualenv on the [pyenv github site](https://github.com/yyuu/pyenv/blob/master/README.md). pyenv installs pip by default.  
+For those not using conda, we recommend that you manage your installation with the [pip][2] package manager as shown below.  Using pip will allow you to uninstall `svtools` easily.  You might want to use [pyenv virtualenv][4] to create a virtual environment. There are instructions for setting up a virtualenv on the [pyenv github site](https://github.com/yyuu/pyenv/blob/master/README.md). pyenv installs pip by default.
 
 The creation of the pyenv virtual environment and activation looks like this:
 
@@ -27,16 +27,12 @@ Once you have your Python environment set up you should be able to install the `
 
     pip install svtools
 
-_note:_ On older systems you may encounter an [error during pysam installation](https://github.com/pysam-developers/pysam/issues/262). This can be solved by specifying a version of [pysam][10] greater than 0.8.1 and less than 0.9.0
-
-    pip install 'pysam>=0.8.1,<0.9.0'
-
 You can spot check your `svtools` install by running:
 
     svtools --version
 
 ### <a name="git-install"></a> Installing directly from the git repo
-Once you have your Python environment set up you will want to clone `svtools` from the [hall-lab github repository][5].  
+Once you have your Python environment set up you will want to clone `svtools` from the [hall-lab github repository][5].
 
     git clone https://github.com/hall-lab/svtools.git svtools_test
     cd svtools_test
@@ -53,16 +49,7 @@ _note:_ you can ignore the warning about "You are in 'detached HEAD' state."
 
 OR, you can just proceed to install from master.
 
-Now install the dependencies suggested in the requirements file
-
-    pip install statsmodels
-
-Installing statsmodels can take a few minutes, but it satisfies the requirement for numpy, pandas, and scipy.
-_note:_ On older systems you may encounter an [error during pysam installation](https://github.com/pysam-developers/pysam/issues/262). This can be solved by specifying a version of [pysam][10] greater than 0.8.1 and less than 0.9.0 
-
-    pip install 'pysam>=0.8.1,<0.9.0'
-
-Now we can use pip to install `svtools` from within the repo. If you are not already in the directory
+We can use pip to install `svtools` from within the repo. If you are not already in the directory
 
     cd svtools_test
 
@@ -81,24 +68,16 @@ Once you have your python environment set up, visit the [svtools releases github
 Navigate to the download location on your filesystem and use the tar command:
 
     tar -xvzf svtools-0.2.0b1.tar.gz
-    
-to expand the archive.  [While you wait, enjoy this cartoon from xkcd][7]. 
+
+to expand the archive.  [While you wait, enjoy this cartoon from xkcd][7].
 
 Now enter the directory that has been created.
 
     cd svtools-0.2.0b1
-
-Now install the dependencies suggested in the requiremnts files:
-
-    pip install statsmodels
-
-Installing statsmodel can take a few minutes, but it satisfies the requirement for numpy, pandas, and scipy.
-
-In our environment we need to specify [pysam][10] versions greater than 0.8.1 and less than 0.9.0:
-
-    pip install 'pysam>=0.8.1,<0.9.0'
     pip install .
 
+Installing the dependencies can take a few minutes; please be patient!
+
 Finally we can spot check our `svtools` installation and observe the version number with the following command:
 
     svtools --version

diff --git a/MANIFEST.in b/MANIFEST.in
@@ -1,6 +1,5 @@
 #include external scripts
 include svtools/bin/bedpesort
 include svtools/bin/vcfsort
-include svtools/bin/svtyper/svtyper
 include versioneer.py
 include svtools/_version.py
diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 # svtools - Comprehensive utilities to explore structural variations in genomes
 
 [![License](https://img.shields.io/github/license/hall-lab/svtools.svg)](LICENSE.txt)
-[![Build Status](https://travis-ci.org/hall-lab/svtools.svg?branch=master)](https://travis-ci.org/hall-lab/svtools) 
+[![Build Status](https://travis-ci.org/hall-lab/svtools.svg?branch=master)](https://travis-ci.org/hall-lab/svtools)
 [![Coverage Status](https://coveralls.io/repos/github/hall-lab/svtools/badge.svg?branch=master)](https://coveralls.io/github/hall-lab/svtools?branch=master)
 
 [![bioconda-badge](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io)
@@ -27,8 +27,8 @@
    * [Scipy](https://www.scipy.org/)
    * [Pandas](http://pandas.pydata.org/)
    * [Statsmodels](http://statsmodels.sourceforge.net/)
-   * [Pysam](https://github.com/pysam-developers/pysam) (≥0.8.1)
- 
+   * [SVTyper](https://github.com/hall-lab/svtyper) (0.7.0)
+
 ## Installation
 We recommend you install using `conda`, but you may also install via `pip`. For more detailed instructions, see our [Installation guide](INSTALL.md).
 

diff --git a/SUPPORT.md b/SUPPORT.md
@@ -0,0 +1,17 @@
+# Support Guidelines
+To request support, submit an issue to our GitHub repository.
+
+## Before you submit
+Please ensure you've read the existing [README.md](README.md) and
+[Tutorial.md](Tutorial.md), as these are our best documentation of
+how to use the software.
+
+## Information to include in your issue
+The following information is extremely helpful:
+* The version of svtools you're using. 
+* The precise command line you used when you encountered the issue.
+* Any error output from svtools.
+
+**NOTE:** While we understand that this is not feasible in all
+cases, please provide a small file that reproduces the error. This gives us
+a quick way to reproduce the error and start debugging.
diff --git a/TieredMerging.md b/TieredMerging.md
@@ -0,0 +1,31 @@
+# Tiered merging for large cohorts
+For very large SV callsets, we recommend a tiered approach to merging the individual LUMPY VCFs. This keeps compute requirements modest and also helps smooth out batch effects between cohorts. For our own large cohorts, we've adopted a merging strategy whereby we merge groups of up to 1000 samples per cohort (larger cohorts will have multiple batches of 1000 or fewer) and then sort and merge the resulting merged files again.
+
+## Initial Per-batch sorting and merging
+
+### Construct files containing the paths to each input VCF in a batch
+`svtools lsort` can accept a file where each line is a path to an input VCF. For example,
+
+```
+/path/to/sample1.vcf
+/path/to/sample2.vcf
+```
+Since each batch contains a large number of samples (up to 1000!), using these files keeps your command line short.
+
+### Sort and merge each batch
+For each input file you constructed in the previous step, sort and merge the SV VCFs as in the Tutorial.
+
+```
+svtools lsort -f batch_of_lumpy_vcfs.txt \
+  | svtools lmerge -i /dev/stdin -f 20 \
+  | bgzip -c > batch.merged.vcf.gz
+```
+
+## Final sorting and merging
+After this step you will have one output file per batch. However, these files will _not_ contain genotypes so you'll need to specify additional options to ensure that they are properly combined. In the example below, we assume the input is a file containing the paths to each merged batch. **NOTE:** This step _requires_ that the SNAME field be present in your input files in order to weight the merging correctly.
+
+```
+svtools lsort -r -f file_of_merged_batches --batch-size 1 \
+  | svtools lmerge -i /dev/stdin -f 20 -w carrier_wt \
+  | bgzip -c > final_output.merged.vcf.gz
+```
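
The per-batch path files described in TieredMerging.md can be built by splitting one master list of VCF paths into chunks of at most 1000 lines. The following is a minimal sketch, not part of this commit: the directory `tiered_merge_demo`, the file `all_lumpy_vcfs.txt`, and the `batch_` prefix are hypothetical names, and the 2500 paths are placeholders generated only so the example runs without real data.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch directory for the demo (an assumption, not an svtools convention).
outdir=tiered_merge_demo
mkdir -p "$outdir"
rm -f "$outdir"/batch_*

# Hypothetical master list: one VCF path per line (placeholder paths, not real files).
i=1
while [ "$i" -le 2500 ]; do
  echo "/path/to/sample${i}.vcf"
  i=$((i + 1))
done > "$outdir/all_lumpy_vcfs.txt"

# Split into files of at most 1000 lines each: batch_aa, batch_ab, batch_ac.
split -l 1000 "$outdir/all_lumpy_vcfs.txt" "$outdir/batch_"

# Report how many paths landed in each batch file.
for f in "$outdir"/batch_*; do
  printf '%s\t%s\n' "$f" "$(wc -l < "$f" | tr -d ' ')"
done
```

Each resulting `batch_*` file could then be passed to `svtools lsort -f` as in the per-batch step above.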
diff --git a/Tutorial.md b/Tutorial.md
@@ -1,8 +1,8 @@
-#Example analysis using `svtools`
+# Example analysis using `svtools`
 This tutorial will help you begin to explore the use of `svtools` to analyze an SV VCF.  It will help you to satisfy the computing environment requirements, gather the required genomic data, and walk through basic analysis using `svtools`.
 This tutorial includes example commands that you can alter to refer to your sample names.
 
-##Table of Contents
+## Table of Contents
 1. Satisfy computing environment requirements
 2. Gather genomic data and generate needed helper files
 3. Use `svtools` to create a callset
@@ -19,6 +19,8 @@ This tutorial includes example commands that you can alter to refer to your samp
     7. Use `svtools prune` to filter out additional variant calls likely representing the same variant  
 4. Use `svtools classify` to refine genotypes and SV types
     1. Generate a repeat elements BED file
+        1. MEI file generation for hg19
+        2. MEI file generation for GRCh38
     2. Generate a file specifying the number of X chromosome copies in each person
     3. Download a file of high-quality, simple deletions and duplications
     4. Generate a VCF of training variants
@@ -54,7 +56,7 @@ Follow the documentation on the [SpeedSeq Github page](https://github.com/hall-l
 ## Use `svtools` to create a callset
 ### Use `svtools lsort` to combine and sort variants from multiple samples
 `svtools lsort` takes a space separated list of all of the LUMPY VCF files generated in the previous step as arguments or a file containing a single column with the paths to the LUMPY VCF files.
-The example below shows us combining three samples.  The output of this step is one sorted and compressed VCF file containing all variants detected in the three input files.
+The example below shows us combining three samples.  The output of this step is one sorted and compressed VCF file containing all variants detected in the three input files.  This works well, even for thousands of samples, but for very large callsets (> 10,000 samples), we recommend a tiered merging strategy as described [here](TieredMerging.md).
 ```
 svtools lsort NA12877.sv.vcf.gz NA12878.sv.vcf.gz NA12879.sv.vcf.gz \
 | bgzip -c > sorted.vcf.gz
@@ -63,7 +65,9 @@ svtools lsort NA12877.sv.vcf.gz NA12878.sv.vcf.gz NA12879.sv.vcf.gz \
 **Note:** `svtools lsort` will remove variants with the SECONDARY tag in the INFO field.
 This will cause the sorted VCF to have fewer variant lines than the input.
 
-###Use `svtools lmerge` to merge variant calls likely representing the same variant in the sorted VCF
+### Use `svtools lmerge` to merge variant calls likely representing the same variant in the sorted VCF
+This works well, even for thousands of samples, but for very large callsets (> 10,000 samples), we recommend a tiered merging strategy as described [here](TieredMerging.md).
+
 ```
 zcat sorted.vcf.gz \
 | svtools lmerge -i /dev/stdin -f 20 \
@@ -122,15 +126,15 @@ mkdir -p cn
 Then run `svtools copynumber` to add in copynumber values to non-BND variants.
 ```
 svtools copynumber \
---cnvnator cnvnator-multi \
+--cnvnator cnvnator \
 -s NA12877 \
 -w 100 \
 -r /temp/cnvnator-temp/NA12877.bam.hist.root \
  -c coordinates \
  -i gt/NA12877.vcf \
 > cn/NA12877.vcf
 ```
-**Note:** The argument to the `--cnvnator` option of `svtools copynumber` may need to be the full path to the cnvnator-multi executable included as part of SpeedSeq. This example assumes cnvnator-multi is installed system-wide. 
+**Note:** The argument to the `--cnvnator` option of `svtools copynumber` may need to be the full path to the cnvnator executable included as part of SpeedSeq. This example assumes that you used cnvnator and that it is installed system-wide. Older versions of SpeedSeq used cnvnator-multi. Use whichever version of cnvnator was used to generate your root files.
 
 ### Use `svtools vcfpaste` to construct a VCF that pastes together the individual genotyped and copynumber annotated vcfs
 `svtools vcfpaste` takes the list of the VCFs generated that contain the additional information for every sample that we have been building up step by step.  In this tutorial we call that file cn.list and it contains one column that holds the path to the VCF files generated in the previous step.
@@ -168,6 +172,7 @@ The classifier can be run in several modes depending on the sample size. For thi
 ### Generate a repeat elements BED file
 All `svtools classify` commands require a BED file of repeats for classifying Mobile Element Insertions (MEI). This can be created from the UCSC genome browser.
 
+#### MEI file generation for hg19
 ```
 curl -s http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz \
 | gzip -cdfq \
@@ -177,6 +182,16 @@ curl -s http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/rmsk.txt.gz \
 | bgzip -c > repeatMasker.recent.lt200millidiv.LINE_SINE_SVA.b37.sorted.bed.gz
 ```
 
+#### MEI file generation for GRCh38
+```
+curl -s http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz \
+| gzip -cdfq \
+| awk '{ if ($3<200) print $6,$7,$8,$12"|"$13"|"$11,$3,$10 }' OFS="\t" \
+| sort -k1,1V -k2,2n -k3,3n \
+| awk '$4~"LINE" || $4~"SINE" || $4~"SVA"' \
+| bgzip -c > repeatMasker.recent.lt200millidiv.LINE_SINE_SVA.GRCh38.sorted.bed.gz
+```
+
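
To see what the two `awk` stages in the pipelines above do, the sketch below runs them on three made-up rows in the UCSC `rmsk` column layout (column 3 is milliDiv, columns 6 to 8 are the genomic coordinates, column 10 is the strand, and columns 11 to 13 are repName, repClass and repFamily). The rows are synthetic and the output file name `mei_demo.bed` is hypothetical; sorting and compression are left out to keep the example small.

```shell
#!/bin/sh
set -eu

# Three synthetic rmsk rows (tab-separated, 17 columns each):
#   a LINE with milliDiv 150, a SINE with milliDiv 250,
#   and a simple repeat with milliDiv 50.
{
  printf '585\t1000\t150\t0\t0\tchr1\t100\t400\t-1000\t+\tL1HS\tLINE\tL1\t1\t300\t0\t1\n'
  printf '585\t1000\t250\t0\t0\tchr1\t500\t800\t-1000\t-\tAluY\tSINE\tAlu\t1\t300\t0\t2\n'
  printf '585\t1000\t50\t0\t0\tchr2\t100\t200\t-1000\t+\t(CA)n\tSimple_repeat\tSimple_repeat\t1\t100\t0\t3\n'
} \
| awk '{ if ($3<200) print $6,$7,$8,$12"|"$13"|"$11,$3,$10 }' OFS="\t" \
| awk '$4~"LINE" || $4~"SINE" || $4~"SVA"' \
> mei_demo.bed

# Only the low-divergence LINE row survives both filters.
cat mei_demo.bed
```

The first stage keeps rows with milliDiv below 200 and reshapes them into BED-like columns; the second keeps only rows whose name field mentions LINE, SINE or SVA.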
 ### Generate a file specifying the number of X chromosome copies in each person
 All `svtools classify` commands require a tab-delimited file with two columns. The first column is the sample id and the second column is a number indicating the number of X chromosomes in the sample. Thus there should be a 1 for males and a 2 for females.
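
A file in the two-column format just described can be derived from any table that records each sample's sex. The sketch below is one possible approach under stated assumptions: the input `samples_sex.txt`, its `M`/`F` labels, the sample ids, and the output name `sex.txt` are all hypothetical and chosen only for illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical input: sample id and a sex label (M or F), tab-separated.
printf 'SAMPLE_A\tM\nSAMPLE_B\tF\nSAMPLE_C\tF\n' > samples_sex.txt

# Map M -> 1 and F -> 2 X chromosome copies; stop with an error on any other label.
awk 'BEGIN { FS = OFS = "\t" }
     $2 == "M" { print $1, 1; next }
     $2 == "F" { print $1, 2; next }
     { print "unrecognized sex label for " $1 > "/dev/stderr"; exit 1 }' \
  samples_sex.txt > sex.txt

cat sex.txt
```

Failing on unknown labels, rather than guessing, avoids silently writing a wrong copy number for a sample.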