
Merge pull request #52 from PlantandFoodResearch/dev
v 0.1.0 alpha
timothymillar authored Jun 13, 2017
2 parents 721e0b7 + bda32be commit 5757511
Showing 17 changed files with 938 additions and 132 deletions.
43 changes: 32 additions & 11 deletions README.md
@@ -1,6 +1,6 @@
# TEFingerprint
A Python library with CLI for fingerprinting transposable elements based on paired-end reads.
This project is currently in a pe-alpha state and rapidly changing.
This project is currently in an alpha state and rapidly changing.


## Development Environment
@@ -93,25 +93,41 @@ bwa index -a bwtsw reference.fasta
bwa index -a is repeats.fasta
```

Runing the preprocess pipeline:
Map paired-end reads to repeat elements using `BWA MEM` and convert to a sorted bam file with `samtools`:

```
bwa mem -t 8 \
repeats.fasta \
reads1.fastq \
reads2.fastq \
| samtools view -Su - \
| samtools sort - \
-o reads_mapped_to_repeats.bam
```

Index the resulting bam file with `samtools`:

```
samtools index reads_mapped_to_repeats.bam
```

Running the preprocess program on the indexed bam file:

```
tef preprocess \
reads1.fastq \
reads2.fastq \
reads_mapped_to_repeats.bam \
--reference reference.fasta \
--repeats repeats.fasta \
--output danglers.bam \
--threads 8
```

The output file `danglers.bam` contains reads mapped to the reference genome. Each of these reads is tagged with the repeat element that its pair was mapped to.

The `threads` argument specifies how many threads to use for alignment in bwa (python componants of the pipeline are currently single threaded).

By default, the intermediate files are written to a temporary directory that is automatically removed when the pipeline is completed. These files can be saved by manually specifying a directory with the `--tempdir` option.
Arguments:

By default, the same tage used to store repeat element names associated with each read is `ME` (Mate Element). This can be changed with the `--mate_element_tag` option.
* The `threads` argument specifies how many threads to use for alignment in bwa (python components of the pipeline are currently single threaded).
* By default, the intermediate files are written to a temporary directory that is automatically removed when the pipeline is completed. These files can be saved by manually specifying a directory with the `--tempdir` option.
* By default, the sam tag used to store repeat element names associated with each read is `ME` (Mate Element). This can be changed with the `--mate_element_tag` option.
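
The tagged reads in `danglers.bam` can be checked with `pysam` (the library TEFingerprint itself uses to read bam files). The following is a minimal sketch rather than part of the TEFingerprint CLI; it assumes the default `ME` tag is in use:

```
import pysam

# Count dangler reads per mate element in the preprocess output.
# Assumes the default ME tag (see --mate_element_tag above).
counts = {}
with pysam.AlignmentFile("danglers.bam", "rb") as bam:
    for read in bam:
        if read.has_tag("ME"):
            element = read.get_tag("ME")  # repeat element the read's mate mapped to
            counts[element] = counts.get(element, 0) + 1

for element, n in sorted(counts.items()):
    print(element, n)
```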

### Fingerprint

@@ -122,6 +138,7 @@ tef fingerprint danglers.bam \
-f family1 family2 ... \
-m 20 \
-e 500 \
-q 30 \
-t 4 \
> fingerprint.gff
```
@@ -134,13 +151,15 @@ Arguments:
* `-f/--families` Specifies the (super) families or grouping of repeated elements to fingerprint. These names are matched against the start of the mate element name, e.g. the name `Gypsy` would treat reads tagged with a mate element called `Gypsy3`, `Gypsy27` or `GypsyX` as the same.
* `-m/--minreads` Specifies the minimum number of reads (tips) required to form a cluster. It is used in combination with `-e/--epsilon`.
* `-e/--epsilon` Specifies the maximum allowable distance among a set of read tips to be considered a (sub) cluster. Sub-clusters are calculated based on `-m/--minreads` and `-e/--epsilon` and then overlapping sub-clusters are combined to create clusters (see the sketch after this list).
* `-q/--mapping_quality` Specifies the minimum mapping quality allowed for reads.
* `-t/--threads` Specifies the number of CPU threads to use. The maximum number of threads that may be used is the same as the number of references specified.
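
The following minimal sketch illustrates how `-m/--minreads` and `-e/--epsilon` interact as described above; it is an illustration only, not the TEFingerprint clustering implementation:

```
# Illustration of the -m/--minreads and -e/--epsilon description above,
# not the TEFingerprint implementation.
def clusters(tips, minreads, epsilon):
    """Return merged (start, stop) clusters from read tip positions."""
    tips = sorted(tips)
    subclusters = []
    # any window of `minreads` consecutive tips spanning <= epsilon is a sub-cluster
    for i in range(len(tips) - minreads + 1):
        start, stop = tips[i], tips[i + minreads - 1]
        if stop - start <= epsilon:
            subclusters.append((start, stop))
    merged = []
    for start, stop in subclusters:  # combine overlapping sub-clusters
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged

print(clusters([100, 110, 120, 400, 405, 410, 2000], minreads=3, epsilon=50))
# [(100, 120), (400, 410)]
```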

Additional arguments:

* `--min_eps` The minimum value of epsilon to be used in hierarchical clustering. Defaults to `0`.
* `--hierarchical_clustering` Specifies whether or not to use the hierarchical clustering algorithm in order to differentiate between proximate clusters. Defaults to `True`.
* `--mate_element_tag` The sam tag used to specify the name of each read's mate element. Defaults to `ME`.
* `--feature_csv` Optionally specify the name of a CSV file to output containing feature data.


### Compare
@@ -165,6 +184,7 @@ Arguments:
* `-f/--families` Specifies the (super) families or grouping of repeated elements to fingerprint. These names are matched against the start of the mate element name, e.g. the name `Gypsy` would treat reads tagged with a mate element called `Gypsy3`, `Gypsy27` or `GypsyX` as the same.
* `-m/--minreads` Specifies the minimum number of reads (tips) required to form a cluster. It is used in combination with `-e/--epsilon`.
* `-e/--epsilon` Specifies the maximum allowable distance among a set of read tips to be considered a (sub) cluster. Sub-clusters are calculated based on `-m/--minreads` and `-e/--epsilon` and then overlapping sub-clusters are combined to create clusters.
* `-q/--mapping_quality` Specifies the minimum mapping quality allowed for reads.
* `-b/--fingerprint_buffer` Specifies a distance (in base pairs) to buffer fingerprints by before combining them into comparative bins (see the sketch after this list). This is used to ensure that small clusters that are slightly offset in different samples are treated as a single comparative bin. It also improves the robustness of comparisons by allowing more reads to be included in each bin. Defaults to `0`.
* `-t/--threads` Specifies the number of CPU threads to use. The maximum number of threads that may be used is the same as the number of references specified.
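
The effect of `-b/--fingerprint_buffer` can be pictured with the following minimal sketch; it is an illustration of the description above, not the TEFingerprint implementation:

```
# Illustration of buffering fingerprints before combining them into
# comparative bins, per the -b/--fingerprint_buffer description above.
def comparative_bins(fingerprints, buffer):
    """Merge (start, stop) intervals from all samples after padding by `buffer`."""
    intervals = sorted((start - buffer, stop + buffer)
                       for sample in fingerprints
                       for start, stop in sample)
    bins = []
    for start, stop in intervals:
        if bins and start <= bins[-1][1]:  # overlaps the previous bin: extend it
            bins[-1] = (bins[-1][0], max(bins[-1][1], stop))
        else:
            bins.append((start, stop))
    return bins

sample_a = [(1000, 1200), (5000, 5100)]
sample_b = [(1250, 1400)]  # slightly offset from sample_a's first cluster
print(comparative_bins([sample_a, sample_b], buffer=50))
# [(950, 1450), (4950, 5150)]
```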

@@ -175,7 +195,8 @@ Additional arguments:
* `--hierarchical_clustering` Specifies whether or not to use the hierarchical clustering algorithm in order to differentiate between proximate clusters. Defaults to `True`.
* `--bin_buffer` The same as `--fingerprint_buffer` but buffering is performed after fingerprints are combined, and is therefore less likely to combine slightly offset clusters. Defaults to `0`.
* `--mate_element_tag` The sam tag used to specify the name of each read's mate element. Defaults to `ME`.

* `--feature_csv` Optionally specify the name of a CSV file to output containing (long-form) feature data. This produces one row of data per sample per feature.
* `--character_csv` Optionally specify the name of a CSV file to output containing a matrix of read counts. This produces one column per feature and one row per sample.

### Filter GFF

@@ -210,4 +231,4 @@ Where `comparison.gff ` is a gff file and `comparison_filtered.gff` is a filtere
Arguments:

* `-c/--column_filters`: filters to apply to the first 8 standard gff3 columns. These should take the form `'<column><operator><value>'`
* `-a/--attribute_filters`: filters to apply to the attributes column. These should take the form `'<attribute><operator><value>'`
* `-a/--attribute_filters`: filters to apply to the attributes column. These should take the form `'<attribute><operator><value>'`
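
The filter strings described above can be parsed with a small helper. The sketch below is illustrative only; in particular the operator set (`>=`, `<=`, `!=`, `=`, `>`, `<`) is an assumption and not taken from the TEFingerprint documentation:

```
import re

# Illustrative parser for filter strings of the form '<column><operator><value>'.
# The supported operator set shown here is an assumption.
OPERATORS = {
    '>=': lambda a, b: float(a) >= float(b),
    '<=': lambda a, b: float(a) <= float(b),
    '!=': lambda a, b: a != b,
    '=': lambda a, b: a == b,
    '>': lambda a, b: float(a) > float(b),
    '<': lambda a, b: float(a) < float(b),
}

def parse_filter(expression):
    """Split e.g. 'end>=5000' into ('end', '>=', '5000')."""
    match = re.match(r'(\w+)(>=|<=|!=|=|>|<)(.+)', expression)
    if not match:
        raise ValueError('bad filter: ' + expression)
    return match.groups()

column, operator, value = parse_filter('end>=5000')
print(column, operator, value)             # end >= 5000
print(OPERATORS[operator]('6200', value))  # True
```
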
60 changes: 45 additions & 15 deletions README.rst
@@ -2,7 +2,7 @@ TEFingerprint
=============

A Python library with CLI for fingerprinting transposable elements based
on paired-end reads. This project is currently in a pe-alpha state and
on paired-end reads. This project is currently in an alpha state and
rapidly changing.

Development Environment
@@ -114,34 +114,51 @@ Index fasta files with bwa:
bwa index -a bwtsw reference.fasta
bwa index -a is repeats.fasta

Runing the preprocess pipeline:
Map paired-end reads to repeat elements using ``BWA MEM`` and convert to
a sorted bam file with ``samtools``:

::

tef preprocess \
bwa mem -t 8 \
repeats.fasta \
reads1.fastq \
reads2.fastq \
| samtools view -Su - \
| samtools sort - \
-o reads_mapped_to_repeats.bam

Index the resulting bam file with ``samtools``:

::

samtools index reads_mapped_to_repeats.bam

Running the preprocess program on the indexed bam file:

::

tef preprocess \
reads_mapped_to_repeats.bam \
--reference reference.fasta \
--repeats repeats.fasta \
--output danglers.bam \
--threads 8

The output file ``danglers.bam`` contains reads mapped to the reference
genome. Each of these reads is tagged with the repeat element that its
pair was mapped to.

The ``threads`` argument specifies how many threads to use for alignment
in bwa (python componants of the pipeline are currently single
threaded).

By default, the intermediate files are written to a temporary directory
that is automatically removed when the pipeline is completed. These
files can be saved by manually specifying a directory with the
``--tempdir`` option.
Arguments:

By default, the same tage used to store repeat element names associated
with each read is ``ME`` (Mate Element). This can be changed with the
``--mate_element_tag`` option.
- The ``threads`` argument specifies how many threads to use for
alignment in bwa (python components of the pipeline are currently
single threaded).
- By default, the intermediate files are written to a temporary
directory that is automatically removed when the pipeline is
completed. These files can be saved by manually specifying a
directory with the ``--tempdir`` option.
- By default, the sam tag used to store repeat element names
associated with each read is ``ME`` (Mate Element). This can be
changed with the ``--mate_element_tag`` option.

Fingerprint
~~~~~~~~~~~
@@ -154,6 +171,7 @@ Example usage:
-f family1 family2 ... \
-m 20 \
-e 500 \
-q 30 \
-t 4 \
> fingerprint.gff

@@ -177,6 +195,8 @@ Arguments:
of read tips to be considered a (sub) cluster. Sub-clusters are
  calculated based on ``-m/--minreads`` and ``-e/--epsilon`` and then
  overlapping sub-clusters are combined to create clusters.
- ``-q/--mapping_quality`` Specifies the minimum mapping quality
allowed for reads.
- ``-t/--threads`` Specifies the number of CPU threads to use. The
maximum number of threads that may be used is the same as the number
of references specified.
@@ -190,6 +210,8 @@ Additional arguments:
proximate clusters. Defaults to ``True``.
- ``--mate_element_tag`` The sam tag used to specify the name of each
read's mate element. Defaults to ``ME``.
- ``--feature_csv`` Optionally specify the name of a CSV file to output
containing feature data.

Compare
~~~~~~~
@@ -226,6 +248,8 @@ Arguments:
of read tips to be considered a (sub) cluster. Sub-clusters are
  calculated based on ``-m/--minreads`` and ``-e/--epsilon`` and then
  overlapping sub-clusters are combined to create clusters.
- ``-q/--mapping_quality`` Specifies the minimum mapping quality
allowed for reads.
- ``-b/--fingerprint_buffer`` Specifies a distance (in base pairs) to
buffer fingerprints by before combining them into comparative bins.
This is used to ensure that small clusters, that are slightly offset
@@ -252,6 +276,12 @@ Additional arguments:
to combine slightly offset clusters. Defaults to ``0``
- ``--mate_element_tag`` The sam tag used to specify the name of each
read's mate element. Defaults to ``ME``.
- ``--feature_csv`` Optionally specify the name of a CSV file to output
containing (long-form) feature data. This produces one row of data
per sample per feature.
- ``--character_csv`` Optionally specify the name of a CSV file to
output containing a matrix of read counts. This produces one column
per feature and one row per sample.

Filter GFF
~~~~~~~~~~
4 changes: 2 additions & 2 deletions setup.py
@@ -9,13 +9,13 @@ def read_file(file_name):


setup(name='tefingerprint',
version='0.0.3',
version='0.1.0',
author='Tim Millar',
author_email='tim.millar@plantandfood.co.nz',
url='https://github.com/PlantandFoodResearch/TEFingerprint',
description='Toolkit for identifying transposon movement',
long_description=read_file('README.MD'),
scripts=['applications/tef'],
packages=['tefingerprint'],
classifiers=['Development Status :: 2 - Pre-Alpha']
classifiers=['Development Status :: 3 - Alpha']
)
35 changes: 23 additions & 12 deletions tefingerprint/bamio.py
@@ -7,7 +7,7 @@
from itertools import product


def _read_bam_strings(bam, reference, strand):
def _read_bam_strings(bam, reference, strand, quality):
"""
Read strings from an indexed Bam file.
@@ -17,14 +17,19 @@ def _read_bam_strings(bam, reference, strand):
:type reference: str
:param strand: Select reads that are found on a specific strand, '+' or '-'
:type strand: str
:param quality: minimum mapping quality
:type quality: int
:return: A list of Sam formatted strings
:rtype: list[str]
"""
SAM_FLAGS = {'+': ('-F', '20'),
'-': ('-f', '16', '-F', '4')}
flag = SAM_FLAGS[strand]
return np.array(pysam.view(*flag, bam, reference).splitlines())
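# samtools view flags used here: '-q' sets the minimum mapping quality,
# '-F 20' excludes unmapped (0x4) and reverse-strand (0x10) reads, and
# '-f 16 -F 4' keeps mapped reads on the reverse strand.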
if strand == '+':
args = ('-q', str(quality), '-F', '20', bam, reference)
elif strand == '-':
args = ('-q', str(quality), '-f', '16', '-F', '4', bam, reference)
else:
raise ValueError
return np.array(pysam.view(*args).splitlines())


def _read_bam_reference_lengths(bam):
@@ -95,7 +100,7 @@ def _parse_read_loci(strings):
yield start, stop, name


def _read_bam_ref_strand_loci(bam, reference, strand, categories, tag='ME'):
def _read_bam_ref_strand_loci(bam, reference, strand, categories, quality, tag='ME'):
"""
Read a single strand of a single reference from a bam file and categories by their associated transposon.
If a reference is specified by name only, its range will be extracted from the bam and appended.
@@ -108,14 +113,16 @@ def _read_bam_ref_strand_loci(bam, reference, strand, categories, tag='ME'):
:type strand: str
:param categories: a list of transposon categories
:type categories: list[str]
:param quality: minimum mapping quality
:type quality: int
:param tag: the two letter sam tag containing the transposon associated with each read
:type tag: str
:return: a generator of category tuples and loci tuple generators
:rtype: generator[((str, str, str, str), generator[((int, int, str))])]
"""
source = os.path.basename(bam)
strings = _read_bam_strings(bam, reference, strand)
strings = _read_bam_strings(bam, reference, strand, quality)
if ':' in reference:
pass
else:
@@ -128,7 +135,7 @@ def _read_bam_ref_strand_loci(bam, reference, strand, categories, tag='ME'):
yield (reference, strand, category, source), loci


def _read_bam_ref_loci(bam, reference, categories, tag='ME'):
def _read_bam_ref_loci(bam, reference, categories, quality, tag='ME'):
"""
Read both strands of a single reference from a bam file and categories by their associated transposon.
If a reference is specified by name only, its range will be extracted from the bam and appended.
@@ -139,19 +146,21 @@ def _read_bam_ref_loci(bam, reference, categories, tag='ME'):
:type reference: str
:param categories: a list of transposon categories
:type categories: list[str]
:param quality: minimum mapping quality
:type quality: int
:param tag: the two letter sam tag containing the transposon associated with each read
:type tag: str
:return: a generator of category tuples and loci tuple generators
:rtype: generator[((str, str, str, str), generator[((int, int, str))])]
"""
for block in _read_bam_ref_strand_loci(bam, reference, '+', categories, tag=tag):
for block in _read_bam_ref_strand_loci(bam, reference, '+', categories, quality, tag=tag):
yield block
for block in _read_bam_ref_strand_loci(bam, reference, '-', categories, tag=tag):
for block in _read_bam_ref_strand_loci(bam, reference, '-', categories, quality, tag=tag):
yield block


def extract_bam_reads(bams, categories, references=None, tag='ME'):
def extract_bam_reads(bams, categories, references=None, quality=30, tag='ME'):
"""
Extract reads from one or more bam file(s) and categories reads by their reference, strand, associated transposon,
and source bam file.
@@ -163,6 +172,8 @@ def extract_bam_reads(bams, categories, references=None, tag='ME'):
:type references: str | list[str]
:param categories: target transposon categories
:type categories: str | list[str]
:param quality: minimum mapping quality
:type quality: int
:param tag: the two letter sam tag containing the transposon associated with each read
:type tag: str
@@ -183,7 +194,7 @@ def extract_bam_reads(bams, categories, references=None, tag='ME'):
# run jobs
jobs = product(bams, references)
for bam, reference in jobs:
for block in _read_bam_ref_loci(bam, reference, categories, tag=tag):
for block in _read_bam_ref_loci(bam, reference, categories, quality, tag=tag):
yield block
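
A minimal usage sketch for the updated `extract_bam_reads` generator, based only on the signature and docstrings shown in this diff; the bam path and category names are hypothetical examples:

```
from tefingerprint.bamio import extract_bam_reads

# Signature per this diff:
# extract_bam_reads(bams, categories, references=None, quality=30, tag='ME')
blocks = extract_bam_reads(['danglers.bam'],
                           categories=['Gypsy', 'Copia'],
                           quality=30,
                           tag='ME')

# each block is a (reference, strand, category, source) key and a generator
# of (start, stop, name) read loci
for (reference, strand, category, source), loci in blocks:
    for start, stop, name in loci:
        print(reference, strand, category, source, start, stop, name)
```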

