
Merge pull request #52 from PlantandFoodResearch/dev
v 0.1.0 alpha
timothymillar authored Jun 13, 2017
2 parents 721e0b7 + bda32be commit 5757511
Showing 17 changed files with 938 additions and 132 deletions.
43 changes: 32 additions & 11 deletions README.md
@@ -1,6 +1,6 @@
# TEFingerprint
A Python library with CLI for fingerprinting transposable elements based on paired-end reads.
This project is currently in a pe-alpha state and rapidly changing.
This project is currently in an alpha state and rapidly changing.


## Development Environment
@@ -93,25 +93,41 @@ bwa index -a bwtsw reference.fasta
bwa index -a is repeats.fasta
```

Runing the preprocess pipeline:
Map paired-end reads to repeat elements using `BWA MEM` and convert to a sorted bam file with `samtools`:

```
bwa mem -t 8 \
repeats.fasta \
reads1.fastq \
reads2.fastq \
| samtools view -Su - \
| samtools sort - \
-o reads_mapped_to_repeats.bam
```

Index the resulting bam file with `samtools`:

```
samtools index reads_mapped_to_repeats.bam
```

Running the preprocess program on the indexed bam file:

```
tef preprocess \
reads1.fastq \
reads2.fastq \
reads_mapped_to_repeats.bam \
--reference reference.fasta \
--repeats repeats.fasta \
--output danglers.bam \
--threads 8
```

The output file `danglers.bam` contains reads mapped to the reference genome. Each of these reads is tagged with the repeat element that its pair was mapped to.

The `threads` argument specifies how many threads to use for alignment in bwa (python componants of the pipeline are currently single threaded).

By default, the intermediate files are written to a temporary directory that is automatically removed when the pipeline is completed. These files can be saved by manually specifying a directory with the `--tempdir` option.
Arguments:

By default, the same tage used to store repeat element names associated with each read is `ME` (Mate Element). This can be changed with the `--mate_element_tag` option.
* The `threads` argument specifies how many threads to use for alignment in bwa (python components of the pipeline are currently single threaded).
* By default, the intermediate files are written to a temporary directory that is automatically removed when the pipeline is completed. These files can be saved by manually specifying a directory with the `--tempdir` option.
* By default, the sam tag used to store repeat element names associated with each read is `ME` (Mate Element). This can be changed with the `--mate_element_tag` option.
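
The tagged reads in `danglers.bam` can be checked with `pysam` (the library TEFingerprint itself uses to read bam files). The following is a minimal sketch rather than part of the TEFingerprint CLI; it assumes the default `ME` tag is in use:

```
import pysam

# Count dangler reads per mate element in the preprocess output.
# Assumes the default ME tag (see --mate_element_tag above).
counts = {}
with pysam.AlignmentFile("danglers.bam", "rb") as bam:
    for read in bam:
        if read.has_tag("ME"):
            element = read.get_tag("ME")  # repeat element the read's mate mapped to
            counts[element] = counts.get(element, 0) + 1

for element, n in sorted(counts.items()):
    print(element, n)
```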

### Fingerprint

@@ -122,6 +138,7 @@ tef fingerprint danglers.bam \
-f family1 family2 ... \
-m 20 \
-e 500 \
-q 30 \
-t 4 \
> fingerprint.gff
```
@@ -134,13 +151,15 @@ Arguments:
* `-f/--families` Specifies the (super) families or grouping of repeated elements to fingerprint. These names are matched against the start of the mate element name, e.g. the name `Gypsy` would treat reads tagged with a mate element called `Gypsy3`, `Gypsy27` or `GypsyX` as the same.
* `-m/--minreads` Specifies the minimum number of reads (tips) required to form a cluster. It is used in combination with `-e/--epsilon`.
* `-e/--epsilon` Specifies the maximum allowable distance among a set of read tips to be considered a (sub) cluster. Sub-clusters are calculated based on `-m/--minreads` and `-e/--epsilon` and then overlapping sub-clusters are combined to create clusters (see the sketch after this list).
* `-q/--mapping_quality` Specifies the minimum mapping quality allowed for reads.
* `-t/--threads` Specifies the number of CPU threads to use. The maximum number of threads that may be used is the same as the number of references specified.
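
The following minimal sketch illustrates how `-m/--minreads` and `-e/--epsilon` interact as described above; it is an illustration only, not the TEFingerprint clustering implementation:

```
# Illustration of the -m/--minreads and -e/--epsilon description above,
# not the TEFingerprint implementation.
def clusters(tips, minreads, epsilon):
    """Return merged (start, stop) clusters from read tip positions."""
    tips = sorted(tips)
    subclusters = []
    # any window of `minreads` consecutive tips spanning <= epsilon is a sub-cluster
    for i in range(len(tips) - minreads + 1):
        start, stop = tips[i], tips[i + minreads - 1]
        if stop - start <= epsilon:
            subclusters.append((start, stop))
    merged = []
    for start, stop in subclusters:  # combine overlapping sub-clusters
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged

print(clusters([100, 110, 120, 400, 405, 410, 2000], minreads=3, epsilon=50))
# [(100, 120), (400, 410)]
```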

Additional arguments:

* `--min_eps` The minimum value of epsilon to be used in hierarchical clustering. Defaults to `0`.
* `--hierarchical_clustering` Specifies whether or not to use the hierarchical clustering algorithm in order to differentiate between proximate clusters. Defaults to `True`.
* `--mate_element_tag` The sam tag used to specify the name of each read's mate element. Defaults to `ME`.
* `--feature_csv` Optionally specify the name of a CSV file to output containing feature data.


### Compare
@@ -165,6 +184,7 @@ Arguments:
* `-f/--families` Specifies the (super) families or grouping of repeated elements to fingerprint. These names are matched against the start of the mate element name, e.g. the name `Gypsy` would treat reads tagged with a mate element called `Gypsy3`, `Gypsy27` or `GypsyX` as the same.
* `-m/--minreads` Specifies the minimum number of reads (tips) required to form a cluster. It is used in combination with `-e/--epsilon`.
* `-e/--epsilon` Specifies the maximum allowable distance among a set of read tips to be considered a (sub) cluster. Sub-clusters are calculated based on `-m/--minreads` and `-e/--epsilon` and then overlapping sub-clusters are combined to create clusters.
* `-q/--mapping_quality` Specifies the minimum mapping quality allowed for reads.
* `-b/--fingerprint_buffer` Specifies a distance (in base pairs) to buffer fingerprints by before combining them into comparative bins (see the sketch after this list). This is used to ensure that small clusters that are slightly offset in different samples are treated as a single comparative bin. It also improves the robustness of comparisons by allowing more reads to be included in each bin. Defaults to `0`.
* `-t/--threads` Specifies the number of CPU threads to use. The maximum number of threads that may be used is the same as the number of references specified.
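
The effect of `-b/--fingerprint_buffer` can be pictured with the following minimal sketch; it is an illustration of the description above, not the TEFingerprint implementation:

```
# Illustration of buffering fingerprints before combining them into
# comparative bins, per the -b/--fingerprint_buffer description above.
def comparative_bins(fingerprints, buffer):
    """Merge (start, stop) intervals from all samples after padding by `buffer`."""
    intervals = sorted((start - buffer, stop + buffer)
                       for sample in fingerprints
                       for start, stop in sample)
    bins = []
    for start, stop in intervals:
        if bins and start <= bins[-1][1]:  # overlaps the previous bin: extend it
            bins[-1] = (bins[-1][0], max(bins[-1][1], stop))
        else:
            bins.append((start, stop))
    return bins

sample_a = [(1000, 1200), (5000, 5100)]
sample_b = [(1250, 1400)]  # slightly offset from sample_a's first cluster
print(comparative_bins([sample_a, sample_b], buffer=50))
# [(950, 1450), (4950, 5150)]
```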

@@ -175,7 +195,8 @@ Additional arguments:
* `--hierarchical_clustering` Specifies whether or not to use the hierarchical clustering algorithm in order to differentiate between proximate clusters. Defaults to `True`.
* `--bin_buffer` The same as `--fingerprint_buffer` but buffering is performed after fingerprints are combined, and is therefore less likely to combine slightly offset clusters. Defaults to `0`.
* `--mate_element_tag` The sam tag used to specify the name of each read's mate element. Defaults to `ME`.

* `--feature_csv` Optionally specify the name of a CSV file to output containing (long-form) feature data. This produces one row of data per sample per feature.
* `--character_csv` Optionally specify the name of a CSV file to output containing a matrix of read counts. This produces one column per feature and one row per sample.

### Filter GFF

@@ -210,4 +231,4 @@ Where `comparison.gff ` is a gff file and `comparison_filtered.gff` is a filtere
Arguments:

* `-c/--column_filters`: filters to apply to the first 8 standard gff3 columns. These should take the form `'<column><operator><value>'`
* `-a/--attribute_filters`: filters to apply to the attributes column. These should take the form `'<attribute><operator><value>'`
* `-a/--attribute_filters`: filters to apply to the attributes column. These should take the form `'<attribute><operator><value>'`
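
The filter strings described above can be parsed with a small helper. The sketch below is illustrative only; in particular the operator set (`>=`, `<=`, `!=`, `=`, `>`, `<`) is an assumption and not taken from the TEFingerprint documentation:

```
import re

# Illustrative parser for filter strings of the form '<column><operator><value>'.
# The supported operator set shown here is an assumption.
OPERATORS = {
    '>=': lambda a, b: float(a) >= float(b),
    '<=': lambda a, b: float(a) <= float(b),
    '!=': lambda a, b: a != b,
    '=': lambda a, b: a == b,
    '>': lambda a, b: float(a) > float(b),
    '<': lambda a, b: float(a) < float(b),
}

def parse_filter(expression):
    """Split e.g. 'end>=5000' into ('end', '>=', '5000')."""
    match = re.match(r'(\w+)(>=|<=|!=|=|>|<)(.+)', expression)
    if not match:
        raise ValueError('bad filter: ' + expression)
    return match.groups()

column, operator, value = parse_filter('end>=5000')
print(column, operator, value)             # end >= 5000
print(OPERATORS[operator]('6200', value))  # True
```
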
60 changes: 45 additions & 15 deletions README.rst
@@ -2,7 +2,7 @@ TEFingerprint
=============

A Python library with CLI for fingerprinting transposable elements based
on paired-end reads. This project is currently in a pe-alpha state and
on paired-end reads. This project is currently in an alpha state and
rapidly changing.

Development Environment
@@ -114,34 +114,51 @@ Index fasta files with bwa:
bwa index -a bwtsw reference.fasta
bwa index -a is repeats.fasta

Runing the preprocess pipeline:
Map paired-end reads to repeat elements using ``BWA MEM`` and convert to
a sorted bam file with ``samtools``:

::

tef preprocess \
bwa mem -t 8 \
repeats.fasta \
reads1.fastq \
reads2.fastq \
| samtools view -Su - \
| samtools sort - \
-o reads_mapped_to_repeats.bam

Index the resulting bam file with ``samtools``:

::

samtools index reads_mapped_to_repeats.bam

Running the preprocess program on the indexed bam file:

::

tef preprocess \
reads_mapped_to_repeats.bam \
--reference reference.fasta \
--repeats repeats.fasta \
--output danglers.bam \
--threads 8

The output file ``danglers.bam`` contains reads mapped to the reference
genome. Each of these reads is tagged with the repeat element that its
pair was mapped to.

The ``threads`` argument specifies how many threads to use for alignment
in bwa (python componants of the pipeline are currently single
threaded).

By default, the intermediate files are written to a temporary directory
that is automatically removed when the pipeline is completed. These
files can be saved by manually specifying a directory with the
``--tempdir`` option.
Arguments:

By default, the same tage used to store repeat element names associated
with each read is ``ME`` (Mate Element). This can be changed with the
``--mate_element_tag`` option.
- The ``threads`` argument specifies how many threads to use for
alignment in bwa (python components of the pipeline are currently
single threaded).
- By default, the intermediate files are written to a temporary
directory that is automatically removed when the pipeline is
completed. These files can be saved by manually specifying a
directory with the ``--tempdir`` option.
- By default, the sam tag used to store repeat element names
associated with each read is ``ME`` (Mate Element). This can be
changed with the ``--mate_element_tag`` option.

Fingerprint
~~~~~~~~~~~
@@ -154,6 +171,7 @@ Example usage:
-f family1 family2 ... \
-m 20 \
-e 500 \
-q 30 \
-t 4 \
> fingerprint.gff

@@ -177,6 +195,8 @@ Arguments:
of read tips to be considered a (sub) cluster. Sub-clusters are
  calculated based on ``-m/--minreads`` and ``-e/--epsilon`` and then
  overlapping sub-clusters are combined to create clusters.
- ``-q/--mapping_quality`` Specifies the minimum mapping quality
allowed for reads.
- ``-t/--threads`` Specifies the number of CPU threads to use. The
maximum number of threads that may be used is the same as the number
of references specified.
@@ -190,6 +210,8 @@ Additional arguments:
proximate clusters. Defaults to ``True``.
- ``--mate_element_tag`` The sam tag used to specify the name of each
read's mate element. Defaults to ``ME``.
- ``--feature_csv`` Optionally specify the name of a CSV file to output
containing feature data.

Compare
~~~~~~~
@@ -226,6 +248,8 @@ Arguments:
of read tips to be considered a (sub) cluster. Sub-clusters are
  calculated based on ``-m/--minreads`` and ``-e/--epsilon`` and then
  overlapping sub-clusters are combined to create clusters.
- ``-q/--mapping_quality`` Specifies the minimum mapping quality
allowed for reads.
- ``-b/--fingerprint_buffer`` Specifies a distance (in base pairs) to
buffer fingerprints by before combining them into comparative bins.
This is used to ensure that small clusters, that are slightly offset
@@ -252,6 +276,12 @@ Additional arguments:
to combine slightly offset clusters. Defaults to ``0``
- ``--mate_element_tag`` The sam tag used to specify the name of each
read's mate element. Defaults to ``ME``.
- ``--feature_csv`` Optionally specify the name of a CSV file to output
containing (long-form) feature data. This produces one row of data
per sample per feature.
- ``--character_csv`` Optionally specify the name of a CSV file to
output containing a matrix of read counts. This produces one column
per feature and one row per sample.

Filter GFF
~~~~~~~~~~
4 changes: 2 additions & 2 deletions setup.py
@@ -9,13 +9,13 @@ def read_file(file_name):


setup(name='tefingerprint',
version='0.0.3',
version='0.1.0',
author='Tim Millar',
author_email='tim.millar@plantandfood.co.nz',
url='https://github.com/PlantandFoodResearch/TEFingerprint',
description='Toolkit for identifying transposon movement',
long_description=read_file('README.MD'),
scripts=['applications/tef'],
packages=['tefingerprint'],
classifiers=['Development Status :: 2 - Pre-Alpha']
classifiers=['Development Status :: 3 - Alpha']
)
35 changes: 23 additions & 12 deletions tefingerprint/bamio.py
@@ -7,7 +7,7 @@
from itertools import product


def _read_bam_strings(bam, reference, strand):
def _read_bam_strings(bam, reference, strand, quality):
"""
Read strings from an indexed Bam file.
@@ -17,14 +17,19 @@ def _read_bam_strings(bam, reference, strand):
:type reference: str
:param strand: Select reads that are found on a specific strand, '+' or '-'
:type strand: str
:param quality: minimum mapping quality
:type quality: int
:return: A list of Sam formatted strings
:rtype: list[str]
"""
SAM_FLAGS = {'+': ('-F', '20'),
'-': ('-f', '16', '-F', '4')}
flag = SAM_FLAGS[strand]
return np.array(pysam.view(*flag, bam, reference).splitlines())
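# samtools view flags used here: '-q' sets the minimum mapping quality,
# '-F 20' excludes unmapped (0x4) and reverse-strand (0x10) reads, and
# '-f 16 -F 4' keeps mapped reads on the reverse strand.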
if strand == '+':
args = ('-q', str(quality), '-F', '20', bam, reference)
elif strand == '-':
args = ('-q', str(quality), '-f', '16', '-F', '4', bam, reference)
else:
raise ValueError
return np.array(pysam.view(*args).splitlines())


def _read_bam_reference_lengths(bam):
@@ -95,7 +100,7 @@ def _parse_read_loci(strings):
yield start, stop, name


def _read_bam_ref_strand_loci(bam, reference, strand, categories, tag='ME'):
def _read_bam_ref_strand_loci(bam, reference, strand, categories, quality, tag='ME'):
"""
Read a single strand of a single reference from a bam file and categories by their associated transposon.
If a reference is specified by name only, its range will be extracted from the bam and appended.
@@ -108,14 +113,16 @@ def _read_bam_ref_strand_loci(bam, reference, strand, categories, tag='ME'):
:type strand: str
:param categories: a list of transposon categories
:type categories: list[str]
:param quality: minimum mapping quality
:type quality: int
:param tag: the two letter sam tag containing the transposon associated with each read
:type tag: str
:return: a generator of category tuples and loci tuple generators
:rtype: generator[((str, str, str, str), generator[((int, int, str))])]
"""
source = os.path.basename(bam)
strings = _read_bam_strings(bam, reference, strand)
strings = _read_bam_strings(bam, reference, strand, quality)
if ':' in reference:
pass
else:
@@ -128,7 +135,7 @@ def _read_bam_ref_strand_loci(bam, reference, strand, categories, tag='ME'):
yield (reference, strand, category, source), loci


def _read_bam_ref_loci(bam, reference, categories, tag='ME'):
def _read_bam_ref_loci(bam, reference, categories, quality, tag='ME'):
"""
Read both strands of a single reference from a bam file and categories by their associated transposon.
If a reference is specified by name only, its range will be extracted from the bam and appended.
@@ -139,19 +146,21 @@ def _read_bam_ref_loci(bam, reference, categories, tag='ME'):
:type reference: str
:param categories: a list of transposon categories
:type categories: list[str]
:param quality: minimum mapping quality
:type quality: int
:param tag: the two letter sam tag containing the transposon associated with each read
:type tag: str
:return: a generator of category tuples and loci tuple generators
:rtype: generator[((str, str, str, str), generator[((int, int, str))])]
"""
for block in _read_bam_ref_strand_loci(bam, reference, '+', categories, tag=tag):
for block in _read_bam_ref_strand_loci(bam, reference, '+', categories, quality, tag=tag):
yield block
for block in _read_bam_ref_strand_loci(bam, reference, '-', categories, tag=tag):
for block in _read_bam_ref_strand_loci(bam, reference, '-', categories, quality, tag=tag):
yield block


def extract_bam_reads(bams, categories, references=None, tag='ME'):
def extract_bam_reads(bams, categories, references=None, quality=30, tag='ME'):
"""
Extract reads from one or more bam file(s) and categories reads by their reference, strand, associated transposon,
and source bam file.
@@ -163,6 +172,8 @@ def extract_bam_reads(bams, categories, references=None, tag='ME'):
:type references: str | list[str]
:param categories: target transposon categories
:type categories: str | list[str]
:param quality: minimum mapping quality
:type quality: int
:param tag: the two letter sam tag containing the transposon associated with each read
:type tag: str
@@ -183,7 +194,7 @@ def extract_bam_reads(bams, categories, references=None, tag='ME'):
# run jobs
jobs = product(bams, references)
for bam, reference in jobs:
for block in _read_bam_ref_loci(bam, reference, categories, tag=tag):
for block in _read_bam_ref_loci(bam, reference, categories, quality, tag=tag):
yield block
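
A minimal usage sketch for the updated `extract_bam_reads` generator, based only on the signature and docstrings shown in this diff; the bam path and category names are hypothetical examples:

```
from tefingerprint.bamio import extract_bam_reads

# Signature per this diff:
# extract_bam_reads(bams, categories, references=None, quality=30, tag='ME')
blocks = extract_bam_reads(['danglers.bam'],
                           categories=['Gypsy', 'Copia'],
                           quality=30,
                           tag='ME')

# each block is a (reference, strand, category, source) key and a generator
# of (start, stop, name) read loci
for (reference, strand, category, source), loci in blocks:
    for start, stop, name in loci:
        print(reference, strand, category, source, start, stop, name)
```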

