Transposon binding motif search using deep learning

Code for training models that predict transposon insertion bias from DNA sequence, and for visualizing their outputs.

Usage

Clone this repository.
Set up a conda environment using the following command:

conda env create -f envs/environment.yaml

Activate the environment, navigate to the repository directory, and run:

pip install -e .

This will install the package in editable mode.
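The code can be imported under the package name tn_motif. If you are working in Jupyter, load the autoreload extension (%load_ext autoreload, then %autoreload 2) so that edits to the package are picked up without restarting the kernel.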
See notebooks/run_models.ipynb for details on how to use the package. Briefly, define model classes in tn_motif/models/model_classes.py. The ModelTraining object in tn_motif/utils/models.py can then be used for k-fold cross-validation and for prediction on a held-out test set.
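
To make the evaluation scheme concrete, here is a minimal sketch of how a dataset can be split into a held-out test set plus k cross-validation folds. It is not the ModelTraining API; the function name split_holdout_and_folds and its parameters are hypothetical and exist only for illustration.

```python
import random


def split_holdout_and_folds(n_samples, n_folds=5, test_fraction=0.2, seed=0):
    """Split sample indices into a held-out test set and k train/validation folds.

    Hypothetical helper for illustration; not part of tn_motif.
    Returns (test_idx, folds), where folds is a list of (train_idx, val_idx).
    """
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")

    # Shuffle once with a fixed seed so the split is reproducible.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Set the test indices aside first; they are never used during cross-validation.
    n_test = int(round(n_samples * test_fraction))
    test_idx = indices[:n_test]
    dev_idx = indices[n_test:]

    folds = []
    for k in range(n_folds):
        # Every n_folds-th development sample goes to this fold's validation set.
        val_idx = dev_idx[k::n_folds]
        val_set = set(val_idx)
        train_idx = [i for i in dev_idx if i not in val_set]
        folds.append((train_idx, val_idx))
    return test_idx, folds


test_idx, folds = split_holdout_and_folds(n_samples=100, n_folds=5, test_fraction=0.2)
```

Each model would be trained on train_idx and scored on val_idx for every fold, and the test indices would be used only once, for the final prediction.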

Background

Transposon insertion sequencing (TnSeq) is widely used as a genetic screening method in microbial genomics research. Transposons, which are mobile genetic elements, can jump out of a cloning plasmid and insert into the genome, disrupting expression of the gene they land in. The mariner transposon is one such mobile element, with site specificity for TA dinucleotides.

TnSeq data is often quite noisy, with uneven coverage even within a single gene. Previously, this was attributed to PCR amplification bias. However, in my PhD research I developed a method (UMI-TnSeq, Code, Paper) and showed that the unevenness in coverage does not stem from PCR bias, suggesting that the mariner transposon itself has binding preferences beyond the canonical TA motif.

Why this matters

A gene is classified as essential if no transposon insertion counts map to it. If a gene of interest has no reads purely because of its nucleotide sequence, it would be incorrectly classified as essential.

More generally, different bacterial species have different genomic GC content. If the motif is disfavored in high GC genomes, using the mariner transposon for genetic screens may not be appropriate.

Data

To remove biases in counts caused by differences in mutant fitness, I restricted the analysis to sites that lie within non-essential genes (fitness effect when disrupted > -2.5%). Processed data from the original publication is stored in the data directory.
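
The filtering step can be sketched as follows. This is an illustration only: the function name, the data layout, and the toy values are all made up, and the threshold is assumed to be a fractional fitness effect (-2.5% written as -0.025).

```python
FITNESS_THRESHOLD = -0.025  # assumed encoding of -2.5% as a fraction


def sites_in_nonessential_genes(sites, gene_fitness, threshold=FITNESS_THRESHOLD):
    """Keep insertion sites whose gene has a fitness effect above the threshold.

    sites: list of (position, gene_name) tuples (hypothetical layout).
    gene_fitness: dict mapping gene_name to the fitness effect of disruption.
    Sites in genes with no fitness estimate are dropped.
    """
    kept = []
    for position, gene in sites:
        effect = gene_fitness.get(gene)
        if effect is not None and effect > threshold:
            kept.append((position, gene))
    return kept


# Toy values for illustration, not taken from the repository's data.
sites = [(120, "geneA"), (450, "geneB"), (900, "geneC")]
gene_fitness = {"geneA": -0.001, "geneB": -0.30}
kept = sites_in_nonessential_genes(sites, gene_fitness)
```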

For model training, I define a neighborhood around the transposon insertion site and one-hot encode this sequence (see tn_motif/utils/dataset.py and tn_motif/utils/encode.py for details).
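
As a rough illustration of the encoding idea, the sketch below finds TA sites in a sequence and one-hot encodes a fixed window around one of them. It does not reproduce the code in tn_motif/utils; the function names, the A/C/G/T channel order, and the flank parameter are assumptions.

```python
BASES = "ACGT"  # assumed channel order


def find_ta_sites(seq):
    """Return the 0-based index of the T in every TA dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == "TA"]


def one_hot_window(seq, site, flank):
    """One-hot encode flank bases on each side of the TA that starts at `site`.

    Returns a list of 4-element rows (one row per base), or None if the
    window would run past either end of the sequence. Bases other than
    A, C, G, T (for example N) are encoded as all zeros.
    """
    start = site - flank
    end = site + 2 + flank
    if start < 0 or end > len(seq):
        return None
    window = seq[start:end].upper()
    return [[1 if base == b else 0 for b in BASES] for base in window]


seq = "GGCTAGCATAC"
ta_sites = find_ta_sites(seq)
encoded = one_hot_window(seq, ta_sites[0], flank=2)
```

Here the window covers 2 + 2 * flank positions, so each site yields a matrix with that many rows and 4 columns. Sites too close to the sequence ends are skipped in this sketch rather than padded.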

anuraglimdi/transposon_binding_motif