Learning Sequence Motifs

This repository contains datasets and scripts to reproduce the results of "Representation Learning of Genomic Sequence Motifs with Convolutional Neural Networks" by Peter K. Koo and Sean R. Eddy.

The code here depends on Deepomics, a custom high-level API built on top of Tensorflow to build, train, test, and evaluate neural network models. WARNING: Deepomics is a required sub-repository. To clone this repository properly, use:

$ git clone --recursive https://github.com/p-koo/learning_sequence_motifs.git

Dependencies

Tensorflow r1.0 or greater (preferably r1.4 or r1.5)
Python dependencies: PIL, matplotlib, numpy, scipy, sklearn
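
Before running the scripts, it can help to confirm that the listed packages are importable. The following is a minimal sketch, not part of the repository: the helper name find_missing is hypothetical, and "tensorflow" is assumed to be the import name for Tensorflow. It only checks that each module can be located; it does not check versions.

```python
import importlib.util

# Import names for the dependencies listed above. "tensorflow" is the
# assumed import name for Tensorflow; the others are taken from the list.
REQUIRED_MODULES = ["tensorflow", "PIL", "matplotlib", "numpy", "scipy", "sklearn"]


def find_missing(modules):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All listed dependencies were found.")
```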

Overview of the code

To generate datasets:

code/Generate_synthetic_datasets.ipynb
code/Generate_invivo_datasets.ipynb

To train the models on the synthetic dataset and the in vivo dataset:

code/train_synthetic_data.py
code/train_invivo_data.py

These scripts loop through all models described in the manuscript. Each model can be found in /code/models/.

To evaluate the performance of each model on the test set:

code/print_performance_table_synthetic.py
code/print_performance_table_invivo.py

To visualize and save the first convolutional layer filters, and to save a .meme file for the Tomtom motif comparison tool:

code/plot_conv_filters_synthetic.py
code/plot_conv_filters_invivo.py
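
For orientation, a .meme file is a plain-text list of position probability matrices. The sketch below shows one way to write filters in the minimal MEME motif format; it is an illustration under stated assumptions, not the code used by the plotting scripts. The function name format_meme, the motif name filter_0, and the uniform background are all hypothetical choices.

```python
def format_meme(motifs, background=(0.25, 0.25, 0.25, 0.25)):
    """Format position probability matrices as minimal MEME text.

    motifs: dict mapping a motif name to a list of rows, one row per
    position, each row holding non-negative weights in A, C, G, T order.
    """
    lines = [
        "MEME version 4",
        "",
        "ALPHABET= ACGT",
        "",
        "strands: + -",
        "",
        "Background letter frequencies",
        "A %.3f C %.3f G %.3f T %.3f" % tuple(background),
        "",
    ]
    for name, rows in motifs.items():
        lines.append("MOTIF %s" % name)
        lines.append("letter-probability matrix: alength= 4 w= %d" % len(rows))
        for row in rows:
            total = float(sum(row))
            if total <= 0:
                raise ValueError("each position needs a positive total weight")
            # Renormalize so the four probabilities at each position sum to 1.
            lines.append(" ".join("%.6f" % (value / total) for value in row))
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical two-position motif: uniform, then all weight on A.
    print(format_meme({"filter_0": [[1, 1, 1, 1], [2, 0, 0, 0]]}))
```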

To run the Tomtom motif comparison:

code/tomtom_compare.sh

Requires a Tomtom installation that can be invoked from the command line in the current directory.
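
As a rough illustration of what a Tomtom comparison involves, the sketch below builds a basic Tomtom command and checks that the executable is on PATH before running it. It is not a transcription of tomtom_compare.sh, which may use different options. The function names and the example file paths are hypothetical; only the database name JASPAR_CORE_2016_vertebrates.meme comes from this repository.

```python
import shutil
import subprocess


def build_tomtom_command(query_meme, target_meme, output_dir):
    """Build a Tomtom command comparing query motifs to a target database."""
    # -oc writes results to output_dir, overwriting any existing contents.
    return ["tomtom", "-oc", output_dir, query_meme, target_meme]


def run_tomtom(query_meme, target_meme, output_dir):
    """Run Tomtom, failing early if the executable cannot be found."""
    if shutil.which("tomtom") is None:
        raise RuntimeError("tomtom was not found on PATH")
    subprocess.run(
        build_tomtom_command(query_meme, target_meme, output_dir), check=True
    )


if __name__ == "__main__":
    # Hypothetical paths, shown only to illustrate the argument order.
    command = build_tomtom_command(
        "filters.meme", "JASPAR_CORE_2016_vertebrates.meme", "tomtom_out"
    )
    print(" ".join(command))
```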

To visualize guided-backprop saliency maps:

code/Saliency_comparison.ipynb
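
Guided backprop differs from plain backpropagation only in how gradients pass through ReLU units: a gradient is kept where the forward input was positive and the incoming gradient is also positive. The sketch below shows that rule on plain Python lists. It is a toy illustration with a hypothetical function name, not the notebook's implementation.

```python
def guided_relu_backward(pre_activations, upstream_grads):
    """Apply the guided-backprop rule for a ReLU layer.

    A gradient is passed through only where the forward input to the ReLU
    was positive AND the incoming gradient is positive; otherwise it is
    set to zero.
    """
    return [
        grad if (x > 0 and grad > 0) else 0.0
        for x, grad in zip(pre_activations, upstream_grads)
    ]


if __name__ == "__main__":
    inputs = [1.0, -1.0, 2.0, 3.0]
    grads = [0.5, 0.5, -0.5, 2.0]
    # Only the first and last units satisfy both conditions.
    print(guided_relu_backward(inputs, grads))
```

Plain backpropagation through a ReLU would keep the negative gradient at the third unit; guided backprop zeroes it.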

Overview of data

Due to size restrictions, the datasets are not included in the repository. Each dataset can be created by running the Python notebooks Generate_synthetic_datasets.ipynb and Generate_invivo_datasets.ipynb.
JASPAR_CORE_2016_vertebrates.meme contains a database of PWMs that is used for the Tomtom comparison search.
pfm_vertebrates.txt also contains JASPAR motifs. This file is used as ground truth for the synthetic dataset.

Overview of results

All results for each CNN model and dataset are saved in the respective directory (synthetic or invivo).
Trained model parameters are saved in results/synthetic/model_params.
Visualizations of the convolution filters and results from Tomtom are saved in results/synthetic/conv_filters.
The performance table is saved in results/synthetic/performance_summary.tsv (written automatically by print_performance_table_synthetic.py).
