DeepCpG [1] is a deep neural network for predicting the methylation state of CpG dinucleotides in multiple cells. It can accurately impute incomplete DNA methylation profiles, discover predictive sequence motifs, and quantify the effect of sequence mutations (Angermueller et al., 2017).
Please help improve DeepCpG by reporting bugs, typos in notebooks and documentation, or any ideas on how to make things better. You can submit an issue or send me an email.
DeepCpG model architecture and applications.
[1] Angermueller, Christof, Heather Lee, Wolf Reik, and Oliver Stegle. Accurate Prediction of Single-Cell DNA Methylation States Using Deep Learning. bioRxiv, February 1, 2017, 55715. doi:10.1101/055715. http://biorxiv.org/content/early/2017/02/01/055715
- 170406: A short description of all DeepCpG scripts!
- 170404: New guide on creating and analyzing DeepCpG data released!
- 170404: Training on continuous data, e.g. from bulk experiments, now supported!
- 170305: New documentation of DeepCpG model architectures released!
- 170302: New guide on DeepCpG model training released!
- 170228: New example shell scripts for building a DeepCpG pipeline released!
The easiest way to install DeepCpG is to use PyPI:
pip install deepcpg
Alternatively, you can check out the repository,
git clone https://github.com/cangermueller/deepcpg.git
and then install DeepCpG using setup.py:
python setup.py install
- Store the known CpG methylation states of each cell in a tab-delimited file with the following columns:
- Chromosome (without chr)
- Position of the CpG site on the chromosome starting with one
- Binary methylation state of the CpG site (0=unmethylated, 1=methylated)
Example:
1 3000827 1.0
1 3001007 0.0
1 3001018 1.0
...
Y 90829839 1.0
Y 90829899 1.0
Y 90829918 0.0
- Run dcpg_data.py to create the input data for DeepCpG:
dcpg_data.py \
    --cpg_profiles ./cpg/cell1.tsv ./cpg/cell2.tsv ./cpg/cell3.tsv \
    --dna_files ./dna/mm10 \
    --cpg_wlen 50 \
    --dna_wlen 1001 \
    --out_dir ./data
./cpg/cell[123].tsv store the methylation data from step 1, ./dna contains the DNA database, e.g. mm10 for mouse or hg38 for human, and output data files will be stored in ./data.
- Fine-tune a pre-trained model or train your own model from scratch with dcpg_train.py:
dcpg_train.py \
    ./data/c{1,3,6,7,9}_*.h5 \
    --val_data ./data/c{13,14,15,16,17,18,19}_*.h5 \
    --dna_model CnnL2h128 \
    --cpg_model RnnL1 \
    --joint_model JointL2h512 \
    --nb_epoch 30 \
    --out_dir ./model
This command uses chromosomes 1, 3, 6, 7, and 9 for training and chromosomes 13-19 for validation. --dna_model, --cpg_model, and --joint_model specify the architecture of the DNA, CpG, and Joint model, respectively (see manuscript for details). Training will stop after at most 30 epochs and model files will be stored in ./model.
- Use dcpg_eval.py to impute methylation profiles and evaluate model performance:
dcpg_eval.py \
    ./data/*.h5 \
    --model_files ./model/model.json ./model/model_weights_val.h5 \
    --out_data ./eval/data.h5 \
    --out_report ./eval/report.tsv
This command predicts missing methylation states on all chromosomes and evaluates prediction performance using known methylation states. Predicted states will be stored in ./eval/data.h5 and performance metrics in ./eval/report.tsv.
- Export imputed methylation profiles to HDF5 or bedGraph files:
dcpg_eval_export.py \
    ./eval/data.h5 \
    -o ./eval/hdf \
    -f hdf
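The profile format from step 1 can also be produced programmatically. The following is a minimal sketch, not part of DeepCpG: the helper name `write_cpg_profile` is hypothetical, and it assumes that the input records are `(chromosome, position, state)` tuples with one-based positions. The example records reuse values from the example in step 1.

```python
import csv
import os
import tempfile


def write_cpg_profile(records, path):
    """Write (chromosome, position, state) records as a tab-delimited file.

    Hypothetical helper, not part of DeepCpG. It strips a leading 'chr'
    from chromosome names and rejects non-binary states and positions
    smaller than one.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for chromo, pos, state in records:
            chromo = str(chromo)
            if chromo.lower().startswith("chr"):
                chromo = chromo[3:]
            pos = int(pos)
            if pos < 1:
                raise ValueError("positions are one-based: %d" % pos)
            if state not in (0, 1):
                raise ValueError("state must be 0 or 1: %r" % (state,))
            writer.writerow([chromo, pos, "%.1f" % state])


# Records taken from the example in step 1, with a 'chr' prefix added
# to show that it is removed.
example_records = [
    ("chr1", 3000827, 1),
    ("chr1", 3001007, 0),
    ("chrY", 90829918, 0),
]
out_path = os.path.join(tempfile.gettempdir(), "cell1.tsv")
write_cpg_profile(example_records, out_path)
```

The resulting file can then be passed to dcpg_data.py via --cpg_profiles.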
You can find example notebooks and scripts on how to use DeepCpG here.
The DeepCpG documentation provides information on training, hyper-parameter selection, and model architectures.
You can download pre-trained models from the DeepCpG model zoo.
Why am I getting warnings 'No CpG site at position X!' when using `dcpg_data.py`?
This means that some sites in the --cpg_profiles files are not CpG sites, i.e. there is no CG dinucleotide at the given position in the DNA sequence. Make sure that --dna_files points to the correct genome and that CpG sites are correctly aligned. Since DeepCpG currently does not support allele-specific methylation, data from different alleles must be merged (recommended) or only one allele must be used.
How can I train models on one or more GPUs?
DeepCpG uses the Keras deep learning library, which supports Theano or TensorFlow as backend. If you are using TensorFlow, DeepCpG will automatically run on all available GPUs. If you are using Theano, you have to set the flag device=gpu in the THEANO_FLAGS environment variable:
THEANO_FLAGS='device=gpu,floatX=float32'
You can find more information about Keras backends here, and about parallelization here.
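When launching training from a Python script rather than a shell, the same flags can be passed through the process environment. The sketch below is illustrative only: the helper name `theano_gpu_env` is hypothetical, and setting `KERAS_BACKEND` assumes a Keras version that selects its backend from that environment variable.

```python
import os


def theano_gpu_env(base_env=None):
    """Return a copy of the environment with Theano GPU flags set.

    Hypothetical helper, not part of DeepCpG. The input mapping is copied,
    not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: Keras reads its backend from KERAS_BACKEND.
    env["KERAS_BACKEND"] = "theano"
    # Same flags as in the shell example above.
    env["THEANO_FLAGS"] = "device=gpu,floatX=float32"
    return env


env = theano_gpu_env({})
print(env["THEANO_FLAGS"])  # prints device=gpu,floatX=float32
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when calling dcpg_train.py, so that the flags are set before Theano is imported in the child process.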
- /deepcpg/: Source code
- /docs: Documentation
- /examples/: Examples on how to use DeepCpG
- /script/: Executable DeepCpG scripts
- /tests: Test files
- Extends dcpg_data.py, updates documentation, and fixes minor bugs.
- Extends dcpg_data.py to support bedGraph and TSV input files.
- Enables training on continuous methylation states.
- Adds documentation about creating and analyzing data.
- Updates documentation of scripts and library.
- Christof Angermueller
- cangermueller@gmail.com
- https://cangermueller.com
- @cangermueller