Bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants
- This repo contains code for the paper
Bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants
(technical report coming soon) by Anusri Pampari*, Anna Shcherbina*, Anshul Kundaje. (*authors contributed equally)
- Please contact Anusri Pampari (<first-name>@stanford.edu) for suggestions and comments.
- Here is a link to the slides and a comprehensive tutorial. Please see the FAQ and file a GitHub issue if you have questions.
Chromatin profiles (DNASE-seq and ATAC-seq) exhibit multi-resolution shapes and spans regulated by cooperative binding of transcription factors (TFs). This complexity is made harder to mine by confounding bias from the enzymes (DNASE-I/Tn5) used in these assays. Existing methods do not model this complexity at base resolution and do not correctly account for enzyme bias, so they miss the high-resolution architecture of these profiles. Here we introduce ChromBPNet to address both of these aspects.
ChromBPNet (shown in the image as Bias-Factorized ChromBPNet) is a fully convolutional neural network that uses dilated convolutions with residual connections to enable large receptive fields with efficient parameterization. It also performs automatic assay bias correction in two steps. First, a simple model is trained on chromatin background to capture the enzyme effect (called Frozen Bias Model in the image). This model is then used to regress out the effect of the enzyme from the ATAC-seq/DNASE-seq profiles. This two-step process ensures that the sequence component of the ChromBPNet model (called TF Model) does not learn enzymatic bias.
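To make the point about dilated convolutions concrete, the sketch below computes the receptive field of a stack of stride-1 1D convolutions. The layer counts, kernel widths, and dilation rates are hypothetical values chosen for illustration; they are not ChromBPNet's actual hyperparameters.

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in bases) of a stack of stride-1 1D convolutions."""
    if len(kernel_sizes) != len(dilations):
        raise ValueError("need exactly one dilation rate per layer")
    # Each layer widens the field by (kernel_size - 1) * dilation positions.
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))


# Hypothetical stack: eight width-3 layers with dilation doubling per layer.
dilated = receptive_field([3] * 8, [2 ** i for i in range(8)])

# Same depth and same number of weights per layer, but no dilation.
plain = receptive_field([3] * 8, [1] * 8)

print(dilated, plain)  # 511 17
```

With identical depth and kernel width, the dilated stack in this toy setting sees 511 bases versus 17 for the undilated one, which is the sense in which dilation gives a large receptive field without extra parameters.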
This section discusses the packages needed to train a ChromBPNet model. First, we recommend using a GPU for model training, with the necessary NVIDIA drivers and CUDA already installed. You can verify that your machine is set up to use GPUs by running the nvidia-smi command and checking that it returns information about your system GPU(s) rather than an error. Second, there are two ways to get the packages needed to train ChromBPNet models, which we detail below.
Download and install the latest version of Docker for your platform. Here is the link for the installers - Docker Installers. Run the docker run command below to open an environment with all the packages installed, then run cd chrombpnet to start the tutorial.
Note: To access your system GPUs from within the Docker container, you must have the NVIDIA Container Toolkit installed on your host machine.
docker run -it --rm --memory=100g --gpus device=0 kundajelab/chrombpnet:latest
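Whether you are on the host or inside the container, the nvidia-smi check described earlier can also be scripted. The following is a minimal sketch; the helper name gpu_visible is hypothetical and not part of the chrombpnet package.

```python
import shutil
import subprocess


def gpu_visible() -> bool:
    """Return True if nvidia-smi is on PATH and exits without an error."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Driver tools are not installed or not on PATH.
        return False
    try:
        result = subprocess.run([exe], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


print("GPU visible:", gpu_visible())
```

If this prints False inside the Docker container but True on the host, the NVIDIA Container Toolkit or the --gpus option is the first thing to check.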
Create a clean conda environment with python >=3.8
conda create -n chrombpnet python=3.8
conda activate chrombpnet
Install non-Python requirements via conda
conda install -y -c conda-forge -c bioconda samtools bedtools ucsc-bedgraphtobigwig pybigwig meme
pip install chrombpnet
git clone https://github.com/kundajelab/chrombpnet.git
pip install -e chrombpnet
The command to train ChromBPNet with a pre-trained bias model will look like this:
# Pass exactly one of -ibam, -ifrag or -itag (this example uses -ibam):
#   -ibam  /path/to/input.bam
#   -ifrag /path/to/input.tsv
#   -itag  /path/to/input.tagAlign
chrombpnet pipeline \
-ibam /path/to/input.bam \
-d "ATAC" \
-g /path/to/hg38.fa \
-c /path/to/hg38.chrom.sizes \
-p /path/to/peaks.bed \
-n /path/to/nonpeaks.bed \
-fl /path/to/fold_0.json \
-b /path/to/bias.h5 \
-o path/to/output/dir/
- -ibam or -ifrag or -itag: Input file path with filtered reads in one of bam, fragment or tagalign formats. Example files for supported types - bam, fragment, tagalign
- -d: Assay type. The following types are supported - "ATAC" or "DNASE"
- -g: Reference genome fasta file. Example file for the human reference - hg38.fa
- -c: Tab-separated file of chromosome names and sizes. Example file for the human reference - hg38.chrom.sizes
- -p: Input peaks in narrowPeak file format. The file must have 10 columns, with values minimally for chr, start, end and summit (10th column). Every region is centered at start + summit internally, across all regions. Example file with ENCSR868FGK dataset - peaks.bed
- -n: Input nonpeaks (background regions) in narrowPeak file format. The file must have 10 columns, with values minimally for chr, start, end and summit (10th column). Every region is centered at start + summit internally, across all regions. Example file with ENCSR868FGK dataset - nonpeaks.bed
- -fl: JSON file showing the split of chromosomes for train, test and valid. Example 5 fold jsons for the human reference - folds
- -b: Bias model in .h5 format. Bias models are generally transferable across assay types following similar protocol. Repository of pre-trained bias models for use here. Instructions to train a custom bias model are below.
- -o: Output directory path
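The peak and nonpeak requirements above (10 tab-separated columns, region centered at start + summit) can be checked before training. This is a minimal sketch, not the package's own loader; the function name peak_center and the example record are made up for illustration.

```python
from typing import Tuple


def peak_center(line: str) -> Tuple[str, int]:
    """Return (chrom, center) for one narrowPeak line, center = start + summit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, found {len(fields)}")
    chrom = fields[0]
    start = int(fields[1])
    summit = int(fields[9])  # 10th column: summit offset relative to start
    return chrom, start + summit


# Made-up record: only chr, start, end and the summit carry real values here.
example = "chr1\t1000\t1500\t.\t0\t.\t0\t-1\t-1\t250"
print(peak_center(example))  # ('chr1', 1250)
```

Running this over every line of peaks.bed and nonpeaks.bed is a cheap way to catch files with a missing summit column before a long training job starts.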
Please find scripts and best practices for preprocessing here.
The output directory will be populated as follows -
models\
    bias_model_scaled.h5
    chrombpnet.h5
    chrombpnet_nobias.h5 (TF-Model i.e model to predict bias corrected accessibility profile)
logs\
    chrombpnet.log (loss per epoch)
    chrombpnet.log.batch (loss per batch per epoch)
    (..other hyperparameters used in training)
auxilary\
    filtered.peaks
    filtered.nonpeaks
    ...
evaluation\
    overall_report.pdf
    overall_report.html
    bw_shift_qc.png
    bias_metrics.json
    chrombpnet_metrics.json
    chrombpnet_only_peaks.counts_pearsonr.png
    chrombpnet_only_peaks.profile_jsd.png
    chrombpnet_nobias_profile_motifs.pdf
    chrombpnet_nobias_counts_motifs.pdf
    chrombpnet_nobias_max_bias_response.txt
    chrombpnet_nobias.....footprint.png
    ...
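After a run finishes, it can be useful to confirm that the key files from the listing above were written. The sketch below checks a subset of them; the helper name missing_outputs and the choice of which files to treat as required are assumptions made for illustration.

```python
from pathlib import Path

# Subset of the files named in the output listing, relative to the -o directory.
EXPECTED_OUTPUTS = [
    "models/bias_model_scaled.h5",
    "models/chrombpnet.h5",
    "models/chrombpnet_nobias.h5",
    "logs/chrombpnet.log",
    "logs/chrombpnet.log.batch",
    "evaluation/overall_report.html",
    "evaluation/chrombpnet_metrics.json",
]


def missing_outputs(out_dir):
    """Return the expected relative paths that are not present under out_dir."""
    root = Path(out_dir)
    return [rel for rel in EXPECTED_OUTPUTS if not (root / rel).is_file()]


# Placeholder path; replace with the directory passed to -o.
print(missing_outputs("path/to/output/dir/"))
```

An empty list means every file in the checked subset exists; it says nothing about whether the model trained well, which is what the evaluation reports are for.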
A detailed usage guide, with more information on the input arguments, the output file formats, and how to work with them, is provided here and here.
For more information, also see:
- Full documentation list
- Detailed list of input arguments
- Detailed usage guide with more information on the output file formats and how to work with them
- Best practices for preprocessing
- Training tutorial
- Frequently Asked Questions, FAQ
The command to train a custom bias model will look like this:
# Pass exactly one of -ibam, -ifrag or -itag (this example uses -ibam):
#   -ibam  /path/to/input.bam
#   -ifrag /path/to/input.tsv
#   -itag  /path/to/input.tagAlign
chrombpnet bias pipeline \
-ibam /path/to/input.bam \
-d "ATAC" \
-g /path/to/hg38.fa \
-c /path/to/hg38.chrom.sizes \
-p /path/to/peaks.bed \
-n /path/to/nonpeaks.bed \
-fl /path/to/fold_0.json \
-b 0.5 \
-o path/to/output/dir/
- -ibam or -ifrag or -itag: Input file path with filtered reads in one of bam, fragment or tagalign formats. Example files for supported types - bam, fragment, tagalign
- -d: Assay type. The following types are supported - "ATAC" or "DNASE"
- -g: Reference genome fasta file. Example file for the human reference - hg38.fa
- -c: Tab-separated file of chromosome names and sizes. Example file for the human reference - hg38.chrom.sizes
- -p: Input peaks in narrowPeak file format. The file must have 10 columns, with values minimally for chr, start, end and summit (10th column). Every region is centered at start + summit internally, across all regions. Example file with ENCSR868FGK dataset - peaks.bed
- -n: Input nonpeaks (background regions) in narrowPeak file format. The file must have 10 columns, with values minimally for chr, start, end and summit (10th column). Every region is centered at start + summit internally, across all regions. Example file with ENCSR868FGK dataset - nonpeaks.bed
- -fl: JSON file showing the split of chromosomes for train, test and valid. Example 5 fold jsons for the human reference - folds
- -o: Output directory path
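The fold file splits chromosomes into train, test and valid sets, so a quick sanity check is that no chromosome appears in more than one set. The sketch below assumes the JSON has top-level keys named "train", "valid" and "test", each holding a list of chromosome names; that layout and the example assignment are assumptions, so compare against the example folds before relying on it.

```python
import json
from itertools import combinations

SPLITS = ("train", "valid", "test")  # assumed key names


def check_fold(fold: dict) -> None:
    """Raise ValueError if a split is missing or two splits share a chromosome."""
    for name in SPLITS:
        if name not in fold:
            raise ValueError(f"missing split: {name}")
    for a, b in combinations(SPLITS, 2):
        overlap = set(fold[a]) & set(fold[b])
        if overlap:
            raise ValueError(f"{a} and {b} share chromosomes: {sorted(overlap)}")


# Hypothetical split, for illustration only.
example_fold = json.loads(
    '{"train": ["chr2", "chr3", "chr4"], "valid": ["chr8"], "test": ["chr1"]}'
)
check_fold(example_fold)
print("fold is consistent")
```

To check a real file, load it with json.load on an open file handle and pass the resulting dictionary to check_fold.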
Please find scripts and best practices for preprocessing here.
The output directory will be populated as follows -
models\
    bias.h5
logs\
    bias.log (loss per epoch)
    bias.log.batch (loss per batch per epoch)
    (..other hyperparameters used in training)
intermediates\
    ...
evaluation\
    overall_report.html
    overall_report.pdf
    pwm_from_input.png
    k562_epoch_loss.png
    bias_metrics.json
    bias_only_peaks.counts_pearsonr.png
    bias_only_peaks.profile_jsd.png
    bias_only_nonpeaks.counts_pearsonr.png
    bias_only_nonpeaks.profile_jsd.png
    bias_predictions.h5
    bias_profile.pdf
    bias_counts.pdf
    ...
A detailed usage guide, with more information on the input arguments, the output file formats, and how to work with them, is provided here and here.
For more information, also see:
- Full documentation list
- Detailed list of input arguments
- Detailed usage guide with more information on the output file formats and how to work with them
- Best practices for preprocessing
- Training tutorial
- Frequently Asked Questions, FAQ
If you're using ChromBPNet in your work, please cite as follows:
@software{Pampari_Bias_factorized_base-resolution_2023,
author = {Pampari, Anusri and Shcherbina, Anna and Nair, Surag and Schreiber, Jacob and Patel, Aman and Wang, Austin and Kundu, Soumya and Shrikumar, Avanti and Kundaje, Anshul},
doi = {10.5281/zenodo.7567627},
month = {1},
title = {{Bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants.}},
url = {https://github.com/kundajelab/chrombpnet},
version = {0.1.1},
year = {2023}
}