This project revolves around a fully data-driven system for designing synthetic regulatory elements. We apply a large map of 3.5M+ annotated DNase I Hypersensitive Sites (DHSs, Meuleman et al., Nature 2020) as the basis for a training set to teach a Generative Adversarial Network (GAN) how regulatory elements are encoded in the human genome sequence. This approach results in highly variable pools of synthetic sequences, yet never seen before in Nature. We further apply a supervised adaptation strategy to tune these sequences towards pre-defined cellular contexts of interest.
The general setup procedure is as follows:
We first create a fresh virtual environment dedicated to synthetic sequences development.
This virtual environment will hold all of the dependencies needed to run the various components of the project.
$ conda create -n SynthSeqs python=3.8
$ conda activate SynthSeqs
The above command creates a new environment named SynthSeqs
that uses python3.8, and subsequently activates it.
After running these commands, you should see the name of your virtualenv at the left of your terminal prompt in parentheses like so:
(SynthSeqs) username:~$ ...
This means that the SynthSeqs
virtualenv is active. To deactivate the virtualenv, simply run:
$ deactivate
A running list of python package dependencies lives in etc/requirements.txt
.
Make sure you have your SynthSeqs environment active, and run the following:
$ python -m pip install -r etc/requirements.txt
This should install all of the necessary python packages in your virtualenv.
The SynthSeqs code base is organized around several Python modules:
make_data
generates and preprocesses the numpy DHS sequence datasets,generator
trains a GAN on a large set of DHS sequences,classifier
trains a classifier network on more component-specific and high-confidence DHSs (can also perform hyperparam sweep),synth_seqs
the main module - handles the sequence tuning process.
Each module can be run with:
$ python3 -m MODULE
You can specify a custom output path with the command line arg --output
, otherwise it will default to storing everything under a directory called ~/synth_seqs_output/
.
To run the modules on the GPU, run:
sbatch submit/submit_{MODULE_NAME}.slurm
for a given module.
Pretrained classifier and GAN models are available in the pretrained_models
directory.
This module requires datasets and generator / classifier model weights to be stored under the parent output directory (default ~/synth_seqs_output/
). If you ran the make_data, generator and classifier modules previously everything should be setup correctly.
The synth_seqs module provides a CLI to customize the sequence tuning process:
-n, --num_sequences the total number of sequences to tune
-c, --component the target component to tune sequences towards
--seed random seed for generating fixed random tuning input vectors
-i, --num_iterations total number of tuning iterations
--save_interval the iteration interval at which to output tuning data
-o, --output_dir the parent directory for synthseqs results
--run_name the parent directory for tuning results for this specific run