- Create the environment from the keras2.yml file:
conda env create -f keras2.yml
- Activate the environment:
conda activate keras2
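To confirm the environment is working before training, a quick import check can be run. This is a minimal sanity check assuming keras2.yml installs TensorFlow with its bundled Keras; adjust the import if the environment is set up differently:
# Sanity check: print the TensorFlow version (assumes the keras2
# environment provides TensorFlow with bundled Keras)
python -c "import tensorflow as tf; print(tf.__version__)"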
The test set is composed of sequences from chromosomes 1 and 2. The validation set is composed of sequences from chromosomes 8 and 9. The remaining sequences in the dataset are partitioned into the training set.
The following example bash command will partition sequences from celltype1.bed
into the training set:
awk -F "\t" '{ if ($1 != "chr8" && $1 != "chr9" && $1 != "chr1" && $1 != "chr2" ) {print}}' celltype1.bed > celltype1_TRAINING.bed
The following example bash command will partition sequences from celltype1.bed
into the validation set:
awk -F "\t" '{ if ($1 == "chr8" || $1 == "chr9" ) {print}}' celltype1.bed > celltype1_VALIDATION.bed
The following example bash command will partition sequences from celltype1.bed
into the test set:
awk -F "\t" '{ if ($1 == "chr1" || $1 == "chr2" ) {print}}' celltype1.bed > celltype1_TEST.bed
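When working with many cell types, all three partitions can be generated in one pass with a loop. The following is a minimal sketch assuming each cell type's sequences are in a file named like celltype1.bed in the current working directory:
#!/bin/bash
# Partition every cell type BED file into training/validation/test splits.
# Assumes input files are named <celltype>.bed in the working directory.
for bed in *.bed; do
    case "$bed" in
        *_TRAINING.bed|*_VALIDATION.bed|*_TEST.bed) continue ;;  # skip outputs on reruns
    esac
    base="${bed%.bed}"
    awk -F "\t" '{ if ($1 != "chr8" && $1 != "chr9" && $1 != "chr1" && $1 != "chr2") {print} }' "$bed" > "${base}_TRAINING.bed"
    awk -F "\t" '{ if ($1 == "chr8" || $1 == "chr9") {print} }' "$bed" > "${base}_VALIDATION.bed"
    awk -F "\t" '{ if ($1 == "chr1" || $1 == "chr2") {print} }' "$bed" > "${base}_TEST.bed"
done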
The BED format files can then be converted into FASTA files using bedtools. We will need the assembly sequence of the reference genome for the species from which we have cell type measurements. These can generally be downloaded on the command line from UCSC. For example, the following commands will download and decompress the mm10 mouse reference assembly sequence:
wget https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.fa.gz
gunzip mm10.fa.gz
bedtools getfasta -fi mm10.fa -bed celltype1_TRAINING.bed > celltype1_TRAINING.fa
bedtools getfasta -fi mm10.fa -bed celltype1_VALIDATION.bed > celltype1_VALIDATION.fa
bedtools getfasta -fi mm10.fa -bed celltype1_TEST.bed > celltype1_TEST.fa
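Likewise, the conversion can be looped over all partitioned BED files. A minimal sketch, assuming mm10.fa is in the current working directory (bedtools getfasta builds the mm10.fa.fai index on first use if it is missing):
#!/bin/bash
# Convert every partitioned BED file into a FASTA file against mm10
for bed in *_TRAINING.bed *_VALIDATION.bed *_TEST.bed; do
    bedtools getfasta -fi mm10.fa -bed "$bed" > "${bed%.bed}.fa"
done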
The Python script keras_cnn.py can be used to train cell type models. The model architecture is hard-coded to follow one of the architectures described in Kaplow et al. (bioRxiv, 2020) as a starting point. The required arguments are the positive-class and negative-class input FASTA files for the training and validation sets.
usage: keras_cnn.py [-h] [-b BATCH] [-e EPOCH] [-n NAME] -pt POSTRAIN -nt
NEGTRAIN -pv POSVAL -nv NEGVAL [-bl BASELR] [-ml MAXLR]
[-bm BASEMOMENTUM] [-mm MAXMOMENTUM]
optional arguments:
-h, --help show this help message and exit
-b BATCH, --batch BATCH
Batch size
-e EPOCH, --epoch EPOCH
Number of epochs
-n NAME, --name NAME Output model name
-pt POSTRAIN, --postrain POSTRAIN
Path to positive training set input FASTA file
-nt NEGTRAIN, --negtrain NEGTRAIN
Path to negative training set input FASTA file
-pv POSVAL, --posval POSVAL
Path to positive validation set input FASTA file
-nv NEGVAL, --negval NEGVAL
Path to negative validation set input FASTA file
-bl BASELR, --baselr BASELR
Base learning rate
-ml MAXLR, --maxlr MAXLR
Maximum learning rate
-bm BASEMOMENTUM, --basemomentum BASEMOMENTUM
Base momentum
-mm MAXMOMENTUM, --maxmomentum MAXMOMENTUM
Maximum momentum
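For example, a single model for celltype1 could be trained as follows; the positive and negative FASTA file names here are assumptions that follow the naming scheme used in the sbatch script below:
# Example invocation (file names assumed, matching the run_cnn.sb scheme)
python keras_cnn.py -b 64 -e 30 -n celltype1_b64e30.h5 \
    -pt celltype1_train_positive.fa -nt celltype1_train_negative.fa \
    -pv celltype1_valid_positive.fa -nv celltype1_valid_negative.fa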
When training many cell type models, it can be helpful to use shell and sbatch scripts to train the models in parallel. By iterating over the cell types in a shell script, jobs can be submitted to a GPU compute node in parallel. In the example below, assume FASTAs for cell types 1, 2, 3, 4, etc. are in the current working directory:
#!/bin/bash
ls *.fa | awk '{split($0, a, "_"); print a[1]}' | sort | uniq | while read line; do
echo $line
sbatch run_cnn.sb $line
done
In run_cnn.sb, the compute specifications and input variables to keras_cnn.py can be defined:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --nodes=1 # node count
#SBATCH --ntasks=1 # total number of tasks across all nodes
#SBATCH --cpus-per-task=1 # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=16G # total memory per node (4 GB per cpu-core is default)
#SBATCH --gres=gpu:1 # number of gpus per node
#SBATCH --job-name cnn
#SBATCH --error=cnn-%J.err.txt
#SBATCH --output=cnn-%J.out.txt
NAME="${1}_b64e30.h5"
PT="${1}_train_positive.fa"
NT="${1}_train_negative.fa"
PV="${1}_valid_positive.fa"
NV="${1}_valid_negative.fa"
python keras_cnn.py -b 64 -e 30 -n ${NAME} -pt ${PT} -nt ${NT} -pv ${PV} -nv ${NV}
The model performance and generalizability are especially sensitive to, but not limited to, the following hyperparameters; a minimal sweep sketch follows the list:
- Ratio of the size of the positive set to size of the negative set
- Number of convolutional layers
- Number of convolutional filters
- Learning rate
- Number of epochs
- Batch size
- Dropout
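One way to explore a few of these is to submit one job per hyperparameter combination. The loop below is a minimal sketch; run_cnn_sweep.sb is a hypothetical variant of run_cnn.sb that forwards the batch size and epoch count to keras_cnn.py, and the values swept here are illustrative:
#!/bin/bash
# Minimal sweep sketch: one sbatch job per (batch size, epochs) combination.
# run_cnn_sweep.sb is a hypothetical variant of run_cnn.sb that passes
# $2 (batch size) and $3 (epochs) through to keras_cnn.py -b and -e.
for celltype in celltype1 celltype2; do
    for batch in 32 64 128; do
        for epochs in 15 30; do
            sbatch run_cnn_sweep.sb "$celltype" "$batch" "$epochs"
        done
    done
done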