CMImpute (Cross-species Methylation Imputation) is an imputation method based on a Conditional Variational Autoencoder (CVAE) that imputes methylation values for species-tissue combinations that have not previously been experimentally profiled. CMImpute takes as input individual methylation samples along with their corresponding species and tissue labels. It outputs a species-tissue combination mean sample, or combination mean sample for short, that represents a species' average methylation values in a particular tissue type.
Final imputed species-tissue combination mean samples for species and tissue combinations that have not been experimentally profiled (19,786 magenta imputed combinations in figure above) can be found here.
- keras 2.10.0
- numpy 1.23.4
- pandas 1.4.4
- scikit-learn 0.24.2
- scipy 1.9.3
- tensorflow 2.10.0
The training data input should be formatted as follows. Example training inputs can be found in the example_data directory. The data can be saved as a .pickle, .csv, or .tsv file, with the format inferred from the file extension.
- First column containing row names
- First row containing species and tissue names followed by probe names
- Data consisting of one-hot encoded species and tissue labels followed by methylation values
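The layout described above can be illustrated with a small toy table. This is only a sketch: the tissue, species, and probe names are made up, the tissue columns are placed before the species columns to mirror the positional arguments used in the examples in this README, and the positions are assumed to be 0-based, inclusive, and counted without the row-name column.

```python
import io
import pandas as pd

# Hypothetical toy labels and probes; real data has many more columns.
tissues = ["Blood", "Liver"]
species = ["Human", "Mouse"]
probes = ["cg0001", "cg0002", "cg0003"]

# Each row: one-hot tissue columns, one-hot species columns, then methylation values.
values = [
    [1, 0, 1, 0, 0.91, 0.12, 0.55],  # a Human Blood sample
    [0, 1, 0, 1, 0.88, 0.20, 0.47],  # a Mouse Liver sample
]
table = pd.DataFrame(values, index=["sample_1", "sample_2"],
                     columns=tissues + species + probes)

# Column positions in the style of t_start/t_end/s_start/s_end/d_start
# (assumption: 0-based, inclusive, row-name column not counted).
t_start, t_end = 0, len(tissues) - 1
s_start, s_end = len(tissues), len(tissues) + len(species) - 1
d_start = len(tissues) + len(species)

# Write as CSV: the first column holds row names, the first row holds column names.
buffer = io.StringIO()
table.to_csv(buffer)
csv_text = buffer.getvalue()
print(csv_text)
print(t_start, t_end, s_start, s_end, d_start)
```

To write an actual file, `table.to_csv("train.csv")` could be used in place of the in-memory buffer.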
Hyperparameters are selected by performing a grid search over combinations of the following hyperparameters: number of hidden layers, hidden layer dimensions, activation function, learning rate, epsilon, and latent space dimension. Refer to the manuscript for details on the hyperparameter values being tested. Use hyperparameter_tuning.py to train a model and record the performance for each hyperparameter combination in a text file.
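To make the idea of an indexed grid concrete, here is a minimal sketch of how hyperparameter combinations could be enumerated and looked up by index. The candidate values, the dictionary keys, and the 0-based indexing are assumptions for illustration; the grid actually used by hyperparameter_tuning.py is described in the manuscript and may be ordered differently.

```python
from itertools import product

# Hypothetical candidate values, for illustration only.
grid = {
    "layout_index": [0, 1],
    "n": [7, 8],                      # hidden layer dimension is 2**n
    "activation_function": ["relu", "tanh"],
    "learning_rate": [0.001, 0.0001],
    "epsilon": [1e-7],
    "latent_space_dimension": [2, 4],
}

# Enumerate every combination in a fixed, reproducible order.
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]


def combination_for(index):
    """Return the hyperparameter combination at a given index (0-based in this sketch)."""
    return combinations[index]


print(len(combinations))
print(combination_for(0))
```

Because the enumeration order is fixed, an index alone is enough to identify a combination, which is what makes one result file per index workable.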
The training data (training_data) contains individual methylation samples, with multiple samples potentially corresponding to the same species-tissue combination. The observed combination mean samples (combo_averages) contain one sample per species-tissue combination representing the average methylation values. The example_data directory demonstrates the format. This data can be a .pickle, .csv, or .tsv file. The validation seed is optional. If it is not provided, a seed that results in a valid train-validation split will be selected (see manuscript for details).
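The relationship between individual samples and combination mean samples can be sketched with pandas. This is an illustration under assumptions, not the procedure used to build the files in example_data: the column names and values are made up, and the mean is taken over all samples that share the same one-hot labels.

```python
import pandas as pd

# Hypothetical one-hot label columns (tissues first, then species) and probes.
label_cols = ["Blood", "Heart", "Human", "Horse"]
probe_cols = ["cg0001", "cg0002"]

samples = pd.DataFrame(
    [
        [1, 0, 1, 0, 0.80, 0.10],  # Human Blood, sample 1
        [1, 0, 1, 0, 0.90, 0.20],  # Human Blood, sample 2
        [0, 1, 0, 1, 0.30, 0.60],  # Horse Heart, single sample
    ],
    columns=label_cols + probe_cols,
)

# One row per species-tissue combination: average the probes over samples
# that share identical one-hot labels.
combo_means = samples.groupby(label_cols, sort=False).mean().reset_index()
print(combo_means)
```

The result keeps the label columns followed by the probe columns, so it has the same column layout as the individual samples.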
python3 hyperparameter_tuning.py -h
Loading arguments
usage: Hyperparameter Tuning Script [-h] [--val_seed VAL_SEED]
combo_averages training_data t_start t_end s_start s_end d_start index output_dir
Trains a CVAE model based on a hyperparameter combination and saves the performance to a file
positional arguments:
combo_averages Path to .pickle, .csv, or .tsv for observed combination mean samples
training_data Path to .pickle, .csv, or .tsv for individual training samples
t_start Position of first one-hot-encoded tissue in the training data
t_end Position of last one-hot-encoded tissue in the training data
s_start Position of first one-hot-encoded species in the training data
s_end Position of last one-hot-encoded species in the training data
d_start Position of first probe in the training data
index Index of the hyperparameter combination
output_dir Path to output where hyperparameter combination performances will be scored
optional arguments:
-h, --help show this help message and exit
--val_seed VAL_SEED Random seed for selecting the validation dataset
Below is an example using the training data and observed data provided in the example_data directory. This is designed to specifically use a val_seed of 7558 to predict human blood and horse heart. There are 828 hyperparameter combinations that can be tested. The performance result will be saved in an individual file per combination in the designated output_dir.
python3 hyperparameter_tuning.py example_data/observed_combo_mean_samples.csv.gz example_data/train.csv.gz 0 58 59 406 407 1 example_data/best/ --val_seed 7558
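Since each run evaluates a single index, covering the grid means invoking the script once per index. The sketch below builds those command lines in Python using the same arguments as the example above. The helper name `tuning_command` and the choice of indices 1 to 3 are hypothetical; whether indices start at 0 or 1 should be checked against the script.

```python
import shlex


def tuning_command(index, val_seed=7558):
    """Build the argument list for one hyperparameter_tuning.py run (hypothetical helper)."""
    return [
        "python3", "hyperparameter_tuning.py",
        "example_data/observed_combo_mean_samples.csv.gz",
        "example_data/train.csv.gz",
        "0", "58", "59", "406", "407",   # t_start t_end s_start s_end d_start
        str(index),
        "example_data/best/",
        "--val_seed", str(val_seed),
    ]


# A few indices for illustration; the full grid has 828 combinations.
commands = [tuning_command(i) for i in (1, 2, 3)]
for command in commands:
    print(shlex.join(command))
    # To actually launch a run: subprocess.run(command, check=True)
```

Printing the commands first makes it easy to hand them to a job scheduler or run them in parallel instead of sequentially.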
There are two ways to train the model: by index (the same index determined during hyperparameter selection) or by manual entry of the desired hyperparameters. If using the hyperparameter grid search from step 1, the saved files are labelled by index and contain the hyperparameter combination and the random seed used to train that model. The validation seed and random seed are optional.
python3 train_model_by_index.py -h
usage: Model Training Script [-h] [--val_seed VAL_SEED] [--seed SEED]
training_data t_start t_end s_start s_end d_start index encoder_save_loc decoder_save_loc
Trains a CVAE model based on the index of the hyperparameter combination from the grid search and saves the model to a
specified location
positional arguments:
training_data Path to .pickle, .csv, or .tsv for individual training samples
t_start Position of first one-hot-encoded tissue in the training data
t_end Position of last one-hot-encoded tissue in the training data
s_start Position of first one-hot-encoded species in the training data
s_end Position of last one-hot-encoded species in the training data
d_start Position of first probe in the training data
index Index of the hyperparameter combination
encoder_save_loc Path to location to save trained encoder model
decoder_save_loc Path to location to save trained decoder model
optional arguments:
-h, --help show this help message and exit
--val_seed VAL_SEED Random seed for selecting the validation dataset
--seed SEED Random seed used to initiate model training
Below is an example using training data provided in the example_data directory.
python3 train_model_by_index.py example_data/train.pickle 0 58 59 406 407 1 example_data/encoder_1_by_index.model example_data/decoder_1_by_index.model --val_seed 7558 --seed 6091
python3 train_model_by_params.py -h
usage: Model Training Script [-h] [--val_seed VAL_SEED] [--seed SEED]
training_data t_start t_end s_start s_end d_start encoder_save_loc decoder_save_loc n
activation_function latent_space_dimension learning_rate epsilon layout_index
Trains a CVAE model based on inputted hyperparameters and saves the model to a specified location
positional arguments:
training_data Path to .pickle, .csv, or .tsv for individual training samples
t_start Position of first one-hot-encoded tissue in the training data
t_end Position of last one-hot-encoded tissue in the training data
s_start Position of first one-hot-encoded species in the training data
s_end Position of last one-hot-encoded species in the training data
d_start Position of first probe in the training data
encoder_save_loc Path to location to save trained encoder model
decoder_save_loc Path to location to save trained decoder model
n Hidden layer dimension 2**n
activation_function Activation function for neural network training (i.e. relu, sigmoid, tanh)
latent_space_dimension
Size of encoded latent space
learning_rate Learning rate used by the Adam optimizer
epsilon Epsilon used by the Adam optimizer
layout_index Integer between 0 and 4 to select the layout of hidden layers (see paper for options)
optional arguments:
-h, --help show this help message and exit
--val_seed VAL_SEED Random seed for selecting the validation dataset
--seed SEED Random seed used to initiate model training
Below is an example using training data provided in the example_data directory.
python3 train_model_by_params.py example_data/train.csv.gz 0 58 59 406 407 example_data/encoder_1_by_params.model example_data/decoder_1_by_params.model 8 relu 2 0.001 1e-7 0 --val_seed 7558 --seed 6091
Once trained models are saved, the final imputation of species-tissue combination mean samples can be performed using the saved decoder. The combinations to be imputed can be provided in one of three ways: a testing dataset, a single species and tissue combination via command line arguments, or a file with a list of combinations. The predictions are saved to a .pickle or .csv file, with the format inferred from the output path (pred_save_loc).
python3 generate_predictions.py -h
usage: Impute Combination Mean Samples [-h] [--tissue TISSUE] [--species SPECIES]
[--input_file INPUT_FILE]
data t_start t_end s_start s_end d_start
latent_space_dimension decoder pred_save_loc
Uses a trained decoder to impute species-tissue combination mean samples
positional arguments:
data Path to .pickle, .csv, or .tsv with either testing or training data (used to
one-hot-encoded label ordering and to extract combinations to be imputed if
testing)
t_start Position of first one-hot-encoded tissue in the training data
t_end Position of last one-hot-encoded tissue in the training data
s_start Position of first one-hot-encoded species in the training data
s_end Position of last one-hot-encoded species in the training data
d_start Position of first probe in the training data
latent_space_dimension
Latent space dimension needed for input into the decoder
decoder Path to .model file for the trained decoder
pred_save_loc Path to output where predictions will be saved
optional arguments:
-h, --help show this help message and exit
--tissue TISSUE Tissue of single species-tissue combination to impute
--species SPECIES Species of single species-tissue combination to impute
--input_file INPUT_FILE
Tab-delimited file containing species-tissue combinations to impute
The species and tissue combinations are extracted from those present in a testing dataset. The testing dataset is assumed to be in the same format as the training dataset.
python3 generate_predictions.py example_data/test.pickle 0 58 59 406 407 2 example_data/decoder_1_by_index.model/ example_data/preds_from_test_data.pickle
A single species and tissue combination is provided as command line arguments using the --species and --tissue flags.
python3 generate_predictions.py example_data/train.csv.gz 0 58 59 406 407 2 example_data/decoder_1_by_index.model/ example_data/preds_from_command_line_args.csv --species Mouse --tissue Liver
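The help text notes that the data file is used for the one-hot-encoded label ordering. As a rough sketch of what that implies, a requested tissue and species could be mapped to a one-hot label vector by looking up their positions among the label columns. The column names and the `one_hot_label` helper are hypothetical, and this is not taken from generate_predictions.py.

```python
import numpy as np

# Hypothetical label column order, as it would be read from the data file header.
tissue_columns = ["Blood", "Heart", "Liver"]
species_columns = ["Horse", "Human", "Mouse"]


def one_hot_label(tissue, species):
    """Return a label vector with tissue columns first, then species columns."""
    label = np.zeros(len(tissue_columns) + len(species_columns))
    label[tissue_columns.index(tissue)] = 1.0
    label[len(tissue_columns) + species_columns.index(species)] = 1.0
    return label


label = one_hot_label("Liver", "Mouse")
print(label)
```

A name that is absent from the header raises a `ValueError` here, which is a reasonable way to catch a misspelled species or tissue before imputation.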
The species and tissue combinations to be imputed are in a tab-delimited file (with tissues in the first column and species in the second). The file is provided as a command line argument using the --input_file flag.
python3 generate_predictions.py example_data/train.pickle 0 58 59 406 407 2 example_data/decoder_1_by_params.model/ example_data/preds_from_input_file.csv --input_file example_data/combos_to_impute.txt
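A combinations file in the layout described above (tissue in the first column, species in the second, tab-delimited) could be produced as in the following sketch. The combinations listed are made up, and the absence of a header row is an assumption; compare with example_data/combos_to_impute.txt before relying on it.

```python
import csv
import io

# Hypothetical (tissue, species) pairs to impute.
combos = [("Liver", "Mouse"), ("Blood", "Human"), ("Heart", "Horse")]

# Write tab-delimited rows: tissue first, species second, no header (assumption).
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerows(combos)
file_text = buffer.getvalue()
print(file_text)

# To save to disk instead of an in-memory buffer:
# with open("combos_to_impute.txt", "w", newline="") as handle:
#     csv.writer(handle, delimiter="\t", lineterminator="\n").writerows(combos)
```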