This is the repository for our manuscript "PreMode predicts mode-of-action of missense variants by deep graph representation learning of protein sequence and structural context" posted on bioRxiv: https://www.biorxiv.org/content/10.1101/2024.02.20.581321v3
Unfortunately, the data.files/af2.files/, data.files/esm.files/, data.files/esm.MSA/, and data.files/gMVP.MSA/ folders are too large to upload to Git LFS. We provide those files on Hugging Face: https://huggingface.co/gzhong/PreMode
You can either clone this repository or the huggingface repository.
Then, unzip the files with this script:
bash unzip.files.sh
Unfortunately, we are not allowed to share the HGMD data, so we removed all pathogenic variants from HGMD (49218 in total) from the data.files/pretrain/training.* files. This might affect the plots analysis/figs/fig.sup.12.pdf and analysis/figs/fig.sup.13.pdf if you re-run the R code in the analysis/ folder.
We shared the trained weights of our models trained using HGMD instead.
For details of the training/testing files, please check the "Data, Models & Figures in our manuscript" section below.
Please install the necessary packages using
mamba env create -f PreMode.yaml
mamba env create -f r4-base.yaml
You can check the installation by running
conda activate PreMode
python train.py --conf scripts/TEST.yaml --mode train
If no error occurs, the installation was successful.
- Please prepare a folder under scripts/ and create a file named pretrain.seed.0.yaml inside the folder; see scripts/PreMode/pretrain.seed.0.yaml for an example.
- Run pretraining on the pathogenicity task:
python train.py --conf scripts/NEW_FOLDER/pretrain.seed.0.yaml
- Prepare transfer learning config files:
bash scripts/DMS.prepare.yaml.sh scripts/NEW_FOLDER/
- Run transfer learning:
If you have multiple tasks, separate them with commas in TASK_NAME, like "task_1,task_2,task_3".
bash scripts/DMS.5fold.run.sh scripts/NEW_FOLDER TASK_NAME GPU_ID
- (Optional) To reproduce the transfer learning tasks in our paper using 8 GPU cards, run
bash transfer.all.sh scripts/NEW_FOLDER
If you only have one GPU card, run
bash transfer.all.in.one.card.sh scripts/NEW_FOLDER
- Save inference results:
bash scripts/DMS.5fold.inference.sh scripts/NEW_FOLDER analysis/NEW_FOLDER TASK_NAME GPU_ID
- You'll get a folder analysis/NEW_FOLDER/TASK_NAME with 5 .csv files. Each file has 4 columns, logits.FOLD.[0-3]. Each column represents the G/LoF prediction from one cross-validation fold (closer to 0 means more likely GoF, closer to 1 means more likely LoF). We suggest averaging the predictions across the 4 columns.
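The suggested averaging can be sketched with the Python standard library. This is a minimal illustration, not code from the repository: it assumes the files are comma-separated with a header row and that the four columns are literally named logits.FOLD.0 to logits.FOLD.3. The function names are hypothetical.

```python
import csv

# Assumed column names, following the logits.FOLD.[0-3] pattern described above.
FOLD_COLUMNS = [f"logits.FOLD.{i}" for i in range(4)]


def average_fold_logits(row):
    """Return the mean of the four cross-validation predictions in one row."""
    values = [float(row[name]) for name in FOLD_COLUMNS]
    return sum(values) / len(values)


def read_averaged_logits(csv_path):
    """Read one inference .csv and return one averaged prediction per variant."""
    with open(csv_path, newline="") as handle:
        return [average_fold_logits(row) for row in csv.DictReader(handle)]
```

Calling read_averaged_logits on each of the 5 files in analysis/NEW_FOLDER/TASK_NAME gives one averaged score per variant, where values closer to 0 lean GoF and values closer to 1 lean LoF.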
- Prepare a .csv file for training and inference. There are two accepted formats:
  - Format 1 (only for missense variants):

    uniprotID | aaChg | score | ENST |
    ---|---|---|---|
    P15056 | p.V600E | 1 | ENST00000646891 |
    P15056 | p.G446V | -1 | ENST00000646891 |

    - uniprotID: the UniProt ID of the protein.
    - aaChg: the amino acid change induced by the missense variant.
    - score: 1 for GoF, -1 for LoF. It is not required for inference. For DMS, this could be experimental readouts. If you have multiplexed assays, you can change it to score.1, score.2, score.3, ..., score.N.
    - ENST (optional): the Ensembl transcript ID that matches the uniprotID.
  - Format 2 (can be a missense variant or multiple variants):

    uniprotID | ref | alt | pos.orig | score | ENST | wt.orig | sequence.len.orig |
    ---|---|---|---|---|---|---|---|
    P15056 | V | E | 600 | 1 | ENST00000646891 | ... | 766 |
    P15056 | G | V | 446 | -1 | ENST00000646891 | ... | 766 |
    P15056 | G;V | V;F | 446;471 | -1 | ENST00000646891 | ... | 766 |

    - uniprotID: the UniProt ID of the protein.
    - ref: the reference amino acid; for multiple variants, separated by ";".
    - alt: the alternative amino acid; for multiple variants, separated by ";" in the same order as "ref".
    - pos.orig: the position of the amino acid change; for multiple variants, separated by ";" in the same order as "ref".
    - score: same as above.
    - ENST (optional): same as above.
    - wt.orig: the wild-type protein sequence, in the UniProt format.
    - sequence.len.orig: the length of the wild-type protein sequence.
  - If you prepared your input in Format 1, please run
    bash parse.input.table/parse.input.table.sh YOUR_FILE TRANSFORMED_FILE
    to transform it to Format 2. Note that it will drop lines where your aaChg doesn't match the corresponding AlphaFold sequence.
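To illustrate how the two formats relate, here is a minimal sketch that splits a Format 1 aaChg string such as p.V600E into the ref, pos.orig and alt fields of Format 2, and checks the reference residue against a wild-type sequence. It is not the repository's parse.input.table.sh, which is the supported way to convert files. The function names and the regular expression are assumptions, and only one-letter codes of the 20 standard amino acids are handled.

```python
import re

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Assumed pattern: "p." + reference residue + 1-based position + alternative residue.
AA_CHANGE = re.compile(rf"^p\.([{AMINO_ACIDS}])(\d+)([{AMINO_ACIDS}])$")


def split_aa_change(aa_chg):
    """Split e.g. 'p.V600E' into (ref, pos, alt); raise ValueError otherwise."""
    match = AA_CHANGE.match(aa_chg)
    if match is None:
        raise ValueError(f"not a one-letter missense change: {aa_chg!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def matches_sequence(aa_chg, sequence):
    """Return True if the reference residue agrees with the sequence at that position."""
    ref, pos, _ = split_aa_change(aa_chg)
    return 1 <= pos <= len(sequence) and sequence[pos - 1] == ref
```

A row for which matches_sequence returns False corresponds to the kind of mismatch that the conversion script drops.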
- Prepare a config file for training the model and inference.
  bash scripts/prepare.new.task.yaml.sh PRETRAIN_MODEL_NAME YOUR_TASK_NAME YOUR_TRAINING_FILE YOUR_INFERENCE_FILE TASK_TYPE MODE_OF_ACTION_N
  - PRETRAIN_MODEL_NAME could be one of the following:
    - scripts/PreMode: Default PreMode
    - scripts/PreMode.ptm: PreMode + PTM as input
    - scripts/PreMode.noStructure: PreMode without structure input
    - scripts/PreMode.noESM: PreMode with the ESM2 input replaced by one-hot encodings of the 20 AAs
    - scripts/PreMode.noMSA: PreMode without MSA input
    - scripts/ESM.SLP: ESM embedding + Single Layer Perceptron
  - YOUR_TASK_NAME can be any name you like.
  - YOUR_TRAINING_FILE is the training file you prepared in step 1.
  - YOUR_INFERENCE_FILE is the inference file you prepared in step 1.
  - TASK_TYPE could be DMS or GLOF.
  - MODE_OF_ACTION_N is the number of dimensions of mode-of-action. For GLOF this is usually 1. For a multiplexed DMS dataset, this could be the number of biochemical properties measured. Note that if it is larger than 1, you have to make sure the score column in step 1 is replaced with score.1, score.2, ..., score.N correspondingly.
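The relationship between MODE_OF_ACTION_N and the score columns can be checked before training. The sketch below is an illustration only: it assumes a comma-separated training file with a header row, and the function names are hypothetical and not part of the repository.

```python
import csv


def expected_score_columns(mode_of_action_n):
    """Column names implied by MODE_OF_ACTION_N: 'score' for 1, else score.1..score.N."""
    if mode_of_action_n < 1:
        raise ValueError("MODE_OF_ACTION_N must be at least 1")
    if mode_of_action_n == 1:
        return ["score"]
    return [f"score.{i}" for i in range(1, mode_of_action_n + 1)]


def missing_score_columns(header, mode_of_action_n):
    """Return the expected score columns that are absent from a header."""
    expected = expected_score_columns(mode_of_action_n)
    return [name for name in expected if name not in header]


def check_training_file(csv_path, mode_of_action_n):
    """Read only the header row of a training file and list missing score columns."""
    with open(csv_path, newline="") as handle:
        header = next(csv.reader(handle))
    return missing_score_columns(header, mode_of_action_n)
```

An empty list from check_training_file means the header is consistent with the chosen MODE_OF_ACTION_N.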
- Run your config file
  conda activate PreMode
  bash scripts/run.new.task.sh PRETRAIN_MODEL_NAME YOUR_TASK_NAME OUTPUT_FOLDER GPU_ID
  This should take ~30 min on an NVIDIA A40 GPU, depending on your dataset size.
- You'll get a file in the OUTPUT_FOLDER named YOUR_TASK_NAME.inference.result.csv.
  - If your TASK_TYPE is GLOF, then the column logits will be the inference results. Closer to 0 means GoF, closer to 1 means LoF.
  - If your TASK_TYPE is DMS and MODE_OF_ACTION_N is 1, then the column logits will be the inference results. If your MODE_OF_ACTION_N is larger than 1, then you will get multiple logits.* columns, each representing a predicted DMS measurement.
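Reading the result file can be sketched as follows. This assumes the file is comma-separated with a header row and that multi-dimensional outputs use column names beginning with "logits." (the exact suffixes are not specified here, so the code matches on the prefix only). The function names are hypothetical.

```python
import csv


def prediction_columns(header):
    """Return ['logits'] if that column exists, else every column starting with 'logits.'."""
    if "logits" in header:
        return ["logits"]
    return [name for name in header if name.startswith("logits.")]


def read_predictions(csv_path):
    """Return one dict per variant, mapping each prediction column to a float."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        columns = prediction_columns(reader.fieldnames or [])
        return [{name: float(row[name]) for name in columns} for row in reader]
```

For a GLOF task each returned dict holds a single logits value; for a multiplexed DMS task it holds one value per predicted measurement.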
Here is the list of data used in our manuscript:
file | description |
---|---|
analysis/figs/ALL.csv | All curated G/LoF variants in ~1300 genes |
data.files/PTEN/ | PTEN multiplexed deep mutational scan measurements |
data.files/NUDT15/ | NUDT15 multiplexed deep mutational scan measurements |
data.files/CCR5/ | CCR5 multiplexed deep mutational scan measurements |
data.files/CXCR4/ | CXCR4 multiplexed deep mutational scan measurements |
data.files/GCK/ | GCK multiplexed deep mutational scan measurements |
data.files/SNCA/ | SNCA multiplexed deep mutational scan measurements |
data.files/ASPA/ | ASPA multiplexed deep mutational scan measurements |
data.files/CYP2C9/ | CYP2C9 multiplexed deep mutational scan measurements |
data.files/ICC.seed.*/ | G/LoF variants in 9 genes (named by uniprotID) for training/testing PreMode in our manuscript, 5 random splits |
Here is the list of models in our manuscript:
scripts/PreMode/
PreMode. It takes 250 GB of RAM and 4 NVIDIA A40 GPUs to run and finishes in ~50 h.
scripts/ESM.SLR/
Baseline Model, ESM2 (650M) + Single Layer Perceptron
scripts/PreMode.large.window/
PreMode, window size set to 1251 AA.
scripts/PreMode.noESM/
PreMode, with the ESM2 embeddings replaced by one-hot encodings of the 20 AAs.
scripts/PreMode.noMSA/
PreMode, remove the MSA input.
scripts/PreMode.noPretrain/
PreMode, without pretraining on ClinVar/HGMD.
scripts/PreMode.noStructure/
PreMode, remove the AF2 predicted structure input.
scripts/PreMode.ptm/
PreMode, with one-hot encodings of post-translational modification (PTM) sites added as input.
scripts/PreMode.mean.var/
PreMode, outputting both the predicted value (mean) and its confidence (var); used in adaptive learning tasks.
gene | file |
---|---|
BRAF | analysis/5genes.all.mut/PreMode/P15056.logits.csv |
RET | analysis/5genes.all.mut/PreMode/P07949.logits.csv |
TP53 | analysis/5genes.all.mut/PreMode/P04637.logits.csv |
KCNJ11 | analysis/5genes.all.mut/PreMode/Q14654.logits.csv |
CACNA1A | analysis/5genes.all.mut/PreMode/O00555.logits.csv |
SCN5A | analysis/5genes.all.mut/PreMode/Q14524.logits.csv |
SCN2A | analysis/5genes.all.mut/PreMode/Q99250.logits.csv |
ABCC8 | analysis/5genes.all.mut/PreMode/Q09428.logits.csv |
PTEN | analysis/5genes.all.mut/PreMode/P60484.logits.csv |
For each file, column logits.0 is the predicted pathogenicity, column logits.1 is the predicted LoF probability, and column logits.2 is the predicted GoF probability.
For PTEN, column logits.1 is the predicted stability (0 is loss, 1 is neutral) and column logits.2 is the predicted enzyme activity (0 is loss, 1 is neutral).
Please go to the analysis/ folder and run the corresponding R scripts.