EquiPNAS: improved protein-nucleic acid binding site prediction using protein-language-model-informed equivariant deep graph neural networks
by Rahmatullah Roche, Bernard Moussad, Md Hossain Shuvo, Sumit Tarafder, and Debswapna Bhattacharya
published in Nucleic Acids Research
Codebase for our improved protein-nucleic acid binding site prediction approach, EquiPNAS.
1.) We recommend using a conda virtual environment to install the dependencies for EquiPNAS. The following command creates a virtual environment named 'EquiPNAS'
conda env create -f EquiPNAS_env.yml
2.) Then activate the virtual environment
conda activate EquiPNAS
3.) Download the trained models from here
- For protein-DNA binding site prediction, use models/EquiPNAS-DNA model
- For protein-RNA binding site prediction, use models/EquiPNAS-RNA model
That's it! EquiPNAS is ready to be used.
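Before running anything, it can help to confirm that the environment from step 2 is actually active. The snippet below is a minimal sketch, not part of the EquiPNAS codebase: it relies on the `CONDA_DEFAULT_ENV` variable that `conda activate` exports, and the helper names are hypothetical.

```python
import os


def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None if none is set."""
    # `conda activate <name>` exports CONDA_DEFAULT_ENV=<name>.
    return environ.get("CONDA_DEFAULT_ENV")


def is_equipnas_env_active(environ=os.environ, expected="EquiPNAS"):
    """Check whether the environment created from EquiPNAS_env.yml is active."""
    return active_conda_env(environ) == expected


if __name__ == "__main__":
    if is_equipnas_env_active():
        print("EquiPNAS environment is active.")
    else:
        print("Run `conda activate EquiPNAS` first.")
```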
To see usage instructions, run python EquiPNAS.py -h
usage: EquiPNAS.py [-h] [--model_state_dict MODEL_STATE_DICT] [--indir INDIR] [--outdir OUTDIR] [--num_workers NUM_WORKERS]
options:
-h, --help show this help message and exit
--model_state_dict MODEL_STATE_DICT
Saved model
--indir INDIR Path to input data containing distance maps and input features (default 'datasets/DNA_test_129_Preprocessing_using_AlphaFold2/')
--outdir OUTDIR Prediction output directory
--num_workers NUM_WORKERS
Number of workers (default=4)
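For scripted use, the options listed above can be assembled into a command line programmatically. The following is a minimal sketch that uses only the flags shown in the help text; the function name `build_predict_command` is hypothetical and is not provided by EquiPNAS.

```python
import shlex


def build_predict_command(model_state_dict, indir, outdir, num_workers=4):
    """Assemble an EquiPNAS.py invocation from the documented options."""
    return [
        "python", "EquiPNAS.py",
        "--model_state_dict", model_state_dict,
        "--indir", indir,
        "--outdir", outdir,
        "--num_workers", str(num_workers),  # the help text gives 4 as the default
    ]


if __name__ == "__main__":
    cmd = build_predict_command(
        "models/EquiPNAS-DNA/E-l12-768.pt", "Preprocessing/", "output/"
    )
    # Print a shell-safe string; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Keeping the command as a list of arguments avoids quoting problems if a path contains spaces.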
Here is an example of running EquiPNAS:
1.) The input target list and all input files should be inside the input preprocessing directory (examples can be found here: Preprocessing/). Detailed preprocessing instructions can be found here
2.) Make an output directory mkdir output
3.) Run python EquiPNAS.py --model_state_dict models/EquiPNAS-DNA/E-l12-768.pt --indir Preprocessing/ --outdir output/
4.) The residue-level protein-DNA or protein-RNA binding site predictions are generated at output/.
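The example above uses the protein-DNA model; the protein-RNA case differs only in the model path. As a minimal sketch, the two helpers below pick the checkpoint for a nucleic acid type and create the output directory if it is missing. The checkpoint paths are the ones used in this README; the helper names `select_model` and `prepare_outdir` are hypothetical.

```python
from pathlib import Path

# Checkpoint paths as they appear in the commands of this README.
MODEL_PATHS = {
    "DNA": "models/EquiPNAS-DNA/E-l12-768.pt",
    "RNA": "models/EquiPNAS-RNA/E-l12-768.pt",
}


def select_model(nucleic_acid):
    """Return the checkpoint path for 'DNA' or 'RNA' (case-insensitive)."""
    key = nucleic_acid.strip().upper()
    if key not in MODEL_PATHS:
        raise ValueError(f"expected 'DNA' or 'RNA', got {nucleic_acid!r}")
    return MODEL_PATHS[key]


def prepare_outdir(outdir):
    """Create the output directory, including parents, if it does not exist."""
    path = Path(outdir)
    path.mkdir(parents=True, exist_ok=True)
    return path


if __name__ == "__main__":
    print(select_model("RNA"))
    print(prepare_outdir("output"))
```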
For protein-DNA binding site prediction, we obtain the training targets from here, and for protein-RNA binding site prediction, we obtain the training targets from here. Our full training dataset, containing the training code, lists, and features for both protein-DNA and protein-RNA, can be found here. The training procedure is as follows:
- Download the train scripts from here
- Extract the train scripts and move them to the current directory
tar -xzvf train_scripts.tar.gz
mv train_scripts/* .
To train a protein-DNA binding site prediction model on your own dataset, place the training target list and all input files inside the training data directory; the inputs can be preprocessed as described earlier here. Example training data for protein-DNA binding site prediction can be found here.
To retrain the protein-DNA binding site prediction model with our dataset, download the train features and data from here.
- Extract the train features
tar -xzvf DNA_train_data.tar.gz
- Run the train scripts:
python train_model.py --indir DNA_train_data/ --save_dir model/DNA/
The trained model will be saved inside: model/DNA/
To train a protein-RNA binding site prediction model on your own dataset, place the training target list and all input files inside the training data directory; the inputs can be preprocessed as described earlier here. Example training data for protein-RNA binding site prediction can be found here.
To retrain the protein-RNA binding site prediction model with our dataset, download the train features and data from here.
- Extract the train features
tar -xzvf RNA_train_data.tar.gz
- Run the train scripts:
python train_model.py --indir RNA_train_data/ --save_dir model/RNA/
The trained model will be saved inside: model/RNA/
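The DNA and RNA retraining steps follow the same pattern and differ only in the dataset prefix. A minimal sketch that makes this explicit is shown below; it only builds the commands given above and does not run them. The function name `training_plan` is hypothetical.

```python
def training_plan(nucleic_acid):
    """Return the extract and train commands for 'DNA' or 'RNA' retraining."""
    kind = nucleic_acid.strip().upper()
    if kind not in ("DNA", "RNA"):
        raise ValueError(f"expected 'DNA' or 'RNA', got {nucleic_acid!r}")
    archive = f"{kind}_train_data.tar.gz"
    return {
        "archive": archive,
        # Same tar invocation as in the steps above.
        "extract": ["tar", "-xzvf", archive],
        # train_model.py flags are the ones used in this README.
        "train": [
            "python", "train_model.py",
            "--indir", f"{kind}_train_data/",
            "--save_dir", f"model/{kind}/",
        ],
    }


if __name__ == "__main__":
    for kind in ("DNA", "RNA"):
        plan = training_plan(kind)
        print(" ".join(plan["extract"]))
        print(" ".join(plan["train"]))
```

Each command list can be passed to `subprocess.run(..., check=True)` once the corresponding archive has been downloaded.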
For protein-DNA binding site prediction, we obtain the test targets for Test_129 from here, and for Test_181 from here. For protein-RNA binding site prediction, we obtain the test targets from here. Our full test dataset, containing the test lists and features for all the benchmarking datasets, can be found here. The procedure for test set benchmarking is as follows:
- First, download the trained models from here
- Extract the models
tar -xzvf models.tar.gz
- Download the DNA Test_129 test list, data, and features (preprocessed using AlphaFold2) from here
- Extract the features
tar -xzvf DNA_test_129_Preprocessing_using_AlphaFold2.tar.gz
- Create the output prediction directory
mkdir -p outputs/DNA_test_129_predictions_using_AlphaFold2/
- Run EquiPNAS prediction using the pretrained protein-DNA model
python EquiPNAS.py --model_state_dict models/EquiPNAS-DNA/E-l12-768.pt --indir DNA_test_129_Preprocessing_using_AlphaFold2/ --outdir outputs/DNA_test_129_predictions_using_AlphaFold2/
- Download the DNA Test_129 test list, data, and features (preprocessed using native structures) from here
- Extract the features
tar -xzvf DNA_test_129_Preprocessing_using_native.tar.gz
- Create the output prediction directory
mkdir -p outputs/DNA_test_129_predictions_using_native/
- Run EquiPNAS prediction using the pretrained protein-DNA model
python EquiPNAS.py --model_state_dict models/EquiPNAS-DNA/E-l12-768.pt --indir DNA_test_129_Preprocessing_using_native/ --outdir outputs/DNA_test_129_predictions_using_native/
- Download the DNA Test_181 test list, data, and features (preprocessed using AlphaFold2) from here
- Extract the features
tar -xzvf DNA_test_181_Preprocessing_using_AlphaFold2.tar.gz
- Create the output prediction directory
mkdir -p outputs/DNA_test_181_predictions_using_AlphaFold2/
- Run EquiPNAS prediction using the pretrained protein-DNA model
python EquiPNAS.py --model_state_dict models/EquiPNAS-DNA/E-l12-768.pt --indir DNA_test_181_Preprocessing_using_AlphaFold2/ --outdir outputs/DNA_test_181_predictions_using_AlphaFold2/
- Download the DNA Test_181 test list, data, and features (preprocessed using native structures) from here
- Extract the features
tar -xzvf DNA_test_181_Preprocessing_using_native.tar.gz
- Create the output prediction directory
mkdir -p outputs/DNA_test_181_predictions_using_native/
- Run EquiPNAS prediction using the pretrained protein-DNA model
python EquiPNAS.py --model_state_dict models/EquiPNAS-DNA/E-l12-768.pt --indir DNA_test_181_Preprocessing_using_native/ --outdir outputs/DNA_test_181_predictions_using_native/
- Download the RNA Test_117 test list, data, and features (preprocessed using AlphaFold2) from here
- Extract the features
tar -xzvf RNA_test_117_Preprocessing_using_AlphaFold2.tar.gz
- Create the output prediction directory
mkdir -p outputs/RNA_test_117_predictions_using_AlphaFold2/
- Run EquiPNAS prediction using the pretrained protein-RNA model
python EquiPNAS.py --model_state_dict models/EquiPNAS-RNA/E-l12-768.pt --indir RNA_test_117_Preprocessing_using_AlphaFold2/ --outdir outputs/RNA_test_117_predictions_using_AlphaFold2/
- Download the RNA Test_117 test list, data, and features (preprocessed using native structures) from here
- Extract the features
tar -xzvf RNA_test_117_Preprocessing_using_native.tar.gz
- Create the output prediction directory
mkdir -p outputs/RNA_test_117_predictions_using_native/
- Run EquiPNAS prediction using the pretrained protein-RNA model
python EquiPNAS.py --model_state_dict models/EquiPNAS-RNA/E-l12-768.pt --indir RNA_test_117_Preprocessing_using_native/ --outdir outputs/RNA_test_117_predictions_using_native/
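The six benchmark runs above share one naming pattern: nucleic acid type, test set, and structure source. As a minimal sketch, the script below enumerates them and builds the matching directories and commands. It assumes the archives have already been downloaded and extracted into the current directory; the names `BENCHMARKS`, `STRUCTURES`, and `benchmark_runs` are hypothetical and not part of EquiPNAS.

```python
# (nucleic acid type, test set) pairs and structure sources used in this README.
BENCHMARKS = [("DNA", "test_129"), ("DNA", "test_181"), ("RNA", "test_117")]
STRUCTURES = ["AlphaFold2", "native"]


def benchmark_runs():
    """Return one dict per benchmark run with its directories and command."""
    runs = []
    for kind, test_set in BENCHMARKS:
        for structure in STRUCTURES:
            name = f"{kind}_{test_set}"
            indir = f"{name}_Preprocessing_using_{structure}/"
            outdir = f"outputs/{name}_predictions_using_{structure}/"
            model = f"models/EquiPNAS-{kind}/E-l12-768.pt"
            runs.append({
                "indir": indir,
                "outdir": outdir,
                "command": [
                    "python", "EquiPNAS.py",
                    "--model_state_dict", model,
                    "--indir", indir,
                    "--outdir", outdir,
                ],
            })
    return runs


if __name__ == "__main__":
    for run in benchmark_runs():
        # Printed for review; create run["outdir"] and execute run["command"]
        # with subprocess.run once the inputs are in place.
        print(" ".join(run["command"]))
```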