The project can be set up using conda. After cloning the project, run the following commands in its root folder:
conda env create -f environment.yml
conda activate birds
python -m pip install -e .
Run the install.py script to download the required files and extract them automatically:
python install.py
If the downloads fail, download the following files manually to the root directory of the project and then run the script again:
- The files required by the model for training/testing can be downloaded from primary, backup
- The models that were trained and used in the paper can be downloaded from primary, backup
cd birds
# View all the supported options for training
python train.py --help
# Train the default model (with parameters) used in the paper for fold 0
python train.py --fold 0
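Training several folds in a row can be scripted. The sketch below is a minimal illustration, not part of the project: the helper names `build_train_commands` and `train_folds` are hypothetical, the list of folds is something you supply yourself (the number of folds is not stated here), and it assumes `train.py` is run from inside the `birds` directory as shown above.

```python
import subprocess
import sys


def build_train_commands(folds):
    """Return one `python train.py --fold N` argument list per requested fold."""
    return [[sys.executable, "train.py", "--fold", str(fold)] for fold in folds]


def train_folds(folds, workdir="birds"):
    """Run the training commands one after another (hypothetical helper).

    Stops at the first failing run, because check=True raises
    subprocess.CalledProcessError on a non-zero exit code.
    """
    for command in build_train_commands(folds):
        subprocess.run(command, cwd=workdir, check=True)
```

Calling `train_folds([0, 1])` from the project root would run the fold 0 command shown above, followed by the same command for fold 1.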
cd birds
# Test the models
python test.py
# Run the models on their validation sets
python test.py --validate
To visualize the results on the test set:
python install.py --visualize
cd birds/visualize
python visualize.py
uniref50.fasta and uniclust30_2017_10_hhsuite are needed for the generation of MSAs
python install.py --predict
Predictions require the following layout. Pick a random PDB code of 4 alphanumeric characters and a 1-character structure number for your protein sequence. Assuming the PDB code is abde and the structure number is 1, create the directory and sequence file as follows:
mkdir -p ./data/predict/raw/abde_1/
touch ./data/predict/raw/abde_1/sequence.fasta
In sequence.fasta, put the sequence of each chain in the format shown below:
>ABDE:A|PDBID|CHAIN|SEQUENCE
NSELDRLSKDDRNWVMQTKDYSATHFSRLTEINSHNVKNLKVAWTLSTGTLHGHEGAPLVVDGIMYIHTPFPNNVYAVDLNDTRKMLWQYKPKQNPAARAVACCDVVNRGLAYVPAGEHGPAKIFLNQLDGHIVALNAKTGEEIWKMENSDIAMGSTLTGAPFVVKDKVLVGSAGAELGVRGYVTAYNIKDGKQEWRAYATGPDEDLLLDKDFNKDNPHYGQFGLGLSTWEGDAWKIGGGTNWGWYAYDPKLDMIYYGSGNPAPWNETMRPGDNKWTMTIWGRDADTGRAKFGYQKTPHDEWDYAGVNYMGLSEQEVDGKLTPLLTHPDRNGLVYTLNRETGALVNAFKIDDTVNWVKKVDLKTGLPIRDPEYSTRMDHNAKGICPSAMGYHNQGIESYDPDKKLFFMGVNHICMDWEPFMLPYRAGQFFVGATLNMYPGPKGMLGQVKAMNAVTGKMEWEVPEKFAVWGGTLATAGDLVFYGTLDGFIKARDTRTGELKWQFQLPSGVIGHPITYQHNGKQYIAIYSGVGGWPGVGLVFDLKDPTAGLGAVGAFRELAHYTQMGGSVFVFSL
>ABDE:B|PDBID|CHAIN|SEQUENCE
YDGTHCKAPGNCWEPKPGYPDKVAGSKYDPKHDPNELNKQAESIKAMEARNQKRVENYAKTGKFVYKVEDIK
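The directory and FASTA layout described above can also be generated programmatically. The following is a minimal sketch under stated assumptions: the helper names `make_entry_id` and `write_sequence_fasta` are hypothetical and not part of the project, the code is lowercased in the directory name and uppercased in the FASTA header (mirroring the abde / ABDE example), and a digit is used for the structure number.

```python
import random
import string
from pathlib import Path


def make_entry_id(rng=random):
    """Return a (pdb_code, structure_number) pair such as ("abde", "1")."""
    alphabet = string.ascii_lowercase + string.digits
    pdb_code = "".join(rng.choice(alphabet) for _ in range(4))
    structure_number = rng.choice(string.digits)  # assumption: a single digit
    return pdb_code, structure_number


def write_sequence_fasta(root, pdb_code, structure_number, chains):
    """Create <root>/<code>_<number>/sequence.fasta with one record per chain.

    `chains` maps a chain ID (e.g. "A") to its amino-acid sequence.
    """
    entry_dir = Path(root) / f"{pdb_code}_{structure_number}"
    entry_dir.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
    lines = []
    for chain_id, sequence in chains.items():
        # Header layout copied from the example: >ABDE:A|PDBID|CHAIN|SEQUENCE
        lines.append(f">{pdb_code.upper()}:{chain_id}|PDBID|CHAIN|SEQUENCE")
        lines.append(sequence)
    fasta_path = entry_dir / "sequence.fasta"
    fasta_path.write_text("\n".join(lines) + "\n")
    return fasta_path
```

For instance, `write_sequence_fasta("./data/predict/raw", "abde", "1", {"A": "...", "B": "..."})` with your own sequences would create `./data/predict/raw/abde_1/sequence.fasta` containing two records.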
Please note that predictions take time because they depend on the generation of MSAs. Verbose logging and some speed-up optimizations are in place. If prediction is taking too long, follow the instructions in msa_generator to generate MSAs for a large number of sequences.
cd birds
python predict.py