This repository provides the source code for our multi-task deep learning framework for the simultaneous prediction of protein secondary structure populations (SSPs) and intrinsically disordered proteins (IDPs) and regions (IDRs). The related manuscript, entitled “Simultaneous prediction of protein secondary structure population and intrinsic disorder using multi-task deep learning”, has been submitted to Bioinformatics.
The deep learning implementation is based on TensorFlow. Please refer to the manuscript for a detailed description of the single-task and multi-task frameworks for predicting SSPs and IDP/IDRs.
The input feature, the position-specific scoring matrix (PSSM), was generated using PSI-BLAST and parsed using the module 'chkparse' from the s2D method. The NCBI database uniref90filt.fasta.zip that PSI-BLAST searched against to generate the PSSM profiles was downloaded from the online server of the s2D method.
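The PSSM generation step described above can be sketched in Python as follows. This is a minimal illustration, not the repository's actual code: the function name `build_pssm_commands`, the iteration count, the file names and the `./chkparse` location are assumptions. The `psiblast` options shown are standard BLAST+ options, and `chkparse` is assumed to take the checkpoint file as its only argument.

```python
import os


def build_pssm_commands(fasta_path, database, tmp_dir,
                        psiblast="psiblast", chkparse="./chkparse",
                        iterations=3):
    """Return the command lines for a PSI-BLAST run and a chkparse run.

    Nothing is executed here; the caller decides how to run the commands.
    """
    stem = os.path.splitext(os.path.basename(fasta_path))[0]
    chk_path = os.path.join(tmp_dir, stem + ".chk")
    psiblast_cmd = [
        psiblast,
        "-query", fasta_path,
        "-db", database,
        "-num_iterations", str(iterations),  # assumed value
        "-out_pssm", chk_path,               # checkpoint (.chk) file
    ]
    # Assumption: chkparse reads the checkpoint file given as its argument.
    chkparse_cmd = [chkparse, chk_path]
    return psiblast_cmd, chkparse_cmd


psiblast_cmd, chkparse_cmd = build_pssm_commands(
    "test.fasta", "uniref90filt.fasta", "tmp")
print(" ".join(psiblast_cmd))
print(" ".join(chkparse_cmd))
```

Building the commands as lists keeps them safe to pass to `subprocess` without shell quoting issues.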
To run the scripts in this repository, you need to install:
- Python >= 2.7.x
- NumPy >= 1.3
- TensorFlow >= 1.3
- matplotlib >= 2.1.0
- subprocess >= 2.4
- argparse >= 3.2
This program has been tested on macOS and GNU/Linux.
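A quick way to see whether the listed dependencies are available is sketched below. The helper name `missing_modules` is hypothetical and not part of this repository. The check only tests whether each module can be imported; it does not compare version numbers.

```python
import importlib
import sys


def missing_modules(names):
    """Return the entries of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


required = ["numpy", "tensorflow", "matplotlib", "subprocess", "argparse"]
print("Python version: {0}.{1}".format(*sys.version_info[:2]))
print("Missing modules: {0}".format(missing_modules(required) or "none"))
```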
- Set the parameter blast_path in dist/config.py to the path of your psiblast binary.
- Set the parameter uniref90_psi_blast_database in dist/config.py to the path of the compiled psiblast database, if any.
- If the psiblast database is not compiled yet, please compile the database before step 2.
- Set the parameter gcc_path if you use a different command for running gcc.
- Set the parameter tmp_path if you want to save the generated .chk files from PSI-BLAST to a different directory.
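As an illustration of the settings above, dist/config.py might look roughly like the sketch below. The four parameter names come from this README; every value is a placeholder chosen for the example and must be replaced with the paths on your own system.

```python
# Hypothetical contents of dist/config.py; every value is a placeholder.

# Path to the psiblast binary.
blast_path = "/usr/local/bin/psiblast"

# Path to the compiled PSI-BLAST database.
uniref90_psi_blast_database = "/data/blastdb/uniref90filt.fasta"

# Command used to invoke gcc.
gcc_path = "gcc"

# Directory where the .chk files generated by PSI-BLAST are saved.
tmp_path = "../tmp"
```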
- For detailed instructions on the input and output parameters, enter the directory dist and run the following command:
$ python run_prediction.py -h
This command describes the input parameters and the output specifications in detail.
A brief introduction to the input and output parameters is given below.
-i INPUT Input file in FASTA format, required. Prepare an input file of protein sequences in FASTA format.
-o OUTPUT Output file for storing the generated results, optional. If set, the predicted results will be written to the file specified here.
-v Visualise the results, optional. If set, a graph of the predicted results will be generated and saved to ../figure/visualisation.pdf by default.
-f Output path of the generated visualisation, optional; only available when -v is set. If set, the generated graph will be saved to the path specified here.
-s Generate additional results from the single-task framework (DeepS2D-D) for IDP/IDR prediction as a comparison, optional.
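The options listed above could be declared with `argparse` along the following lines. This is a sketch written from the README description only; the parser in run_prediction.py may differ, and the attribute names (`input`, `output`, `visualise`, `figure`, `single_task`) are assumptions made for the example.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line options."""
    parser = argparse.ArgumentParser(
        description="Sketch of the documented command-line options.")
    parser.add_argument("-i", dest="input", metavar="INPUT", required=True,
                        help="input file in FASTA format")
    parser.add_argument("-o", dest="output", metavar="OUTPUT", default=None,
                        help="file to write the predicted results to")
    parser.add_argument("-v", dest="visualise", action="store_true",
                        help="generate a graph of the predicted results")
    parser.add_argument("-f", dest="figure",
                        default="../figure/visualisation.pdf",
                        help="output path of the graph (used with -v)")
    parser.add_argument("-s", dest="single_task", action="store_true",
                        help="also run the single-task model (DeepS2D-D)")
    return parser


args = build_parser().parse_args(["-i", "test.fasta", "-s", "-v"])
print(args.input, args.output, args.visualise, args.figure, args.single_task)
```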
- To predict the SSPs and IDP/IDRs by using the multi-task deep learning model, run the following command:
$ python run_prediction.py -i test.fasta
- To predict the SSPs and IDP/IDRs by using the multi-task deep learning model and save the predicted results to output.txt, run the following command:
$ python run_prediction.py -i test.fasta -o output.txt
- To predict the SSPs and IDP/IDRs by using the multi-task deep learning model, save the predicted results to output.txt and generate the visualisation of the results, run the following command:
$ python run_prediction.py -i test.fasta -o output.txt -v
- To predict the SSPs and IDP/IDRs by using the multi-task deep learning model and the IDP/IDRs by using the single-task deep learning model, run the following command:
$ python run_prediction.py -i test.fasta -s
- To predict the SSPs and IDP/IDRs by using the multi-task deep learning model and the IDP/IDRs by using the single-task deep learning model, and visualise the compared results in graphs, run the following command:
$ python run_prediction.py -i test.fasta -s -v
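When several FASTA files need to be processed, the command lines above can be assembled from Python. The helper `prediction_command` below is a hypothetical convenience function, not part of this repository; it only combines the documented flags and assumes the script is named run_prediction.py, as in the help command above.

```python
def prediction_command(fasta, output=None, visualise=False,
                       figure=None, single_task=False, python="python"):
    """Build the run_prediction.py command line for the given options."""
    cmd = [python, "run_prediction.py", "-i", fasta]
    if output:
        cmd += ["-o", output]
    if visualise:
        cmd.append("-v")
        if figure:  # -f is only meaningful together with -v
            cmd += ["-f", figure]
    if single_task:
        cmd.append("-s")
    return cmd


cmd = prediction_command("test.fasta", output="output.txt", visualise=True)
print(" ".join(cmd))
# To execute the command from the dist directory (hypothetical usage):
#   import subprocess
#   subprocess.check_call(cmd, cwd="dist")
```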
This project is licensed under the GNU GPLv3.