This project is a protein analysis pipeline that enables quick and comprehensive analysis of a protein sequence or structure.
The pipeline accepts as input a protein sequence in FASTA format or a protein structure in PDB format. If a PDB file is not provided, the 3D structure of the protein can be predicted using AlphaFold2.
A sequence-based analysis can be performed, including determining physicochemical properties and searching the protein sequence against databases such as Pfam and Swiss-Prot/UniRef90.
A structure-based analysis can be performed, including predicting the effect of amino acid substitutions on protein stability using SimBa2 and detecting binding pockets using P2Rank.
A list of ligands can be specified in the PDB/MOL2 format. The binding affinity of the protein-ligand interactions can be predicted using AutoDock Vina.
The outputs obtained during each process are aggregated into a MultiQC HTML report. The analysis report presents the results in an interactive manner, including visualizing the three-dimensional protein structure using the iCn3D web viewer. The pipeline is being developed using Nextflow.
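As an illustration of the sequence-based analysis mentioned above, the following is a minimal, standalone sketch of how a few physicochemical properties could be derived from a FASTA input. It does not reproduce the pipeline's own implementation: the function names `read_first_fasta_sequence` and `sequence_properties` are hypothetical, and the choice of properties (length, amino acid composition and the GRAVY hydropathy index on the Kyte-Doolittle scale) is an assumption made for the example.

```python
from collections import Counter

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def read_first_fasta_sequence(text):
    """Return the first record of FASTA-formatted text as one uppercase string."""
    residues = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header:
                break  # a second record starts here; stop reading
            seen_header = True
            continue
        residues.append(line.upper())
    return "".join(residues)


def sequence_properties(sequence):
    """Compute length, GRAVY and fractional composition of a protein sequence."""
    if not sequence:
        raise ValueError("empty sequence")
    unknown = set(sequence) - set(KYTE_DOOLITTLE)
    if unknown:
        raise ValueError(f"non-standard residues: {sorted(unknown)}")
    length = len(sequence)
    counts = Counter(sequence)
    # GRAVY is the mean hydropathy over all residues.
    gravy = sum(KYTE_DOOLITTLE[aa] for aa in sequence) / length
    composition = {aa: counts[aa] / length for aa in sorted(counts)}
    return {"length": length, "gravy": gravy, "composition": composition}
```

Negative GRAVY values indicate a predominantly hydrophilic sequence, positive values a predominantly hydrophobic one.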
IMPORTANT: This project is under active development. Currently, the pipeline can only be used by manually installing the required packages and tools. Support for Docker/Singularity containers is a high-priority future update.
- Python ≥ 3.8
- Java ≥ 11
- git, pip, conda/mamba
- The following is the minimum setup required to run the pipeline. It enables the sequence properties and 3D structure viewer components.
conda config --env --add channels conda-forge
conda config --env --add channels anaconda
conda config --env --add channels bioconda
conda install 'numpy>=1.18.5' 'pandas>=1.4.1' 'biopython>=1.76' 'multiqc>=1.12' pymol-open-source=2.5.0 graphviz
conda install -c salilab dssp=3.0.0
- Install Nextflow and add the executable to PATH:
curl -fsSL | bash
- Clone this repository:
git clone
The following analysis steps are optional and their installation can be skipped if so desired.
- AlphaFold2 is used to predict the protein structure if a PDB file is not provided. We use a local instance of ColabFold in order to avoid the large databases required by the native AlphaFold2 implementation.
- Follow the install instructions at:
- If ColabFold is not installed, set skipAlphaFold = true inside nextflow.config. If structural analysis is desired, a PDB structure must then be provided using the --pdb argument.
- The Pfam database is used to match the protein sequence against protein families.
- Download the Pfam database (~1.5 GB), which consists of the files Pfam-A.hmm and Pfam-A.hmm.dat:
- Install the following packages:
conda install pfam_scan perl-json
- Inside the database directory execute:
hmmpress Pfam-A.hmm
- Set path variable inside nextflow.config
params {
- If Pfam is not used, set skipPfamSearch = true inside nextflow.config.
- The protein sequence conservation can be computed by generating a multiple sequence alignment (MSA) from a sequence database.
- Install BLAST, MUSCLE and CD-HIT:
conda install blast muscle cd-hit
Create a BLAST database from which to generate the MSA file. You can use any FASTA database, such as SwissProt, TrEMBL, UniRef50 or UniRef90.
Example of setting up UniRef90 as a BLAST database:
curl {} | gunzip | makeblastdb -out uniref90 -dbtype prot -title UniRef90 -parse_seqids
- Set path variable inside nextflow.config
params {
- Warning: SwissProt might not return enough hits to calculate the sequence conservation, while downloading large databases such as TrEMBL and UniRef90 might take a long time.
- If this component is not used, set skipConservationMSA = true inside nextflow.config.
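To make the idea of computing conservation from an MSA concrete, here is a minimal sketch that scores each alignment column by its normalized Shannon entropy. This is one common conservation measure and is shown only as an assumption; the scoring scheme actually used by the pipeline is not described here, and the function name `column_conservation` is hypothetical.

```python
import math
from collections import Counter


def column_conservation(aligned_sequences, gap_chars="-."):
    """Score each MSA column from 0.0 (variable) to 1.0 (fully conserved).

    The score is 1 - H / log2(20), where H is the Shannon entropy of the
    residue frequencies in the column. Gaps are ignored; a column made
    up only of gaps scores 0.0.
    """
    if not aligned_sequences:
        raise ValueError("the alignment is empty")
    width = len(aligned_sequences[0])
    if any(len(seq) != width for seq in aligned_sequences):
        raise ValueError("aligned sequences must all have the same length")

    max_entropy = math.log2(20)  # entropy of 20 equally frequent residues
    scores = []
    for column in zip(*aligned_sequences):
        residues = [c.upper() for c in column if c not in gap_chars]
        if not residues:
            scores.append(0.0)
            continue
        total = len(residues)
        counts = Counter(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        # Clamp in case ambiguity codes push the entropy above log2(20).
        scores.append(max(0.0, 1.0 - entropy / max_entropy))
    return scores
```

With very few homologous hits, most columns contain only a handful of residues and the scores become unreliable, which is why a database returning too few hits is a concern.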
- SimBa2 is used to predict the effect of amino-acid substitutions on protein stability.
python -m pip install
- If SimBa2 is not installed, set skipStabilityChanges = true inside nextflow.config.
- P2Rank is used to predict ligand-binding pockets from the protein structure.
- Follow the install instructions at:
- Pillow is required for generating the pocket visualizations:
conda install pillow
- If P2Rank is not installed, set skipPocketPrediction = true inside nextflow.config.
- AutoDock Vina is used to dock the ligands provided with the --ligands argument.
- Install the ADFR suite (MSMS is not required)
- Install Vina python bindings:
pip install vina
- If the pocket prediction step has been executed beforehand, Vina will dock the ligands against each of the predicted pockets; otherwise, it will only perform blind docking.
- If Vina is not installed, set skipMolecularDocking = true inside nextflow.config.
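Blind docking means the search space covers the whole protein instead of a single predicted pocket. A minimal sketch of how such a search box could be derived from a PDB file is shown below. It is an illustration under stated assumptions, not the pipeline's code: the function name `blind_docking_box` and the default padding of 5 Å are hypothetical choices.

```python
def blind_docking_box(pdb_text, padding=5.0):
    """Return (center, size) of a box enclosing every ATOM record.

    Both values are [x, y, z] lists in angstroms. `padding` is added on
    each side of the protein's bounding box so that ligands can be
    placed on the surface.
    """
    coords = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            # PDB is a fixed-column format: x, y, z occupy columns 31-54.
            coords.append(
                (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            )
    if not coords:
        raise ValueError("no ATOM records found")

    mins = [min(axis) for axis in zip(*coords)]
    maxs = [max(axis) for axis in zip(*coords)]
    center = [(lo + hi) / 2.0 for lo, hi in zip(mins, maxs)]
    size = [(hi - lo) + 2.0 * padding for lo, hi in zip(mins, maxs)]
    return center, size
```

The resulting center and size could then be handed to a docking call, for example as the `center` and `box_size` arguments of `compute_vina_maps` in the Vina Python bindings. Docking against a predicted pocket follows the same pattern with a smaller box centered on the pocket.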
Usage example:
nextflow run --fasta input.fasta
nextflow run path/to/ --pdb input.pdb --ligands "ligand1.mol2 path/to/ligand2.pdb"
For a full list of parameters run:
nextflow run --help
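In the usage example above, the --ligands argument is a single quoted string containing several space-separated file paths. The sketch below shows one way such a value could be split and checked against the supported PDB and MOL2 formats. It is a hypothetical helper (`split_ligands_argument` is not part of the pipeline) and assumes that the file extension identifies the format.

```python
import shlex
from pathlib import PurePosixPath

# Ligand formats named in this README.
SUPPORTED_LIGAND_SUFFIXES = {".pdb", ".mol2"}


def split_ligands_argument(value):
    """Split a --ligands string into a list of (path, format) pairs."""
    ligands = []
    # shlex.split honours shell-style quoting, so paths with spaces can be quoted.
    for path in shlex.split(value):
        suffix = PurePosixPath(path).suffix.lower()
        if suffix not in SUPPORTED_LIGAND_SUFFIXES:
            raise ValueError(f"unsupported ligand format: {path}")
        ligands.append((path, suffix.lstrip(".")))
    return ligands
```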
- Maghiar Octavian-Florin
If you find this work useful, please cite the relevant tools.
AlphaFold2: Jumper, J. et al. (2021) ‘Highly accurate protein structure prediction with AlphaFold’, Nature, 596(7873), pp. 583–589. Available at:
ColabFold: Mirdita, M. et al. (2021) ColabFold - Making protein folding accessible to all. preprint. Bioinformatics. Available at:
P2Rank: Krivák, R. and Hoksza, D. (2018) ‘P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure’, Journal of Cheminformatics, 10(1), p. 39. Available at:
Jendele, L. et al. (2019) ‘PrankWeb: a web server for ligand binding site prediction and visualization’, Nucleic Acids Research, 47(W1), pp. W345–W349. Available at:
AutoDock Vina: Eberhardt, J. et al. (2021) ‘AutoDock Vina 1.2.0: New Docking Methods, Expanded Force Field, and Python Bindings’, Journal of Chemical Information and Modeling, 61(8), pp. 3891–3898. Available at:
Trott, O. and Olson, A.J. (2009) ‘AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading’, Journal of Computational Chemistry, p. NA-NA. Available at:
SimBa2: Bæk, K.T. and Kepp, K.P. (2022) ‘Data set and fitting dependencies when estimating protein mutant stability: Toward simple, balanced, and interpretable models’, Journal of Computational Chemistry, 43(8), pp. 504–518. Available at:
iCn3D: Wang, J. et al. (2020) ‘iCn3D, a web-based 3D viewer for sharing 1D/2D/3D representations of biomolecular structures’, Bioinformatics. Edited by A. Valencia, 36(1), pp. 131–135. Available at:
Nextflow: Di Tommaso, P. et al. (2017) ‘Nextflow enables reproducible computational workflows’, Nature Biotechnology, 35(4), pp. 316–319. Available at:
MultiQC: Ewels, P. et al. (2016) ‘MultiQC: summarize analysis results for multiple tools and samples in a single report’, Bioinformatics, 32(19), pp. 3047–3048. Available at:
BLAST: Camacho, C. et al. (2009) ‘BLAST+: architecture and applications’, BMC Bioinformatics, 10(1), p. 421. Available at:
CD-HIT: Fu, L. et al. (2012) ‘CD-HIT: accelerated for clustering the next-generation sequencing data’, Bioinformatics, 28(23), pp. 3150–3152. Available at:
MUSCLE: Edgar, R.C. (2004) ‘MUSCLE: multiple sequence alignment with high accuracy and high throughput’, Nucleic Acids Research, 32(5), pp. 1792–1797. Available at: