This repository contains the implementation of machine learning models developed to predict the thermostability of proteins from their amino acid sequences, a key trait that underpins many application-specific functions in computational biology and protein engineering.
Background: Thermostability is a critical feature in proteins that influences their effectiveness in various applications, including enzymatic processes and directed evolution. Accurately predicting protein thermostability poses a significant challenge due to the complex biological factors involved, such as the impact of minor amino acid substitutions on stability.
Motivation: Understanding and predicting thermostability is crucial for protein engineering, especially in scenarios where proteins need to operate under high-temperature conditions or are used as starting points for directed evolution. The ability to predict thermostability could lead to advancements in protein design, with implications for industrial and medical applications.
Dataset: The dataset was curated from a mass spectrometry-based assay measuring protein melting curves. It includes a diverse set of protein sequences exhibiting both global and local variations, providing a comprehensive basis for understanding how sequence variation influences thermostability.
This repository contains the following files:
- Final_Model.py
- MLP_Regressor.py
- XGBoost_RandomForest_Regressor.py
- README.md
- train.csv
- test.csv
Final_Model.py implements a Support Vector Regressor (SVR) with an RBF kernel to predict protein thermostability from amino acid composition features. It was chosen as the final model for its ability to handle high-dimensional data and model complex, non-linear relationships.
- Amino Acid Composition: Calculates the proportion of each amino acid in the sequence.
- Normalization: Uses `StandardScaler` to normalize the amino acid composition features.
- Cross-Validation: Uses 5-fold cross-validation to evaluate model performance.
- Spearman Correlation: Custom function to calculate Spearman correlation as the performance metric.
python Final_Model.py
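Below is a minimal sketch of the pipeline described above. The column names (`sequence`, `tm`) are assumptions rather than values taken from Final_Model.py, and `scipy.stats.spearmanr` is used here in place of the script's custom Spearman function:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq: str) -> np.ndarray:
    """Proportion of each of the 20 standard amino acids in the sequence."""
    seq = seq.upper()
    counts = np.array([seq.count(aa) for aa in AMINO_ACIDS], dtype=float)
    return counts / max(len(seq), 1)

train = pd.read_csv("train.csv")  # assumed columns: sequence, tm
X = np.vstack([aa_composition(s) for s in train["sequence"]])
y = train["tm"].values

scores = []
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    scaler = StandardScaler().fit(X[tr_idx])              # normalize composition features
    svr = SVR(kernel="rbf").fit(scaler.transform(X[tr_idx]), y[tr_idx])
    preds = svr.predict(scaler.transform(X[val_idx]))
    rho, _ = spearmanr(preds, y[val_idx])                  # rank correlation as the metric
    scores.append(rho)

print(f"Mean 5-fold Spearman correlation: {np.mean(scores):.3f}")
```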
MLP_Regressor.py uses a Multi-Layer Perceptron (MLP) to predict protein thermostability. The model starts from embeddings generated by Prot-BERT, a pre-trained transformer model that captures deep contextual representations of protein sequences. These embeddings are used as the input features for the MLP.
- Prot-BERT Embeddings: Utilizes Prot-BERT to embed the protein sequences.
- MLP Architecture: The MLP consists of three fully connected layers with ReLU activation and dropout for regularization.
- Batch Normalization: Applied after the first two dense layers to stabilize learning and improve generalization.
- Training and Validation: The model is trained for 100 epochs using a batch size of 32, with performance tracked on a validation set.
python MLP_Regressor.py
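The following is a minimal, illustrative sketch of such an MLP head in PyTorch. The layer sizes, dropout rate, optimizer, and learning rate are assumptions rather than the exact values in MLP_Regressor.py, and random tensors stand in for precomputed Prot-BERT embeddings (Rostlab/prot_bert produces 1024-dimensional representations):

```python
import torch
import torch.nn as nn

EMBED_DIM = 1024  # dimensionality of pooled Prot-BERT embeddings

class MLPRegressor(nn.Module):
    def __init__(self, in_dim=EMBED_DIM, hidden=(512, 128), dropout=0.3):
        super().__init__()
        # Three fully connected layers; BatchNorm after the first two, ReLU + dropout for regularization.
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden[0]), nn.BatchNorm1d(hidden[0]), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden[0], hidden[1]), nn.BatchNorm1d(hidden[1]), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden[1], 1),  # final layer outputs a scalar thermostability prediction
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

# Placeholder tensors standing in for precomputed Prot-BERT embeddings and targets.
X = torch.randn(256, EMBED_DIM)
y = torch.randn(256)

model = MLPRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=32, shuffle=True
)

for epoch in range(100):  # 100 epochs with a batch size of 32, as described above
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```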
XGBoost_RandomForest_Regressor.py implements an ensemble model combining XGBoost and Random Forest regressors. The ensemble leverages the strengths of both models to handle the high-dimensional, sparse k-mer features together with the hydrophobicity of the protein sequences.
- K-mer Analysis: Generates k-mer features to capture local sequence patterns.
- Hydrophobicity Calculation: Computes the average hydrophobicity of sequences using the Kyte-Doolittle scale.
- Ensemble Model: Averages the predictions from XGBoost and Random Forest models to improve performance.
- Spearman Correlation: Custom function to evaluate the ensemble model's performance.
python XGBoost_RandomForest_Regressor.py
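A minimal sketch of this feature extraction and ensembling approach is shown below. The k-mer size (3), model hyperparameters, and column names are illustrative assumptions, not values taken from the script:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def mean_hydrophobicity(seq: str) -> float:
    values = [KD[aa] for aa in seq.upper() if aa in KD]
    return float(np.mean(values)) if values else 0.0

train = pd.read_csv("train.csv")  # assumed columns: sequence, tm
seqs, y = train["sequence"], train["tm"].values

# Sparse k-mer (character 3-gram) counts plus a dense hydrophobicity column.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
kmers = vectorizer.fit_transform(seqs)
hydro = csr_matrix(np.array([mean_hydrophobicity(s) for s in seqs]).reshape(-1, 1))
X = hstack([kmers, hydro]).tocsr()

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

xgb = XGBRegressor(n_estimators=300, learning_rate=0.05).fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=300, random_state=42).fit(X_tr, y_tr)

# Ensemble: simple average of the two models' predictions.
preds = (xgb.predict(X_val) + rf.predict(X_val)) / 2.0
rho, _ = spearmanr(preds, y_val)
print(f"Validation Spearman correlation: {rho:.3f}")
```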