A Machine Learning framework to estimate Protein Thermostability from Amino Acid sequences which uses Prot-BERT and Amino Acid Composition Feature Engineering to encode the sequences, and XGBoost and SVR for the regression model.

nagshivank/Protein-Thermostability-Prediction

Protein Thermostability Prediction

Overview

This repository contains machine learning models that predict protein thermostability from amino acid sequences. Thermostability is a key trait for many application-specific uses of proteins in computational biology and protein engineering.

Problem Description

Background: Thermostability is a critical feature in proteins that influences their effectiveness in various applications, including enzymatic processes and directed evolution. Accurately predicting protein thermostability poses a significant challenge due to the complex biological factors involved, such as the impact of minor amino acid substitutions on stability.

Motivation: Understanding and predicting thermostability is crucial for protein engineering, especially in scenarios where proteins need to operate under high-temperature conditions or are used as starting points for directed evolution. The ability to predict thermostability could lead to advancements in protein design, with implications for industrial and medical applications.

Dataset: The dataset was curated from a mass spectrometry-based assay that measures protein melting curves. It includes a diverse set of protein sequences that exhibit both global and local variations, providing a broad basis for understanding how sequence variations influence thermostability.

Repository Structure

This repository contains the following files:

  1. Final_Model.py
  2. MLP_Regressor.py
  3. XGBoost_RandomForest_Regressor.py
  4. README.md
  5. train.csv
  6. test.csv

1. SVR Model: Final_Model.py

Description

This script implements a Support Vector Regressor (SVR) with an RBF kernel to predict protein thermostability based on amino acid composition features. It was chosen as the final model due to its ability to handle high-dimensional data and model complex, non-linear relationships.

Features

  • Amino Acid Composition: Calculates the proportion of each amino acid in the sequence.
  • Normalization: Uses StandardScaler to normalize the amino acid composition features.
  • Cross-Validation: Uses 5-fold cross-validation to evaluate model performance.
  • Spearman Correlation: Custom function to calculate Spearman correlation as the performance metric.
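
The feature and metric steps listed above can be sketched in plain Python. This is a minimal illustration, not the code in Final_Model.py: the function names (`aa_composition`, `standardize`, `average_ranks`, `spearman`) are hypothetical, and dividing by the number of standard residues only (ignoring other characters) is an assumption. The `standardize` helper mimics the zero-mean, unit-variance scaling that StandardScaler applies.

```python
from collections import Counter
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aa_composition(sequence):
    """Return the fraction of each standard amino acid in `sequence`."""
    counts = Counter(sequence.upper())
    # Assumption: non-standard characters are ignored in the denominator.
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


def standardize(rows):
    """Z-score each column using the population standard deviation."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero spread; divide by 1 so it maps to 0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(y_true, y_pred):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(y_true), average_ranks(y_pred)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5
```

In a pipeline like the one described here, the standardized composition vectors would be the inputs to the RBF-kernel SVR, and `spearman` would be computed on the held-out predictions of each of the 5 folds.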

How to Run

python Final_Model.py

2. MLP Model: MLP_Regressor.py

Description

This script uses a Multi-Layer Perceptron (MLP) to predict protein thermostability. The model starts with embeddings generated by Prot-BERT, a pre-trained transformer model that captures deep contextual representations of protein sequences. These embeddings are used as input features for the MLP.

Features

  • Prot-BERT Embeddings: Utilizes Prot-BERT to embed the protein sequences.
  • MLP Architecture: The MLP consists of three fully connected layers with ReLU activation and dropout for regularization.
  • Batch Normalization: Applied after the first two dense layers to stabilize learning and improve generalization.
  • Training and Validation: The model is trained for 100 epochs using a batch size of 32, with performance tracked on a validation set.
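
Before sequences reach Prot-BERT they usually need light preprocessing. The sketch below follows the convention documented for the public Rostlab/prot_bert checkpoint: residues are upper-case and separated by spaces, and the rare residues U, Z, O and B are mapped to X. Whether MLP_Regressor.py performs exactly these steps is an assumption, and the helper names `to_protbert_input` and `batches` are hypothetical. The batching helper illustrates mini-batches of 32, the batch size stated above.

```python
import re


def to_protbert_input(sequence):
    """Upper-case, map rare residues U/Z/O/B to X, and space-separate residues."""
    cleaned = re.sub(r"[UZOB]", "X", sequence.strip().upper())
    return " ".join(cleaned)


def batches(items, batch_size=32):
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


if __name__ == "__main__":
    sequences = ["MKTAYIAKQR", "mkvUz"]  # made-up example sequences
    prepared = [to_protbert_input(s) for s in sequences]
    for batch in batches(prepared):
        print(batch)
```

The space-separated strings are what a Prot-BERT tokenizer would receive; the resulting embeddings then serve as the input features of the MLP.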

How to Run

python MLP_Regressor.py

3. XGBoost-RandomForest Ensemble: XGBoost_RandomForest_Regressor.py

Description

This script implements an ensemble model combining XGBoost and Random Forest regressors. The ensemble approach leverages the strengths of both models to handle the high-dimensional, sparse k-mer features and hydrophobicity of protein sequences.

Features

  • K-mer Analysis: Generates k-mer features to capture local sequence patterns.
  • Hydrophobicity Calculation: Computes the average hydrophobicity of sequences using the Kyte-Doolittle scale.
  • Ensemble Model: Averages the predictions from XGBoost and Random Forest models to improve performance.
  • Spearman Correlation: Custom function to evaluate the ensemble model's performance.
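
The feature and ensembling steps listed above can be sketched as follows. This is an illustration under stated assumptions rather than the script's own code: the value k=3, the equal-weight mean of the two models' predictions, and the function names (`kmer_counts`, `mean_hydrophobicity`, `average_predictions`) are all hypothetical choices. The hydrophobicity values are the standard Kyte-Doolittle scale.

```python
from collections import Counter

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def kmer_counts(sequence, k=3):
    """Count every overlapping substring of length k (k=3 is an assumption)."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mean_hydrophobicity(sequence):
    """Average Kyte-Doolittle score over the residues found in the scale."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    return sum(scores) / len(scores) if scores else 0.0


def average_predictions(*prediction_lists):
    """Element-wise mean of equally long prediction lists (equal weights)."""
    return [sum(values) / len(values) for values in zip(*prediction_lists)]
```

Most k-mers never occur in a given sequence, which is why the resulting count vectors are high-dimensional and sparse. `average_predictions` would be called with the XGBoost and Random Forest outputs for the same samples.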

How to Run

python XGBoost_RandomForest_Regressor.py
