This repository contains the implementation of machine learning models developed to predict the thermostability of proteins from their amino acid sequences, a key trait that underpins many application-specific functions in computational biology and protein engineering.
Background: Thermostability is a critical feature in proteins that influences their effectiveness in various applications, including enzymatic processes and directed evolution. Accurately predicting protein thermostability poses a significant challenge due to the complex biological factors involved, such as the impact of minor amino acid substitutions on stability.
Motivation: Understanding and predicting thermostability is crucial for protein engineering, especially in scenarios where proteins need to operate under high-temperature conditions or are used as starting points for directed evolution. The ability to predict thermostability could lead to advancements in protein design, with implications for industrial and medical applications.
Dataset: The dataset was curated from a mass spectrometry-based assay measuring protein melting curves. It includes a diverse set of protein sequences exhibiting both global and local variations, providing a comprehensive basis for understanding how sequence variation influences thermostability.
This repository contains the following files:
- Final_Model.py
- MLP_Regressor.py
- XGBoost_RandomForest_Regressor.py
- README.md
- train.csv
- test.csv
Final_Model.py implements a Support Vector Regressor (SVR) with an RBF kernel to predict protein thermostability from amino acid composition features. It was chosen as the final model for its ability to handle high-dimensional data and model complex, non-linear relationships.
- Amino Acid Composition: Calculates the proportion of each amino acid in the sequence.
- Normalization: Uses `StandardScaler` to normalize the amino acid composition features.
- Cross-Validation: Uses 5-fold cross-validation to evaluate model performance.
- Spearman Correlation: Custom function to calculate Spearman correlation as the performance metric.
python Final_Model.py
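Below is a minimal sketch of the pipeline described above. The column names (`sequence`, `tm`) are assumptions rather than values taken from Final_Model.py, and `scipy.stats.spearmanr` is used here in place of the script's custom Spearman function:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_composition(seq: str) -> np.ndarray:
    """Proportion of each of the 20 standard amino acids in the sequence."""
    seq = seq.upper()
    counts = np.array([seq.count(aa) for aa in AMINO_ACIDS], dtype=float)
    return counts / max(len(seq), 1)

train = pd.read_csv("train.csv")  # assumed columns: sequence, tm
X = np.vstack([aa_composition(s) for s in train["sequence"]])
y = train["tm"].values

scores = []
for tr_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=42).split(X):
    scaler = StandardScaler().fit(X[tr_idx])              # normalize composition features
    svr = SVR(kernel="rbf").fit(scaler.transform(X[tr_idx]), y[tr_idx])
    preds = svr.predict(scaler.transform(X[val_idx]))
    rho, _ = spearmanr(preds, y[val_idx])                  # rank correlation as the metric
    scores.append(rho)

print(f"Mean 5-fold Spearman correlation: {np.mean(scores):.3f}")
```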
MLP_Regressor.py uses a Multi-Layer Perceptron (MLP) to predict protein thermostability. The model starts from embeddings generated by Prot-BERT, a pre-trained transformer model that captures deep contextual representations of protein sequences. These embeddings are used as the input features for the MLP.
- Prot-BERT Embeddings: Utilizes Prot-BERT to embed the protein sequences.
- MLP Architecture: The MLP consists of three fully connected layers with ReLU activation and dropout for regularization.
- Batch Normalization: Applied after the first two dense layers to stabilize learning and improve generalization.
- Training and Validation: The model is trained for 100 epochs using a batch size of 32, with performance tracked on a validation set.
python MLP_Regressor.py
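The following is a minimal, illustrative sketch of such an MLP head in PyTorch. The layer sizes, dropout rate, optimizer, and learning rate are assumptions rather than the exact values in MLP_Regressor.py, and random tensors stand in for precomputed Prot-BERT embeddings (Rostlab/prot_bert produces 1024-dimensional representations):

```python
import torch
import torch.nn as nn

EMBED_DIM = 1024  # dimensionality of pooled Prot-BERT embeddings

class MLPRegressor(nn.Module):
    def __init__(self, in_dim=EMBED_DIM, hidden=(512, 128), dropout=0.3):
        super().__init__()
        # Three fully connected layers; BatchNorm after the first two, ReLU + dropout for regularization.
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden[0]), nn.BatchNorm1d(hidden[0]), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden[0], hidden[1]), nn.BatchNorm1d(hidden[1]), nn.ReLU(), nn.Dropout(dropout),
            nn.Linear(hidden[1], 1),  # final layer outputs a scalar thermostability prediction
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

# Placeholder tensors standing in for precomputed Prot-BERT embeddings and targets.
X = torch.randn(256, EMBED_DIM)
y = torch.randn(256)

model = MLPRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X, y), batch_size=32, shuffle=True
)

for epoch in range(100):  # 100 epochs with a batch size of 32, as described above
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
```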
XGBoost_RandomForest_Regressor.py implements an ensemble model combining XGBoost and Random Forest regressors. The ensemble leverages the strengths of both models to handle the high-dimensional, sparse k-mer features together with the hydrophobicity of the protein sequences.
- K-mer Analysis: Generates k-mer features to capture local sequence patterns.
- Hydrophobicity Calculation: Computes the average hydrophobicity of sequences using the Kyte-Doolittle scale.
- Ensemble Model: Averages the predictions from XGBoost and Random Forest models to improve performance.
- Spearman Correlation: Custom function to evaluate the ensemble model's performance.
python XGBoost_RandomForest_Regressor.py
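A minimal sketch of this feature extraction and ensembling approach is shown below. The k-mer size (3), model hyperparameters, and column names are illustrative assumptions, not values taken from the script:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def mean_hydrophobicity(seq: str) -> float:
    values = [KD[aa] for aa in seq.upper() if aa in KD]
    return float(np.mean(values)) if values else 0.0

train = pd.read_csv("train.csv")  # assumed columns: sequence, tm
seqs, y = train["sequence"], train["tm"].values

# Sparse k-mer (character 3-gram) counts plus a dense hydrophobicity column.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
kmers = vectorizer.fit_transform(seqs)
hydro = csr_matrix(np.array([mean_hydrophobicity(s) for s in seqs]).reshape(-1, 1))
X = hstack([kmers, hydro]).tocsr()

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

xgb = XGBRegressor(n_estimators=300, learning_rate=0.05).fit(X_tr, y_tr)
rf = RandomForestRegressor(n_estimators=300, random_state=42).fit(X_tr, y_tr)

# Ensemble: simple average of the two models' predictions.
preds = (xgb.predict(X_val) + rf.predict(X_val)) / 2.0
rho, _ = spearmanr(preds, y_val)
print(f"Validation Spearman correlation: {rho:.3f}")
```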