Protein-Sequence-Classification

Protein Sequence Classification using Machine Learning

This repository contains code for a project that reads protein sequences, performs pairwise alignment, extracts various features, and applies multiple machine learning classifiers to predict protein function. The project utilizes a variety of classifiers, including Random Forest, Support Vector Classifier, Gradient Boosting, and AdaBoost, and aggregates their predictions using a Voting Classifier for improved accuracy.

Installation

Clone the repository:

git clone https://github.com/ade-wagimon/protein-sequence-classification.git
cd protein-sequence-classification

Install the required packages:
```
pip install -r requirements.txt
```

Usage

Ensure your protein sequences are in FASTA format. Update the file paths in the code:

enzyme1_fasta = "CrtB.fasta"
enzyme2_fasta = "crtM.fasta"
alignment_file = "Align.aln"

Run the script to perform sequence alignment, feature extraction, and model training:
```
python main.py
```

Features

Sequence Reading: Reads protein sequences from FASTA files.
Pairwise Alignment: Uses Clustal Omega to perform pairwise alignment of the sequences.
Feature Extraction: Extracts features such as identity, gaps, conservation scores, hydrophobicity, molecular weight, and amino acid composition.
Synthetic Data Generation: Generates synthetic examples for model training.
Missing Value Handling: Uses SimpleImputer to handle any missing values in the dataset.

Machine Learning Models

The script implements the following classifiers:

Random Forest
Support Vector Classifier
Gradient Boosting
AdaBoost

Each classifier undergoes hyperparameter tuning using GridSearchCV. The best models from each classifier are then combined into a Voting Classifier.

Evaluation

The performance of the models is evaluated using the following metrics:

Accuracy
Precision
Recall
F1 Score

Additionally, a confusion matrix is plotted to visualize the performance of the Voting Classifier.

Example Output

Random Forest:
Best parameters found by grid search: {'max_depth': 8, 'n_estimators': 200}
Best cross-validation accuracy: 0.75

Support Vector Classifier:
Best parameters found by grid search: {'C': 1, 'gamma': 0.1}
Best cross-validation accuracy: 0.78

Gradient Boosting Classifier:
Best parameters found by grid search: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200}
Best cross-validation accuracy: 0.76

AdaBoost Classifier:
Best parameters found by grid search: {'learning_rate': 1, 'n_estimators': 100}
Best cross-validation accuracy: 0.74

Overall Model Performance:
Accuracy: 0.80
Precision: 0.81
Recall: 0.80
F1 Score: 0.80

Classification Report:
               precision    recall  f1-score   support

    Function1       0.80      0.81      0.80        20
    Function2       0.81      0.80      0.80        20

    accuracy                           0.80        40
   macro avg       0.80      0.80      0.80        40
weighted avg       0.80      0.80      0.80        40

Confusion Matrix:

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any enhancements or bug fixes.

License

This project is licensed under the MIT License. See the LICENSE file for details.

Bonus

  +--------------------------+
  | Read Protein Sequences   |
  | (CrtB.fasta, crtM.fasta) |
  +--------------------------+
               |
               v
  +----------------------------+
  | Perform Pairwise Alignment |
  | (Clustal Omega)            |
  |      Align.aln             |
  +----------------------------+
               |
               v
  +--------------------------+
  | Assign Labels based on   |
  | Sequence Names           |
  | (Function1 or Function2) |
  +--------------------------+
               |
               v
  +------------------------------------+
  | Feature Encoding                   |
  | - One-hot Encoding for Amino Acids |
  | - Encode Additional Features       |
  |   (Hydrophobicity, MW, etc.)       |
  +------------------------------------+
               |
               v
  +---------------------------+
  | Split Data into           |
  | Training and Testing Sets |
  | (train_test_split)        |
  +---------------------------+
               |
               v
  +---------------------------+
  | Scale Data (MinMaxScaler) |
  +---------------------------+
               |
               v
  +-----------------------------------------+
  | Build CNN Model                         |
  | - Conv2D -> MaxPooling2D                |
  | - Conv2D -> MaxPooling2D                |
  | - Flatten -> Dense -> Dropout -> Output |
  +-----------------------------------------+
               |
               v
  +-----------------------------+
  | Compile Model               |
  | (Adam, binary_crossentropy) |
  +-----------------------------+
               |
               v
  +----------------------------+
  | Train Model                |
  | (epochs=10, batch_size=16, |
  | validation_split=0.2)      |
  +----------------------------+
               |
               v
  +------------------------+
  | Evaluate Model         |
  | on Test Set (Accuracy) |
  +------------------------+
               |
               v
  +-------------------------+
  | Generate Predictions    |
  | - Thresholding          |
  | - Compute Metrics       |
  | - Classification Report |
  | - Confusion Matrix      |
  +-------------------------+
               |
               v
  +------------------------------+
  | Visualize Model              |
  | - plot_model                 |
  | - Confusion Matrix (heatmap) |
  | - Training History (plots)   |
  +------------------------------+

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
README.md		README.md
Version_2(gpt).py		Version_2(gpt).py
main.py		main.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Protein-Sequence-Classification