There are three Jupyter notebooks:
- Data_Exploration_Preparation
- ML_Pipeline
- Run_Test_Set
Table of Contents
- Import and inspect dataset
- Protein Embedding
  - Embed UniProt IDs using ProtT5 model (prottrans_t5_xl_u50)
- Molecule Embedding
  - Get SMILES representation from PubChem
  - Convert SMILES to Embedding
    - RDKit for Morgan Fingerprints
    - Alternative option, using BERT model (not used in current solution)
- Preparing data for Machine Learning Pipeline
  - Normalizing and saving embedded data
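The "Normalizing and saving embedded data" step can be sketched roughly as follows. This is a minimal, standard-library-only illustration of z-score standardization, not the notebook's actual code: the helper names, the toy vectors, and the output filename `normalized_embeddings.json` are all hypothetical. It assumes the statistics are fitted on the training rows only and then reused for any other rows.

```python
import json
import math
import os
import tempfile


def fit_standardizer(rows):
    """Return per-column means and standard deviations for equal-length rows."""
    n = len(rows)
    n_cols = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(n_cols)]
    stds = []
    for j in range(n_cols):
        var = sum((row[j] - means[j]) ** 2 for row in rows) / n
        # Constant columns get std 1.0 so they map to 0 instead of dividing by zero.
        stds.append(math.sqrt(var) or 1.0)
    return means, stds


def standardize(rows, means, stds):
    """Apply (x - mean) / std column by column."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


# Toy stand-ins for concatenated protein + molecule embedding vectors (hypothetical).
train_rows = [[1.0, 10.0, 5.0], [2.0, 20.0, 5.0], [3.0, 30.0, 5.0]]
test_rows = [[4.0, 40.0, 5.0]]

means, stds = fit_standardizer(train_rows)      # statistics from training rows only
normalized = standardize(train_rows, means, stds)
normalized_test = standardize(test_rows, means, stds)

# Save the scaled data together with the statistics needed to scale new data later.
out_path = os.path.join(tempfile.gettempdir(), "normalized_embeddings.json")
with open(out_path, "w") as fh:
    json.dump({"means": means, "stds": stds, "data": normalized}, fh)
```

Storing the means and standard deviations next to the data means a later test set can be scaled with exactly the same transformation the model was trained on.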
The LightGBM (LGBM) model is trained on 80% of the data and performs as follows on the held-out 20%:
- Best number of estimators: 215
- Mean Squared Error: 391223160770.29596
- R² Score: 0.1777882974932672
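To make the two reported metrics easier to interpret, the sketch below writes out their standard definitions and derives two quantities from the reported numbers. It assumes the notebook uses the usual definitions (the ones implemented by scikit-learn's `mean_squared_error` and `r2_score`); the function names `mse` and `r_squared` are illustrative only.

```python
import math


def mse(y_true, y_pred):
    """Mean of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares) / (total sum of squares around the mean)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Values reported for the held-out 20%.
reported_mse = 391223160770.29596
reported_r2 = 0.1777882974932672

# RMSE is in the same units as the target, which is easier to read than MSE.
rmse = math.sqrt(reported_mse)

# Under the definitions above, R^2 = 1 - MSE / Var(y), so Var(y) = MSE / (1 - R^2).
implied_target_variance = reported_mse / (1.0 - reported_r2)
implied_target_std = math.sqrt(implied_target_variance)

print(f"RMSE: {rmse:.1f}")
print(f"Implied std of held-out targets: {implied_target_std:.1f}")
```

Under these assumptions the RMSE comes out at roughly 6.25 × 10^5 in target units, against an implied target standard deviation of roughly 6.9 × 10^5, which is another way of seeing that the model does only modestly better than predicting the mean.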
A new test set can be loaded and predictions can be made using the trained model.
An R² score of about 0.178 is low: the model explains roughly 18% of the variance in the held-out targets, so it captures some signal but leaves most of it unexplained. Further optimization and exploration are needed to improve performance.
Cross-validation should be run to check that this result holds across different splits rather than depending on a single 80/20 partition. Additional analysis can determine whether using estimated versus non-estimated KIBA scores affects the model's performance.
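A minimal sketch of what k-fold cross-validation involves is shown below. It uses only the standard library so that the splitting logic is visible; in practice a library splitter such as scikit-learn's `KFold` would be the more likely choice. The helper names, the mean-predictor baseline, and the toy data are all hypothetical stand-ins for the real model and embeddings.

```python
import random


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs that partition range(n_samples) into folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        start += size
        yield train_idx, val_idx


def cross_validate(fit, score, X, y, n_splits=5, seed=0):
    """Fit on each training fold, score on the matching validation fold."""
    scores = []
    for train_idx, val_idx in kfold_indices(len(X), n_splits, seed):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
    return scores


# Hypothetical baseline: "model" is just the mean of the training targets.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)


def score_mse(model, X_val, y_val):
    return sum((t - model) ** 2 for t in y_val) / len(y_val)


X = [[float(i)] for i in range(10)]
y = [2.0 * i for i in range(10)]
fold_scores = cross_validate(fit_mean, score_mse, X, y, n_splits=5)
mean_score = sum(fold_scores) / len(fold_scores)
print(f"per-fold MSE: {fold_scores}, mean: {mean_score:.2f}")
```

Reporting the spread of the per-fold scores alongside their mean gives a sense of how much the single-split R² above could move from one partition to another.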
Feature Engineering
- Incorporate additional features such as Kd, Ki, and IC50 scores.
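One possible way to turn Kd, Ki, and IC50 measurements into model features is to put them on the common negative-log ("p") scale, where pKd = -log10(Kd in mol/L). The sketch below is an assumption-heavy illustration: it assumes the raw values are given in nanomolar, that missing measurements are encoded as NaN (which LightGBM can handle as missing values), and that the record keys `"Kd"`, `"Ki"`, and `"IC50"` exist. None of these details come from the notebooks.

```python
import math

AFFINITY_KEYS = ("Kd", "Ki", "IC50")  # hypothetical column names


def to_p_scale(value_nm):
    """Convert a concentration in nM to the -log10(molar) scale: 9 - log10(nM)."""
    return 9.0 - math.log10(value_nm)


def affinity_features(record):
    """Return [pKd, pKi, pIC50] for one drug-target pair, NaN where unmeasured."""
    features = []
    for key in AFFINITY_KEYS:
        value = record.get(key)
        if value is None or value <= 0:
            features.append(float("nan"))  # missing or unusable measurement
        else:
            features.append(to_p_scale(value))
    return features


# Toy record (hypothetical values): Kd = 1 nM, IC50 = 1000 nM, no Ki measured.
example = {"Kd": 1.0, "IC50": 1000.0}
print(affinity_features(example))
```

The log transform compresses values that span many orders of magnitude, which usually makes them easier for a regression model to use than raw concentrations.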
Enhanced Embeddings
- Experiment with more informative embeddings, such as:
  - PubChem10M_SMILES_BERT
  - DeepChem
  - AlphaFold embeddings
  - ProtTrans
Implement deep learning techniques, including:
- Graph Neural Networks (GNNs) for representing molecular and protein structures.
- End-to-end models for drug-target binding prediction, such as DeepAffinity.
Together, these strategies are likely to improve the model's performance and make its predictions more robust.