protein-molecule-binding-challenge

This repository contains three Jupyter notebooks:

- Data_Exploration_Preparation
- ML_Pipeline
- Run_Test_Set

Data_Exploration_Preparation.ipynb

Table of Contents

Import and inspect dataset
Protein Embedding
- Embed Uniprot IDs using ProtT5 model (prottrans_t5_xl_u50)
Molecule Embedding
- Get SMILES representation from PubChem
- Convert SMILES to Embedding
  - RDKit for Morgan Fingerprints
  - Alternative option, using BERT model (not used in current solution)
Preparing data for Machine Learning Pipeline
- Normalizing and saving embedded data
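
The normalization step in the outline above could look roughly like the sketch below. It is not taken from the notebook: the random arrays are stand-ins for the ProtT5 vectors and Morgan fingerprint bits, concatenating the two blocks is an assumption, and `fit_standardizer` and `standardize` are hypothetical helper names. The point it illustrates is that scaling statistics should come from the training rows only.

```python
import numpy as np

def fit_standardizer(train_features):
    """Compute per-column mean and std from the training rows only."""
    mean = train_features.mean(axis=0)
    std = train_features.std(axis=0)
    std[std == 0] = 1.0  # constant columns: avoid division by zero
    return mean, std

def standardize(features, mean, std):
    """Apply z-score scaling with previously fitted statistics."""
    return (features - mean) / std

rng = np.random.default_rng(0)
# Stand-in data (assumption): 6 protein-molecule pairs with tiny feature sizes.
protein_emb = rng.normal(loc=0.0, scale=0.2, size=(6, 8))
molecule_fp = rng.integers(0, 2, size=(6, 16)).astype(float)

# One feature row per pair: protein block followed by molecule block.
features = np.hstack([protein_emb, molecule_fp])

# The first 4 rows play the role of the training split.
mean, std = fit_standardizer(features[:4])
train_scaled = standardize(features[:4], mean, std)
test_scaled = standardize(features[4:], mean, std)
```

The scaled arrays could then be written to disk with `np.save` for the ML notebook to load.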

ML_Pipeline.ipynb

A LightGBM (LGBM) model is trained on 80% of the data and evaluated on the held-out 20%, with the following results:

Best number of estimators: 215
Mean Squared Error: 391223160770.29596
R^2 Score: 0.1777882974932672
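
The two scores are linked through R² = 1 − MSE / Var(y), where Var(y) is the population variance of the held-out targets. As a quick sanity check, the sketch below derives the RMSE and the implied spread of the targets from the reported numbers alone. The variable names are illustrative and nothing here is read from the notebook.

```python
import math

# Values reported above for the held-out 20%.
reported_mse = 391223160770.29596
reported_r2 = 0.1777882974932672

# RMSE is in the same units as the target, which makes it easier to read.
rmse = math.sqrt(reported_mse)

# R^2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R^2)
implied_target_variance = reported_mse / (1.0 - reported_r2)
implied_target_std = math.sqrt(implied_target_variance)

print(f"RMSE: {rmse:,.0f}")
print(f"Implied std of held-out targets: {implied_target_std:,.0f}")
print(f"RMSE / std: {rmse / implied_target_std:.3f}")
```

The ratio in the last line equals sqrt(1 − R²), so it shows how much smaller the model's error is than that of always predicting the mean.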

Run_Test_Set.ipynb

This notebook loads a new test set and generates predictions with the trained model.

Conclusion and Remarks:

An R² score of about 0.178 is low: the model explains less than a fifth of the variance in the held-out targets, so it captures some signal but leaves most of it unexplained. Further optimization and exploration are necessary to improve performance.

Notes

Cross-validation should be used to check that this performance generalizes beyond a single 80/20 split. Additional analysis can determine whether training on estimated versus non-estimated KIBA scores affects the model's performance.
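
A minimal k-fold sketch of the cross-validation idea is shown below. To stay dependency-free it uses an ordinary least-squares model on synthetic data as a stand-in for the LightGBM regressor and the real features. All function names (`k_fold_indices`, `cross_validate`, `fit_ols`, `predict_ols`) are hypothetical, and in practice a library splitter such as scikit-learn's `KFold` would do the same job.

```python
import numpy as np

def k_fold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, val_idx) pairs covering every sample exactly once."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_splits)
    for i in range(n_splits):
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        yield train_idx, folds[i]

def r2_score(y_true, y_pred):
    """Coefficient of determination on one validation fold."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def cross_validate(X, y, fit, predict, n_splits=5, seed=0):
    """Fit on each training fold and score on the matching validation fold."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), n_splits, seed):
        model = fit(X[train_idx], y[train_idx])
        scores.append(r2_score(y[val_idx], predict(model, X[val_idx])))
    return np.array(scores)

# Stand-in model (assumption): least squares with an intercept column.
def fit_ols(X, y):
    design = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict_ols(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef

# Synthetic data, only to make the sketch runnable.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

scores = cross_validate(X, y, fit_ols, predict_ols)
print(f"R^2 per fold: {np.round(scores, 3)}")
print(f"Mean R^2: {scores.mean():.3f}, std: {scores.std():.3f}")
```

Reporting the mean and spread across folds gives a better picture of generalization than a single held-out score.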

Potential Next Steps to Improve Performance

Feature Engineering

Incorporate additional binding-affinity measurements such as Kd, Ki, and IC50 values.
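
Kd, Ki, and IC50 values typically span several orders of magnitude, so they are commonly converted to a negative log10 molar scale (pKd, pKi, pIC50) before use. The sketch below assumes the values are given in nanomolar, which the source does not state; `to_p_scale` is a hypothetical helper.

```python
import math

def to_p_scale(value_nm):
    """Convert an affinity value in nanomolar to -log10 of its molar value.

    -log10(value_nm * 1e-9) simplifies to 9 - log10(value_nm).
    """
    if value_nm <= 0:
        raise ValueError("affinity values must be positive")
    return 9.0 - math.log10(value_nm)

# Example values in nM (illustrative, not from the dataset).
example_values_nm = [1.0, 1000.0, 25000.0]
p_values = [to_p_scale(v) for v in example_values_nm]
print(p_values)
```

On this scale a higher number means tighter binding, and a tenfold change in concentration shifts the value by exactly one unit.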

Enhanced Embeddings

Experiment with more informative embeddings, such as:
- PubChem10M_SMILES_BERT
- DeepChem
- AlphaFold embeddings
- ProtTrans

Explore Advanced Models

Implement deep learning techniques, including:

- Graph Neural Networks (GNNs) for representing molecular and protein structures.
- End-to-end models for drug-target binding prediction, such as DeepAffinity.
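
To make the GNN idea concrete, the sketch below applies one graph-convolution step, ReLU(Â H W) with a symmetrically normalized adjacency matrix Â, to a hand-built three-atom graph (the heavy atoms of ethanol). The weights are random rather than trained and `gcn_layer` is a hypothetical name. A real model would build graphs from SMILES with a chemistry toolkit and learn the weights end to end.

```python
import numpy as np

# Heavy atoms of ethanol as a graph: C0 - C1 - O2
adjacency = np.array([
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
], dtype=float)

# One-hot atom features: [is_carbon, is_oxygen]
features = np.array([
    [1, 0],
    [1, 0],
    [0, 1],
], dtype=float)

def gcn_layer(adjacency, features, weights):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    d_inv_sqrt = np.diag(a_hat.sum(axis=1) ** -0.5)
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt
    return np.maximum(a_norm @ features @ weights, 0.0)

rng = np.random.default_rng(0)
weights = rng.normal(size=(2, 4))  # untrained weights, for illustration only

# Each atom's vector now mixes in information from its bonded neighbours.
node_embeddings = gcn_layer(adjacency, features, weights)

# Mean pooling turns per-atom vectors into one fixed-size molecule vector.
graph_embedding = node_embeddings.mean(axis=0)
```

The pooled vector plays the same role as the Morgan fingerprint in the current pipeline, except that it can be trained jointly with the affinity predictor.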

These strategies are likely to improve the model's performance and make its predictions more robust.
