Genetic algorithm optimisation of Series 4 potency #32

Open
jhjensen2 opened this issue Mar 19, 2021 · 1 comment

jhjensen2 commented Mar 19, 2021

The goal is to use my graph-based genetic algorithm (GA) to maximise pIC50 values predicted using machine learning (ML). Here are some preliminary results.

ML model 1 (ML1)

  • Random Forest (RF) model using 1024-bit ECFP4 fingerprints. Training was done with PyCaret.
  • Trained on all Series 4 pIC50 values in the training set (except 3 B-containing compounds for which the SMILES strings didn’t resolve).
  • The model has a precision of 91% on the holdout set, excluding the B-containing compound but including the 4 new experimentally validated molecules from the competition.
  • The model is available here
  • The ML models cannot predict pIC50 values that are significantly larger than the largest values predicted for the training set, i.e. around 7. For new molecules, predicted pIC50 values close to 7 are thus potentially lower bounds.
  • I also made an ML model (ML1_65) where I removed molecules with pIC50 > 6.5 from the training set.
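
The cap on predicted values follows from how tree ensembles make predictions: every leaf returns an average of training targets, and the ensemble averages the leaves, so no prediction can exceed the largest training target. The toy below is a sketch of that effect using bagged one-split regression trees on made-up 1-D data. It is not ML1, and `fit_stump`, `fit_bagged_stumps`, `predict`, and the data are hypothetical names and values chosen only for illustration.

```python
import random

def fit_stump(xs, ys):
    """Fit a one-split regression tree: returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # drop the largest x so both sides stay non-empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # every x in this bootstrap sample is identical
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]

def fit_bagged_stumps(xs, ys, n_trees=50, seed=0):
    """Train each stump on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return stumps

def predict(stumps, x):
    """Average the leaf means that x falls into across all stumps."""
    preds = [lm if x <= t else rm for t, lm, rm in stumps]
    return sum(preds) / len(preds)

# Made-up training data: the target rises with x but never exceeds 7.0.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train_y = [5.0, 5.4, 5.9, 6.3, 6.7, 7.0]

model = fit_bagged_stumps(train_x, train_y)
far_out = predict(model, 100.0)  # far outside the training range
print(far_out, "<=", max(train_y))
```

Even for an input far beyond the training range, the prediction stays at or below the largest training target, which is why a predicted value near the cap may understate the true value.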

GA searches for molecules with large pIC50 values

  • The starting population is drawn from the training set for ML1_65. The population size is 50 and the search is run for 150 generations.
  • A GA search using ML1_65 to predict pIC50 values finds the top molecules shown below together with their predicted pIC50 values.
  • Note that two of the molecules are already known, i.e. rediscovered by the search, and have measured pIC50 values that are significantly larger than those predicted by the ML1_65 model.
  • Thus, the GA/ML approach can be used to identify potentially high scoring molecules.

[Image: Screenshot 2021-03-19 at 15 29 51]

  • A GA search using ML1 to predict pIC50 values finds the top 10 molecules shown below together with their predicted pIC50 values.

[Image: Screenshot 2021-03-19 at 15 31 51]
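
The search settings in this section (population of 50, 150 generations, ML-predicted score as the fitness) map onto a generic GA loop like the sketch below. It is not the graph-based GA used here: candidates are bit strings rather than molecular graphs, and `toy_score` is a hypothetical stand-in for the ML pIC50 predictor. Only the population size and generation count come from this post; the genome length, crossover, mutation rate, and selection scheme are assumptions.

```python
import random

POP_SIZE = 50         # from the post
GENERATIONS = 150     # from the post
GENOME_LEN = 32       # assumption: toy bit-string length
MUTATION_RATE = 0.05  # assumption

def toy_score(genome):
    """Hypothetical stand-in for the ML-predicted pIC50: counts set bits."""
    return sum(genome)

def crossover(a, b, rng):
    """Single-point crossover of two parents."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(genome, rng):
    """Flip each bit independently with probability MUTATION_RATE."""
    return [1 - g if rng.random() < MUTATION_RATE else g for g in genome]

def run_ga(seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
    for _ in range(GENERATIONS):
        # Score-proportional parent selection; the small offset keeps weights positive.
        weights = [toy_score(g) + 1e-9 for g in pop]
        children = []
        for _ in range(POP_SIZE):
            a, b = rng.choices(pop, weights=weights, k=2)
            children.append(mutate(crossover(a, b, rng), rng))
        # Elitism: keep the POP_SIZE best of parents plus children.
        pop = sorted(pop + children, key=toy_score, reverse=True)[:POP_SIZE]
    return pop

final_pop = run_ga(seed=0)
best = final_pop[0]
print(toy_score(best))
```

Because the loop draws on a random number generator, changing `seed` changes the path the search takes, so repeated runs will generally not end with the same population.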

How to select the best molecules, if any, for further study?

  • Molecules with predicted pIC50 values greater than ca. 6.7 should not be eliminated based on the pIC50s.
  • The GA search is stochastic, so each new search tends to contribute more new molecules. There are therefore potentially hundreds of such molecules.
  • Should I try to filter them further? For example, by predicted solubility, synthetic accessibility, and/or RML. All these models have errors associated with them, so filtering may weed out potentially useful candidates.
  • Or is it better to post them all and have experts decide? Also, each molecule could potentially be tweaked a bit based on expert knowledge without impacting the predicted pIC50s significantly.
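
One way to filter without discarding borderline candidates is to give each threshold a tolerance roughly the size of the model's error, and to reject a molecule only when it misses the threshold by more than that tolerance. The sketch below pools several GA runs, removes duplicates, and applies such rules. The property names, thresholds, tolerances, and placeholder molecule IDs are all hypothetical; only the ca. 6.7 pIC50 cutoff comes from the list above. It also assumes the SMILES strings are already canonical, so that duplicates match exactly.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    smiles: str        # assumed canonical; placeholder IDs are used below
    pred_pic50: float  # ML-predicted pIC50
    pred_logs: float   # predicted log solubility (hypothetical model)
    sa_score: float    # synthetic accessibility, lower = easier (hypothetical model)

# name -> (threshold, tolerance, higher_is_better); all values except 6.7 are assumptions.
RULES = {
    "pred_pic50": (6.7, 0.0, True),
    "pred_logs": (-5.0, 0.5, True),
    "sa_score": (4.0, 0.5, False),
}

def passes(c, rules=RULES):
    """Reject only when a property misses its threshold by more than the tolerance."""
    for name, (threshold, tol, higher_is_better) in rules.items():
        value = getattr(c, name)
        if higher_is_better and value < threshold - tol:
            return False
        if not higher_is_better and value > threshold + tol:
            return False
    return True

def pool_and_filter(runs):
    """Merge several GA runs, drop duplicate SMILES, keep those passing all rules."""
    seen, kept = set(), []
    for run in runs:
        for c in run:
            if c.smiles not in seen:
                seen.add(c.smiles)
                if passes(c):
                    kept.append(c)
    return kept

# Made-up candidates from two runs; "mol_A" etc. are placeholders, not real SMILES.
runs = [
    [Candidate("mol_A", 6.9, -4.8, 3.2), Candidate("mol_B", 6.8, -5.3, 4.4)],
    [Candidate("mol_A", 6.9, -4.8, 3.2), Candidate("mol_C", 6.9, -6.2, 3.0)],
]
kept = pool_and_filter(runs)
print([c.smiles for c in kept])
```

Here `mol_B` misses the nominal solubility and accessibility thresholds but stays within the tolerances, so it is kept for expert review, while `mol_C` falls well outside the solubility tolerance and is dropped.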
@edwintse
Collaborator

Hi @jhjensen2, based on a gut feeling about the compounds above, I suspect that adding another phenyl ring to either side of the molecule will end up decreasing the potency. It would be great if you could add filters for solubility and synthetic accessibility to see what kind of compounds you get after that.
