Genetic algorithm optimisation of Series 4 potency #32

jhjensen2 · 2021-03-19T14:41:25Z

The goal is to use my graph-based genetic algorithm (GA) to maximise pIC50 values predicted using machine learning (ML). Here are some preliminary results.

ML model 1 (ML1)

Random Forest (RF) model using 1024-bit ECFP4 fingerprints. Training was done with PyCaret.
Trained on all Series 4 pIC50 values in the training set (except 3 B-containing compounds for which the SMILES strings didn’t resolve).
The model has a precision of 91% on the holdout set, excluding the B-containing compound but including the 4 new experimentally validated molecules from the competition.
The model is available here
The ML models cannot predict pIC50 values that are significantly larger than the largest values predicted for the training set, i.e. around 7. For new molecules, predicted pIC50 values close to 7 are thus potentially lower bounds.
I also made an ML model (ML1_65) where I removed molecules with pIC50 > 6.5 from the training set.
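
The ceiling on predicted values follows from how tree ensembles work: each leaf predicts an average of training targets, so the ensemble average cannot exceed the largest training target. Here is a minimal sketch of that effect. It is not the PyCaret model from this post: it uses a tiny hand-written forest of one-split trees, a single made-up numerical descriptor instead of ECFP4 fingerprints, and synthetic targets. All function names are hypothetical.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree; each leaf predicts the mean of its targets."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # candidate split points
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        sse = sum((y - mean_left) ** 2 for y in left)
        sse += sum((y - mean_right) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, mean_left, mean_right)
    if best is None:  # bootstrap sample contained a single distinct x value
        mean = sum(ys) / len(ys)
        return (xs[0], mean, mean)
    return best[1:]


def fit_forest(xs, ys, n_trees=100, seed=0):
    """Fit each stump on a bootstrap resample of the training data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(xs)) for _ in xs]
        forest.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return forest


def predict(forest, x):
    """Average the leaf values of all stumps for input x."""
    leaves = [ml if x <= t else mr for t, ml, mr in forest]
    return sum(leaves) / len(leaves)


# Synthetic toy data (an assumption for illustration): the target rises
# linearly with the descriptor and tops out at 7.0 within the training range.
train_x = [float(i) for i in range(11)]
train_y = [4.0 + 0.3 * x for x in train_x]
forest = fit_forest(train_x, train_y)

# Far outside the training range the linear trend would give 4.0 + 0.3 * 50 = 19,
# but the forest can only return an average of training targets.
far_prediction = predict(forest, 50.0)
print(far_prediction, "<=", max(train_y))
```

The same argument is why a predicted pIC50 near the top of the training range can be read as a possible lower bound rather than a point estimate.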

GA searches for molecules with large pIC50 values

The starting population is drawn from the training set for ML1_65. The population size is 50 and the search is run for 150 generations.
A GA search using ML1_65 to predict pIC50 values finds the top molecules shown below together with their predicted pIC50 values.
Note that two of the molecules are already known, i.e. rediscovered by the search, and their known pIC50 values are significantly larger than those predicted by the ML1_65 model.
Thus, the GA/ML approach can be used to identify potentially high scoring molecules.
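
For readers unfamiliar with the search loop, here is a minimal sketch of a generic elitist GA using the population size (50) and generation count (150) quoted above. It is not my graph-based implementation: genomes are plain bit lists rather than molecular graphs, and `score` is a hypothetical stand-in for the ML-predicted pIC50. The genome length, mutation rate, and selection scheme are all assumptions.

```python
import random

POPULATION_SIZE = 50  # from the post
GENERATIONS = 150     # from the post
GENOME_LENGTH = 32    # assumption, toy genome only


def score(genome):
    """Hypothetical stand-in for a predicted pIC50: fraction of set bits."""
    return sum(genome) / len(genome)


def crossover(parent_a, parent_b, rng):
    """Single-point crossover between two parents."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]


def mutate(genome, rng, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in genome]


def run_ga(seed=0):
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(GENOME_LENGTH)]
        for _ in range(POPULATION_SIZE)
    ]
    best_per_generation = []
    for _ in range(GENERATIONS):
        # Score-proportional parent selection; the small offset avoids
        # an all-zero weight vector.
        weights = [score(g) + 1e-9 for g in population]
        children = []
        for _ in range(POPULATION_SIZE):
            parent_a, parent_b = rng.choices(population, weights=weights, k=2)
            children.append(mutate(crossover(parent_a, parent_b, rng), rng))
        # Elitism: keep the best individuals from parents plus children.
        merged = sorted(population + children, key=score, reverse=True)
        population = merged[:POPULATION_SIZE]
        best_per_generation.append(score(population[0]))
    return population, best_per_generation


if __name__ == "__main__":
    final_population, history = run_ga(seed=0)
    print("best score after", GENERATIONS, "generations:", history[-1])
```

Because the survivors are picked from parents and children together, the best score never decreases between generations. Changing `seed` gives a different trajectory, which is the stochastic behaviour that makes repeated searches yield different molecules.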

A GA search using ML1 to predict pIC50 values finds the top 10 molecules shown below together with their predicted pIC50 values.

How to select the best molecules, if any, for further study?

Molecules with predicted pIC50 values greater than ca. 6.7 should not be eliminated based on their predicted pIC50s.
The GA search is stochastic, so each new search tends to contribute additional new molecules; there are potentially hundreds of such molecules.
Should I try to filter them further? For example, by predicted solubility, synthetic accessibility, and/or RML. All these models have errors associated with them, so filtering may weed out potentially useful candidates.
Or is it better to post them all and have experts decide? Also, each molecule could potentially be tweaked a bit based on expert knowledge without impacting the predicted pIC50s significantly.
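
To make the selection question concrete, here is a rough sketch of how candidates from several GA runs could be pooled, deduplicated, and optionally filtered. Only the ca. 6.7 cutoff comes from this post. The `Candidate` fields, the solubility and synthetic accessibility thresholds, and the function names are hypothetical. Deduplication by string assumes the identifiers are already canonical. Molecules with a missing property estimate are kept rather than dropped, since the property models have errors of their own.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional

PIC50_CUTOFF = 6.7  # "ca 6.7" from the post


@dataclass(frozen=True)
class Candidate:
    identifier: str                         # e.g. a canonical SMILES string
    predicted_pic50: float
    predicted_logs: Optional[float] = None  # hypothetical solubility estimate
    sa_score: Optional[float] = None        # hypothetical; lower = easier to make


def pool_runs(runs: Iterable[Iterable[Candidate]]) -> List[Candidate]:
    """Merge several GA runs, keeping one entry per identifier."""
    best = {}
    for run in runs:
        for cand in run:
            seen = best.get(cand.identifier)
            if seen is None or cand.predicted_pic50 > seen.predicted_pic50:
                best[cand.identifier] = cand
    return list(best.values())


def shortlist(
    candidates: Iterable[Candidate],
    min_logs: Optional[float] = None,
    max_sa: Optional[float] = None,
) -> List[Candidate]:
    """Keep candidates above the pIC50 cutoff, then apply optional filters."""
    kept = []
    for cand in candidates:
        if cand.predicted_pic50 <= PIC50_CUTOFF:
            continue
        if (min_logs is not None and cand.predicted_logs is not None
                and cand.predicted_logs < min_logs):
            continue
        if (max_sa is not None and cand.sa_score is not None
                and cand.sa_score > max_sa):
            continue
        kept.append(cand)
    return sorted(kept, key=lambda c: c.predicted_pic50, reverse=True)
```

Calling `shortlist(pool_runs(runs))` with no thresholds corresponds to "post them all and have experts decide"; passing `min_logs` and/or `max_sa` corresponds to filtering further.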

edwintse · 2021-04-27T11:00:25Z

Hi @jhjensen2, based on my gut feeling about the compounds above, I suspect that adding another phenyl ring to either side of the molecule will end up decreasing the potency. It would be great if you could add filters for solubility and synthetic accessibility to see what kind of compounds you get after that.

jhjensen2 mentioned this issue Apr 13, 2021

Mur Ligase Online Research Meeting 2pm April 13th 2021 opensourceantibiotics/murligase#42

