The goal is to use my graph-based genetic algorithm (GA) to maximise pIC50 values predicted using machine learning (ML). Here are some preliminary results.
ML model 1 (ML1)
Random Forest (RF) model using 1024-bit ECFP4 fingerprints. Training was done with PyCaret.
Trained on all Series 4 pIC50 values in the training set (except 3 B-containing compounds for which the SMILES strings didn’t resolve).
The model has a precision of 91% on the holdout set, excluding the B-containing compound but including the 4 new experimentally validated molecules from the competition.
The ML models cannot predict pIC50 values that are significantly larger than those predicted for the training set, i.e. around 7. For new molecules, predicted pIC50 values close to 7 are thus potentially lower bounds.
I also made an ML model (ML1_65) where I removed molecules with pIC50 > 6.5 from the training set.
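The ceiling on predicted values follows from how a random forest regressor forms its output. A sketch of the argument, assuming the standard setup in which each leaf predicts the mean of the training targets it contains (the symbols $T$, $L_t(x)$ and $y_i$ are introduced here for illustration and are not from the original post):

```latex
% T trees; L_t(x) = set of training samples sharing a leaf with x in tree t;
% y_i = measured pIC50 of training sample i.
\hat{y}(x) = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{|L_t(x)|} \sum_{i \in L_t(x)} y_i
\quad\Longrightarrow\quad
\min_i y_i \;\le\; \hat{y}(x) \;\le\; \max_i y_i
```

Because the prediction is an average of averages of training values, it can never exceed the largest pIC50 in the training set, and in practice it sits below that maximum unless every tree places the molecule in a leaf containing only top-valued compounds.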
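The construction of the ML1_65 training set can be sketched as a simple cutoff on the measured values. This is only an illustration: the records, the `split_by_cutoff` helper and the variable names are hypothetical, and the pIC50 values are made up.

```python
# Hypothetical training records: (SMILES, measured pIC50). Values are invented.
training_set = [
    ("CCO", 5.2),
    ("c1ccccc1", 6.5),
    ("CCN", 6.9),
    ("CCC", 7.1),
]

PIC50_CUTOFF = 6.5  # from the text: molecules with pIC50 > 6.5 are removed


def split_by_cutoff(records, cutoff=PIC50_CUTOFF):
    """Return (kept, removed): records at or below the cutoff, and those above it."""
    kept = [(smiles, pic50) for smiles, pic50 in records if pic50 <= cutoff]
    removed = [(smiles, pic50) for smiles, pic50 in records if pic50 > cutoff]
    return kept, removed


kept, removed = split_by_cutoff(training_set)
```

Keeping the removed molecules in a separate list makes it possible to check later whether a search rediscovers any of them.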
GA searches for molecules with large pIC50 values
The starting population is drawn from the training set for ML1_65. The population size is 50 and the search is run for 150 generations.
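For readers unfamiliar with the overall loop, here is a minimal, generic GA skeleton using the population size (50) and generation count (150) stated above. It is a sketch under assumptions and not the graph-based GA itself: the selection scheme, the `mutation_rate` value, and the toy bit-string individuals with their `score`, `crossover` and `mutate` functions are all hypothetical stand-ins for molecular graphs, the ML-predicted pIC50, and the graph-based operators.

```python
import random


def run_ga(initial_pool, score, crossover, mutate,
           population_size=50, generations=150, mutation_rate=0.5, seed=0):
    """Generic elitist GA loop; individuals must be hashable."""
    rng = random.Random(seed)
    # Draw the starting population from a pool (the training set in the post).
    population = rng.sample(initial_pool, population_size)
    for _ in range(generations):
        scores = [score(ind) for ind in population]
        low = min(scores)
        # Score-proportional mating weights, shifted so every weight is positive.
        weights = [s - low + 1e-9 for s in scores]
        children = []
        for _ in range(population_size):
            parent_a, parent_b = rng.choices(population, weights=weights, k=2)
            child = crossover(parent_a, parent_b, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            children.append(child)
        # Merge parents and children, drop duplicates, keep the best scorers.
        merged = list(dict.fromkeys(population + children))
        merged.sort(key=score, reverse=True)
        population = merged[:population_size]
    return population


# Toy stand-ins: 20-bit tuples scored by their number of 1-bits.
def toy_score(ind):
    return sum(ind)


def toy_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def toy_mutate(ind, rng):
    i = rng.randrange(len(ind))
    return ind[:i] + (1 - ind[i],) + ind[i + 1:]


pool_rng = random.Random(1)
toy_pool = [tuple(pool_rng.randint(0, 1) for _ in range(20)) for _ in range(200)]
top = run_ga(toy_pool, toy_score, toy_crossover, toy_mutate)
```

Changing `seed` gives a different trajectory, which is why repeated searches can surface different top candidates.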
A GA search using ML1_65 to predict pIC50 values finds the top molecules shown below together with their predicted pIC50 values.
Note that two of the molecules are already known, i.e. rediscovered by the search, and have measured pIC50 values significantly larger than those predicted by the ML1_65 model.
Thus, the GA/ML approach can be used to identify potentially high-scoring molecules.
A GA search using ML1 to predict pIC50 values finds the top 10 molecules shown below together with their predicted pIC50 values.
How to select the best molecules, if any, for further study?
Molecules with predicted pIC50 values greater than ca. 6.7 should not be eliminated based on the pIC50s.
The GA search is stochastic, so each new search tends to contribute new molecules; there are potentially hundreds of such molecules.
Should I try to filter them further? For example, by predicted solubility, synthetic accessibility, and/or RML. All these models have errors associated with them, so filtering may weed out potentially useful candidates.
Or is it better to post them all and have experts decide? Also, each molecule could potentially be tweaked a bit based on expert knowledge without impacting the predicted pIC50s significantly.
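One possible middle ground between filtering and posting everything is to pool the candidates from several runs, apply the filters, and report which filter flagged each molecule rather than silently discarding it. The sketch below is an assumption-laden illustration: `pool_runs`, `apply_filters`, the example runs and the `toy_length_filter` predicate are hypothetical, the predicate is only a placeholder for a real solubility or synthetic accessibility model, and deduplication by string assumes the SMILES are already canonical.

```python
PIC50_THRESHOLD = 6.7  # from the text: keep molecules predicted above ca. 6.7


def pool_runs(runs, threshold=PIC50_THRESHOLD):
    """Merge candidates from several GA runs, keeping unique SMILES above the threshold."""
    pooled = {}
    for run in runs:
        for smiles, pic50 in run:
            if pic50 > threshold:
                pooled[smiles] = max(pic50, pooled.get(smiles, pic50))
    return pooled


def apply_filters(pooled, filters):
    """Split candidates into accepted ones and flagged ones with the filters they failed."""
    accepted, flagged = {}, {}
    for smiles, pic50 in pooled.items():
        failed = [name for name, passes in filters.items() if not passes(smiles)]
        if failed:
            flagged[smiles] = failed
        else:
            accepted[smiles] = pic50
    return accepted, flagged


# Invented example data: two runs of (SMILES, predicted pIC50).
runs = [
    [("CCO", 6.9), ("CCN", 6.5)],
    [("CCO", 6.8), ("CCCC", 7.0)],
]
# Placeholder predicate, NOT a real property model.
filters = {"toy_length_filter": lambda smiles: len(smiles) <= 3}

pooled = pool_runs(runs)
accepted, flagged = apply_filters(pooled, filters)
```

Keeping the flagged molecules together with the reason they were flagged lets experts overrule an imperfect filter instead of never seeing the candidate.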
Hi @jhjensen2, based on a gut feeling about the compounds above, I suspect that adding another phenyl ring to either side of the molecule will end up decreasing the potency. It would be great if you could add filters for solubility and synthetic accessibility to see what kind of compounds you get after that.