Skip to content

A character RNN model and a GPT model for reading molecular SMILES strings, and generating new SMILES strings based on learned molecular language.

License

Notifications You must be signed in to change notification settings

MauricioCafiero/MolecularGPTandRNN

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This contains two models: a basic character RNN and a GPT model that both read molecular SMILES strings and train themselves on string completion. The RNN has one embedding layer, one GRU layer, and then a dense output layer (TensorFlow). The GPT model uses the multihead attention layer from Keras. Tokenization of the string is done via the SMILES tokenizer implemented in DeepChem (https://github.com/deepchem/deepchem) and molecule visualization is done using RDKit.

If using this on a new dataset of SMILES strings, use the basic vocab file (vocab.txt, also from DeepChem), and then build your own specific vocab file using the last block in the notebook. You can just use the original vocab file, but training is more efficient with a smaller vocabulary.

In my tests, 50-100 epochs of training on the RNN on the ~6700 item training set gets 92% training accuracy and 73% validation accuracy. The GPT model had 300 epochs of training.

About

A character RNN model and a GPT model for reading molecular SMILES strings, and generating new SMILES strings based on learned molecular language.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published