GitHub - MauricioCafiero/MolecularGPTandRNN: A character RNN model and a GPT model for reading molecular SMILES strings, and generating new SMILES strings based on learned molecular language.

This repository contains two models, a basic character RNN and a GPT model. Both read molecular SMILES strings and are trained on string completion. The RNN (TensorFlow) has one embedding layer, one GRU layer, and a dense output layer. The GPT model uses the multi-head attention layer from Keras. Strings are tokenized with the SMILES tokenizer implemented in DeepChem (https://github.com/deepchem/deepchem), and molecules are visualized with RDKit.
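To make the string-completion setup concrete, here is a minimal pure-Python sketch. It assumes that "string completion" means next-token prediction, so each target sequence is the input shifted by one token. The regular expression is a pattern commonly used for SMILES tokenization and is only an approximation of what DeepChem does. The names `SMILES_PATTERN`, `tokenize_smiles`, and `completion_pair` are hypothetical and do not come from the notebooks.

```python
import re

# Assumption: a regex commonly used for SMILES tokenization. It is similar in
# spirit to, but not necessarily identical to, the pattern DeepChem uses.
SMILES_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>>?|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens (atoms, bonds, ring closures, brackets)."""
    tokens = SMILES_PATTERN.findall(smiles)
    # Guard against characters that the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"Could not fully tokenize: {smiles!r}")
    return tokens


def completion_pair(tokens):
    """Return (inputs, targets), where each target is the token that follows its input."""
    return tokens[:-1], tokens[1:]


# Azobenzene, used here only as an illustrative molecule.
tokens = tokenize_smiles("c1ccccc1N=Nc1ccccc1")
inputs, targets = completion_pair(tokens)
```

In a real pipeline the tokens would be mapped to integer IDs through the vocab file before being passed to the embedding layer; that step is omitted here.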

If you use this on a new dataset of SMILES strings, start from the basic vocab file (vocab.txt, also from DeepChem), then build a dataset-specific vocab file using the last block in the notebook. You can use the original vocab file as is, but training is more efficient with a smaller vocabulary.
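The idea of shrinking the vocabulary can be sketched as follows. This is not the notebook's code, only one plausible approach: keep the special tokens plus any base-vocab token that actually occurs in the tokenized dataset, preserving the base file's order. The `SPECIAL_TOKENS` tuple, the function name `build_reduced_vocab`, and the toy data are all assumptions made for illustration.

```python
# Assumption: BERT-style special tokens that should always be kept.
SPECIAL_TOKENS = ("[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]")


def build_reduced_vocab(base_vocab, tokenized_smiles):
    """Keep special tokens and every base-vocab token seen in the dataset."""
    # Collect the set of tokens that appear anywhere in the dataset.
    used = {tok for tokens in tokenized_smiles for tok in tokens}
    # Filter the base vocabulary, preserving its original order.
    return [tok for tok in base_vocab if tok in SPECIAL_TOKENS or tok in used]


# Toy stand-ins for the contents of a vocab file and a tokenized dataset.
base_vocab = ["[PAD]", "[UNK]", "C", "N", "O", "S", "c", "n", "=", "1", "(", ")"]
dataset = [
    ["c", "1", "c", "c", "c", "c", "c", "1"],
    ["C", "N", "=", "N", "C"],
]

reduced = build_reduced_vocab(base_vocab, dataset)
print(reduced)
```

A smaller vocabulary shrinks both the embedding table and the output layer, which is where the efficiency gain would come from.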

In my tests, training the RNN for 50-100 epochs on the ~6700-item training set gives 92% training accuracy and 73% validation accuracy. The GPT model was trained for 300 epochs.

Files in this repository:

6731-azo.csv
LICENSE
MolCharRNN.ipynb
Molecular_GTP-GC.ipynb
README.md
vocab.txt
vocab_new.txt
