- Inspired by Moleculenet.ai
- Selection of data sets of molecules (SMILES) and physicochemical properties
- All SMILES in the data sets have been standardized with RDKit
- Gathers the data sets in one place
- Use it for validating the inference of molecular properties through various machine learning models as proposed in Z. Wu et al.
- All data sets are regularized with RDKit methods to output isomeric, canonical and kekulized SMILES (Daylight)
- If a SMILES could not be regularized, it is replaced by a blank entry relative to the original data set
- Quantum Mechanics: QM9
- Physical Chemistry: ESOL, FreeSolv, Lipophilicity
- Biophysics: PCBA, HIV, BACE
- Physiology: BBBP, Tox21, ToxCast, SIDER, ClinTox
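
The regularization described above can be sketched with RDKit as follows. This is a minimal sketch, not necessarily the exact procedure used to build these data sets: the function name `regularize_smiles` is hypothetical, and the use of `Chem.Kekulize` followed by `Chem.MolToSmiles` is an assumption about which RDKit calls are involved.

```python
from rdkit import Chem


def regularize_smiles(smiles: str) -> str:
    """Return an isomeric, canonical, kekulized SMILES, or "" on failure.

    Hypothetical helper: the name and the exact sequence of RDKit calls
    are assumptions made for illustration only.
    """
    # MolFromSmiles returns None when the input cannot be parsed or sanitized.
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return ""
    try:
        # Convert aromatic bonds to explicit single/double bonds.
        Chem.Kekulize(mol, clearAromaticFlags=True)
        return Chem.MolToSmiles(
            mol,
            isomericSmiles=True,  # keep stereochemistry and isotope labels
            canonical=True,       # one unique string per molecule
            kekuleSmiles=True,    # write the Kekule form, no aromatic atoms
        )
    except Exception:
        # Any RDKit failure is mapped to a blank, mirroring the rule above.
        return ""


if __name__ == "__main__":
    raw = ["c1ccccc1", "OCC", "not a smiles"]
    regularized = [regularize_smiles(s) for s in raw]
    # Drop the blanks before feeding the data to a model.
    usable = [s for s in regularized if s]
    print(regularized)
    print(usable)
```

Blank entries mark molecules that failed regularization, so they would typically be filtered out, as in the last lines of the sketch, before training or evaluating a model.
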
The short descriptions below are taken from Moleculenet.ai. The inference task for each regularized data set reported here is given in square brackets:
- QM9: Geometric, energetic, electronic and thermodynamic properties of DFT-modelled small molecules [regression]
- ESOL: Water solubility data (log solubility in mols per litre) for common organic small molecules [regression]
- FreeSolv: Experimental and calculated hydration free energy of small molecules in water [regression]
- Lipophilicity: Experimental results of octanol/water distribution coefficient (logD at pH 7.4) [regression]
- PCBA: Selected from PubChem BioAssay, consisting of measured biological activities of small molecules generated by high-throughput screening [classification]
- HIV: Experimentally measured abilities to inhibit HIV replication [classification]
- BACE: Quantitative (IC50) and qualitative (binary label) binding results for a set of inhibitors of human β-secretase 1 (BACE-1) [classification/regression]
- BBBP: Binary labels of blood-brain barrier penetration (permeability) [classification]
- Tox21: Qualitative toxicity measurements on 12 biological targets, including nuclear receptors and stress response pathways [classification]
- ToxCast: Toxicology data for a large library of compounds based on in vitro high-throughput screening, including experiments on over 600 tasks [classification]
- SIDER: Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes [classification]
- ClinTox: Qualitative data of drugs approved by the FDA and those that have failed clinical trials for toxicity reasons [classification]
Source: Moleculenet.ai
Paper: Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, Vijay Pande, "MoleculeNet: A Benchmark for Molecular Machine Learning", arXiv:1703.00564 [cs.LG], 2017