If you would like to cite this work:
- The following datasets are reported in the paper "Out-of-the-Box Deep Learning Prediction of Pharmaceutical Properties by Broadly Learned Knowledge-Based Molecular Representations". Please refer to that paper for details of these datasets.
Data Class | Dataset | No. of Molecules | No. of Tasks | Task Metric | Task Type |
---|---|---|---|---|---|
Physico-chemical | ESOL Water solubility | 1128 | 1 | RMSE | Regression |
Physico-chemical | FreeSolv Solvation free energy | 642 | 1 | RMSE | Regression |
Physico-chemical | Lipop Lipophilicity | 4200 | 1 | RMSE | Regression |
Molecular binding | PDBbind-F, PDBbind-C, PDBbind-R Ligand-protein binding: full, core, refined (3 datasets) | 9880, 168, 3040 | 1 for each | RMSE | Regression |
Bio-activity | PCBA PubChem HTS bioAssay | 437929 | 128 | PRC_AUC | Classification |
Bio-activity | MUV PubChem bioAssay | 93087 | 17 | PRC_AUC | Classification |
Bio-activity | ChEMBL bioassay activity dataset | 456331 | 1310 | ROC_AUC | Classification |
Bio-activity | Cancer cell-line IC50 A2780, CCRF-CEM, DU-145, HCT-15, KB, LoVo, PC-3, SK-OV-3 (8 datasets) | 2255, 3047, 2512, 994, 2731, 1120, 4294, 1589 | 1 for each | R2 | Regression |
Bio-activity | Malaria Anti-malarial EC50 | 9998 | 1 | RMSE | Regression |
Bio-activity | BACE-1 benchmark set, ChEMBL novel set, ChEMBL common set, Clinical drugs | 1513, 395, 5324, 26 | 1 | ROC_AUC | Classification |
Bio-activity | HIV replication inhibition | 41127 | 1 | ROC_AUC | Classification |
Toxicity | Tox21 Toxicology in the 21st century | 7831 | 12 | ROC_AUC | Classification |
Toxicity | SIDER Adverse drug reactions of marketed drugs | 1427 | 27 | ROC_AUC | Classification |
Toxicity | ClinTox Clinical trial toxicity | 1478 | 2 | ROC_AUC | Classification |
Pharmacokinetic | CYP PubChem BioAssay CYP 1A2, 2C9, 2C19, 2D6, 3A4 inhibition | 16896 | 5 | ROC_AUC | Classification |
Pharmacokinetic | LMC-H, LMC-R, LMC-M (Liver Microsomal Clearance in human, rat, mouse) | 8755 | 3 | R2 | Regression |
Pharmacokinetic | BBBP Blood-brain barrier penetration | 2039 | 1 | ROC_AUC | Classification |
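As a quick reference for the Task Metric column, the two regression metrics can be computed as in the minimal sketch below. The function names `rmse` and `r2` are illustrative and are not part of ChemBench, and R2 is taken here to be the coefficient of determination, which is an assumption (it is sometimes reported as the squared Pearson correlation instead).

```python
import math

def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only (not taken from any dataset above).
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

Lower RMSE is better, while higher R2 is better, so the two columns should not be compared directly across datasets.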
These benchmark datasets and their split indices have been generated in this repo. The following table summarizes them.
task_id | task_name | task_type | n_samples | n_task | split_method | n_cross_split | task_metrics |
---|---|---|---|---|---|---|---|
01 | ESOL | regression | 1128 | 1 | random | 3 | RMSE |
02 | FreeSolv | regression | 642 | 1 | random | 3 | RMSE |
03 | Lipop | regression | 4200 | 1 | random | 3 | RMSE |
04 | PDBbind-full | regression | 9880 | 1 | time | 1 | RMSE |
05 | PDBbind-core | regression | 168 | 1 | time | 1 | RMSE |
06 | PDBbind-refined | regression | 3040 | 1 | time | 1 | RMSE |
07 | PCBA | classification | 437929 | 128 | random | 3 | PRC_AUC |
08 | MUV | classification | 93087 | 17 | random | 3 | PRC_AUC |
09 | HIV | classification | 41127 | 1 | scaffold | 3 | ROC_AUC |
10 | BACE | classification | 1513 | 1 | scaffold | 3 | ROC_AUC |
11 | BBBP | classification | 2039 | 1 | scaffold | 3 | ROC_AUC |
12 | Tox21 | classification | 7831 | 12 | random | 3 | ROC_AUC |
13 | ToxCast | classification | 8576 | 617 | random | 3 | ROC_AUC |
14 | SIDER | classification | 1427 | 27 | random | 3 | ROC_AUC |
15 | ClinTox | classification | 1478 | 2 | random | 3 | ROC_AUC |
16 | ChEMBL | classification | 456331 | 1310 | scaffold | 3 | ROC_AUC |
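For the tasks marked `random` with an `n_cross_split` of 3, each dataset comes with three independent train/valid/test index sets. The sketch below shows one way such repeated random splits could be generated. It is not the code used in this repo: the 8:1:1 ratio, the seeds, and the function name `random_split_indices` are assumptions made for illustration.

```python
import random

def random_split_indices(n_samples, n_repeats=3, frac_train=0.8, frac_valid=0.1, seed=0):
    """Return a list of (train_idx, valid_idx, test_idx) tuples, one per repeat."""
    splits = []
    for repeat in range(n_repeats):
        rng = random.Random(seed + repeat)  # a different shuffle for each repeat
        idx = list(range(n_samples))
        rng.shuffle(idx)
        n_train = int(frac_train * n_samples)
        n_valid = int(frac_valid * n_samples)
        splits.append((
            idx[:n_train],                    # training positions
            idx[n_train:n_train + n_valid],   # validation positions
            idx[n_train + n_valid:],          # remaining positions go to the test set
        ))
    return splits

# 1128 is the ESOL sample count from the table; the split ratio is assumed.
splits = random_split_indices(1128)
for train_idx, valid_idx, test_idx in splits:
    print(len(train_idx), len(valid_idx), len(test_idx))
```

Scaffold and time splits differ only in how the ordering is chosen (by molecular scaffold or by date rather than by a random shuffle), so the same index-triple layout applies.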
Direct installation:
pip install git+https://github.com/shenwanxiang/ChemBench.git
Developer installation:
git clone https://github.com/shenwanxiang/ChemBench.git
cd ChemBench
pip install -e .
from chembench import load_data
df, induces = load_data('ESOL')
# `induces` holds one (train, valid, test) index triple for each of the 3 random splits
train_idx, valid_idx, test_idx = induces[0]
train_idx, valid_idx, test_idx = induces[1]
train_idx, valid_idx, test_idx = induces[2]
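Each index triple can then be used to slice the data into subsets. The sketch below illustrates this with a small stand-in list and hand-written indices rather than the real `load_data` output; the helper `take` is hypothetical. If `df` is a pandas DataFrame and the indices are row positions (both assumptions here), `df.iloc[train_idx]` would play the same role.

```python
def take(rows, idx):
    """Select the rows at the given integer positions."""
    return [rows[i] for i in idx]

# Stand-in data: (SMILES, target) pairs with made-up target values.
rows = [("CCO", -0.1), ("c1ccccc1", -1.2), ("CC(=O)O", 0.3), ("CCCC", -2.0), ("CN", 0.9)]

# Stand-in for one element of `induces`.
train_idx, valid_idx, test_idx = [0, 3, 4], [1], [2]

train = take(rows, train_idx)
valid = take(rows, valid_idx)
test = take(rows, test_idx)
print(len(train), len(valid), len(test))
```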
from chembench import dataset
data = dataset.load_ESOL()
data.x
data.y
data.description
## regression
dataset.load_Lipop()
dataset.load_ESOL()
dataset.load_FreeSolv()
dataset.load_Malaria()
dataset.load_LMC()
dataset.load_PDBF()
dataset.load_PDBC()
dataset.load_PDBR()
## classification
dataset.load_BBBP()
dataset.load_BACE()
dataset.load_HIV()
dataset.load_MUV()
dataset.load_Tox21()
dataset.load_SIDER()
dataset.load_CYP450()
dataset.load_ToxCast()
dataset.load_ClinTox()
dataset.load_ChEMBL()
dataset.load_PCBA()
Cluster split results are also available. For example, to load the cluster splits and random splits for the ESOL dataset:
from chembench import get_cluster_induces
induces1 = get_cluster_induces("ESOL", induces="random_5fcv_5rpts")
induces2 = get_cluster_induces("ESOL", induces="scaffold_5fcv_1rpts")
print(len(induces1))
print(len(induces2))
For example, the chemical space of the ESOL dataset using a 5-fold cluster split:
The Kolmogorov-Smirnov statistic on the distributions for pairwise groups (clusters):
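The two-sample Kolmogorov-Smirnov statistic is the largest absolute difference between the empirical cumulative distribution functions of two samples. The sketch below computes it for every pair of groups in pure Python. The group names and values are made up for illustration, the function names are hypothetical, and this is not necessarily how the results in this repo were produced (`scipy.stats.ks_2samp` computes the same statistic).

```python
from bisect import bisect_right
from itertools import combinations

def ks_statistic(a, b):
    """Two-sample KS statistic: max |ECDF_a(x) - ECDF_b(x)| over all observed x."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    d = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def pairwise_ks(groups):
    """KS statistic for every unordered pair of named groups."""
    return {
        (g1, g2): ks_statistic(groups[g1], groups[g2])
        for g1, g2 in combinations(sorted(groups), 2)
    }

# Made-up target values for three hypothetical clusters.
groups = {
    "cluster_0": [-3.1, -2.4, -1.9, -0.5],
    "cluster_1": [-3.0, -2.2, -1.5, -0.7],
    "cluster_2": [0.2, 0.8, 1.1, 1.9],
}
for pair, d in pairwise_ks(groups).items():
    print(pair, round(d, 3))
```

A value near 0 means two clusters have similar target distributions, while a value near 1 means they barely overlap.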
After installing the package in development mode and installing `tox` with `pip install tox`, the commands for making a new release are contained within the `finish` environment in `tox.ini`. Run the following from the shell:
$ tox -e finish
This script does the following:
- Uses BumpVersion to switch the version number in `setup.cfg` and `src/chembench/version.py` to not have the `-dev` suffix
- Packages the code in both a tar archive and a wheel
- Uploads to PyPI using `twine`. Be sure to have a `.pypirc` file configured to avoid the need for manual input at this step
- Pushes to GitHub. You'll need to make a release going with the commit where the version was bumped.
- Bumps the version to the next patch. If you made big changes and want to bump the version by minor, you can use `tox -e bumpversion minor` after.
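The version changes described in the steps above can be illustrated with a small sketch. It only manipulates strings and is not how BumpVersion is implemented. The function names, the example version `0.1.0-dev`, and the assumption that the `-dev` suffix is re-added after the bump are all hypothetical.

```python
def release_version(dev_version):
    """Drop a trailing '-dev' suffix, e.g. '1.2.3-dev' -> '1.2.3'."""
    suffix = "-dev"
    if dev_version.endswith(suffix):
        return dev_version[:-len(suffix)]
    return dev_version

def next_dev_version(released, part="patch"):
    """Bump the patch or minor part and re-add '-dev' (assumed convention)."""
    major, minor, patch = (int(p) for p in released.split("."))
    if part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"unsupported part: {part}")
    return f"{major}.{minor}.{patch}-dev"

current = "0.1.0-dev"                 # made-up starting version
released = release_version(current)   # version that gets packaged and uploaded
print(released, next_dev_version(released), next_dev_version(released, "minor"))
```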