This package adds support for a couple of chemistry data types to intake: CSV files containing SMILES, and SDF files.
You'll need to have both intake and the rdkit installed in your environment:
conda install -c conda-forge intake rdkit
Then you can pip install intake-rdkit directly from this repo:
python -m pip install git+https://github.com/greglandrum/intake-rdkit.git
Here's how to read a compressed CSV file (the file doesn't have to be compressed):
>>> import intake
>>> ds = intake.open_smiles('./files/CHEMBL1821_Ki_set.csv.gz', smilesColumn='canonical_smiles')
>>> df = ds.read()
>>> df.head()
mol chembl_id ... pchembl_value doc_id
0 <rdkit.Chem.rdchem.Mol object at 0x7fa5df5e6490> CHEMBL1794855 ... NaN 6491
1 <rdkit.Chem.rdchem.Mol object at 0x7fa5df6bc6c0> CHEMBL2112955 ... 7.85 16845
2 <rdkit.Chem.rdchem.Mol object at 0x7fa5df6bc760> CHEMBL2112957 ... 8.64 16845
3 <rdkit.Chem.rdchem.Mol object at 0x7fa5df5e6620> CHEMBL369062 ... 7.22 16845
4 <rdkit.Chem.rdchem.Mol object at 0x7fa5df5e6580> CHEMBL2112961 ... 7.37 16845
[5 rows x 7 columns]
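If you aren't sure which column holds the SMILES, it can help to look at the header before calling open_smiles. Here's a minimal sketch that uses only the standard library. Note that csv_columns is a hypothetical helper, not part of intake or intake-rdkit, and it assumes a comma-separated file with a header row.

```python
import csv
import gzip


def csv_columns(path):
    """Return the header row of a CSV file, handling .gz transparently.

    Hypothetical helper for illustration; not part of intake-rdkit.
    """
    # Pick the opener from the file extension; both accept text mode.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        # Only the first row is read, so this is cheap even for big files.
        return next(csv.reader(handle), [])
```

For the file used above, csv_columns('./files/CHEMBL1821_Ki_set.csv.gz') would be expected to include canonical_smiles, which is the name passed as smilesColumn.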
And here's an SDF (you can read .sdf.gz too):
>>> ds = intake.open_sdf('./files/jm200186n.sdf')
>>> df = ds.read()
>>> df.head()
mol scaffold
0 <rdkit.Chem.rdchem.Mol object at 0x7fa5df5763f0> 1.0
1 <rdkit.Chem.rdchem.Mol object at 0x7fa5df576080> NaN
2 <rdkit.Chem.rdchem.Mol object at 0x7fa5df5765d0> NaN
3 <rdkit.Chem.rdchem.Mol object at 0x7fa5df576760> NaN
4 <rdkit.Chem.rdchem.Mol object at 0x7fa5df5760d0> NaN
Note that calling ds.read() parses all the molecules in the dataset and reads them into a pandas DataFrame, so be careful with big data files.
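One way to gauge how big an SDF is before calling ds.read() is to count its records without parsing any molecules. The sketch below relies on the SDF convention that each record ends with a line containing only `$$$$`. Here count_sdf_records is a hypothetical helper, not part of intake or intake-rdkit, and the count may differ from the number of rows you get back if some records can't be parsed.

```python
import gzip


def count_sdf_records(path):
    """Count records in an SDF file by counting '$$$$' terminator lines.

    Hypothetical helper for illustration; not part of intake-rdkit.
    """
    # Pick the opener from the file extension; .sdf.gz is read as text too.
    opener = gzip.open if str(path).endswith(".gz") else open
    # errors="replace" keeps odd bytes in data fields from aborting the scan.
    with opener(path, "rt", errors="replace") as handle:
        # Stream line by line so memory use stays flat for large files.
        return sum(1 for line in handle if line.rstrip() == "$$$$")
```

Something like count_sdf_records('./files/jm200186n.sdf') gives a rough idea of how many molecules ds.read() is going to build.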
Using the data sources from an intake catalog is more interesting.
Here's the catalog I'm working with, which I have saved in a file called literature.yaml:
metadata:
  version: 1
  creator:
    name: greg landrum
    email: greg.landrum@t5informatics.com
  summary: |
    Collection of datasets pulled from the literature
sources:
  cdk2_project:
    description: screening results and synthesized compounds for a CDK2 project.
    args:
      filename: '{{ CATALOG_DIR }}/files/jm020472j_2.csv.gz'
      smilesColumn: Smiles
    metadata:
      journal_url: https://pubs.acs.org/doi/10.1021/jm020472j
      additional_notes: |
        The scaffold column is a manual assignment to chemical series.
        The sourcepool column indicates whether the compound comes from the
        screening deck (divscreen) or was synthesized for the project
        (synscreen)
    driver: intake_rdkit.smiles.SmilesSource
  platinum_2017:
    description: Platinum 2017 set for testing conformation generators
    args:
      filename: '{{ CATALOG_DIR }}/files/platinum_dataset_2017_01.sdf.gz'
    metadata:
      journal_url: https://pubs.acs.org/doi/10.1021/acs.jcim.6b00613
      additional_notes: |
        This is the 2017 update of the platinum set as described here:
        https://pubs.acs.org/doi/10.1021/acs.jcim.7b00505
    driver: intake_rdkit.sdf.SDFSource
And here's how I work with it:
>>> cat = intake.open_catalog('./literature.yaml')
>>> cat.metadata
{'version': 1, 'creator': {'name': 'greg landrum', 'email': 'greg.landrum@t5informatics.com'}, 'summary': 'Collection of datasets pulled from the literature\n'}
>>> for entry in cat:
...     print(entry)
...
cdk2_project
platinum_2017
>>> cat.platinum_2017
sources:
  platinum_2017:
    args:
      filename: /scratch/cheminformatics_datasets/.//files/platinum_dataset_2017_01.sdf.gz
    description: Platinum 2017 set for testing conformation generators
    driver: intake_rdkit.sdf.SDFSource
    metadata:
      additional_notes: 'This is the 2017 update of the platinum set as described
        here: https://pubs.acs.org/doi/10.1021/acs.jcim.7b00505'
      catalog_dir: /scratch/cheminformatics_datasets/./
      journal_url: https://pubs.acs.org/doi/10.1021/acs.jcim.6b00613
>>> df = cat.platinum_2017.read()
>>> len(df)
4548
>>> df.head()
mol
0 <rdkit.Chem.rdchem.Mol object at 0x7fa5df6205d0>
1 <rdkit.Chem.rdchem.Mol object at 0x7fa5df620a80>
2 <rdkit.Chem.rdchem.Mol object at 0x7fa5df620670>
3 <rdkit.Chem.rdchem.Mol object at 0x7fa5df620210>
4 <rdkit.Chem.rdchem.Mol object at 0x7fa5f8c69e40>
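The expanded filename in the cat.platinum_2017 output above comes from intake filling in {{ CATALOG_DIR }} with the directory containing the catalog file. The sketch below only mimics that substitution with a plain string replacement, to show where the doubled slash in the path comes from. Here expand_catalog_dir is a hypothetical helper, not part of intake or intake-rdkit, and intake's own templating is more general than this.

```python
import posixpath


def expand_catalog_dir(template, catalog_dir):
    """Replace the {{ CATALOG_DIR }} placeholder with a concrete directory.

    Hypothetical helper for illustration; intake does this step itself.
    """
    return template.replace("{{ CATALOG_DIR }}", catalog_dir)


# Values copied from the catalog entry and the catalog_dir metadata above.
template = "{{ CATALOG_DIR }}/files/platinum_dataset_2017_01.sdf.gz"
catalog_dir = "/scratch/cheminformatics_datasets/./"

expanded = expand_catalog_dir(template, catalog_dir)
print(expanded)
# /scratch/cheminformatics_datasets/.//files/platinum_dataset_2017_01.sdf.gz

# The "./" and "//" are harmless; normalizing the path shows the same file.
print(posixpath.normpath(expanded))
# /scratch/cheminformatics_datasets/files/platinum_dataset_2017_01.sdf.gz
```

Because the placeholder resolves relative to the catalog file, the catalog and its files directory can be moved together without editing the YAML.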