Commit 8f4b3cb

Merge pull request #6 from CDDLeiden/enhancement/merge_contrib
Integrate changes from contrib branch
2 parents 8b5fa9b + 58be23c commit 8f4b3cb

5 files changed

+132
-73
lines changed

README.md

Lines changed: 62 additions & 14 deletions
@@ -6,11 +6,30 @@ QSPRpred
 <img src='figures/QSPRpred_logo.jpg' width=10% align=right>
 <p align=left width=70%>

-QSPRpred is open-source software libary for building **Quantitative Structure Property Relationship (QSPR)** model developed by Gerard van Westen's Computational Drug Discovery group. It provides a unified interface for building QSPR models based on different types of descriptors and machine learning algorithms. We developed this package to support our research, recognizing the necessity to reduce repetition in our model building workflow and improve the reproducibility and reusability of our models. In making this package available here, we hope that it may be of use to other researchers as well. QSPRpred is still in active development, and we welcome contributions and feedback from the community.
-
-QSPRpred is designed to be modular and extensible, so that new functionality can be easily added. A command line interface is available for basic use cases to quickly, explore varying scenarios. For more advanced use cases, the Python API offers extra flexibility and control, allowing more complex workflows and additional features.
-
-Internally, QSPRpred relies heavily on the <a href="https://www.rdkit.org">RDKit</a> and <a href="https://scikit-learn.org/stable/">scikit-learn</a> libraries. Furthermore, for scikit-learn model saving and loading, QSPRpred uses <a href="https://github.com/OlivierBeq/ml2json">ml2json</a> for safer and interpretable model serialization. QSPRpred is also interoperable with <a href="https://github.com/OlivierBeq/Papyrus-scripts">Papyrus</a>, a large scale curated dataset aimed at bioactivity predictions, for data collection. Models developed with QSPRpred are compatible with the group's *de novo* drug design package <a href="https://github.com/CDDLeiden/DrugEx/">DrugEx</a>.
+QSPRpred is open-source software libary for building **Quantitative Structure Property
+Relationship (QSPR)** model developed by Gerard van Westen's Computational Drug
+Discovery group. It provides a unified interface for building QSPR models based on
+different types of descriptors and machine learning algorithms. We developed this
+package to support our research, recognizing the necessity to reduce repetition in our
+model building workflow and improve the reproducibility and reusability of our models.
+In making this package available here, we hope that it may be of use to other
+researchers as well. QSPRpred is still in active development, and we welcome
+contributions and feedback from the community.
+
+QSPRpred is designed to be modular and extensible, so that new functionality can be
+easily added. A command line interface is available for basic use cases to quickly,
+explore varying scenarios. For more advanced use cases, the Python API offers extra
+flexibility and control, allowing more complex workflows and additional features.
+
+Internally, QSPRpred relies heavily on the <a href="https://www.rdkit.org">RDKit</a>
+and <a href="https://scikit-learn.org/stable/">scikit-learn</a> libraries. Furthermore,
+for scikit-learn model saving and loading, QSPRpred
+uses <a href="https://github.com/OlivierBeq/ml2json">ml2json</a> for safer and
+interpretable model serialization. QSPRpred is also interoperable
+with <a href="https://github.com/OlivierBeq/Papyrus-scripts">Papyrus</a>, a large scale
+curated dataset aimed at bioactivity predictions, for data collection. Models developed
+with QSPRpred are compatible with the group's *de novo* drug design
+package <a href="https://github.com/CDDLeiden/DrugEx/">DrugEx</a>.


 Quick Start
@@ -21,45 +40,74 @@ Quick Start
 QSPRpred can be installed with pip like so (with python >= 3.10):

 ```bash
-pip install git+https://github.com/CDDLeiden/QSPRPred.git@main
+pip install git+https://github.com/CDDLeiden/QSPRpred.git@main
 ```

-Note that this will install the basic dependencies, but not the optional dependencies. If you want to use the optional dependencies, you can install the package with an option:
+Note that this will install the basic dependencies, but not the optional dependencies.
+If you want to use the optional dependencies, you can install the package with an
+option:

 ```bash
-pip install git+https://github.com/CDDLeiden/QSPRPred.git@main#egg=qsprpred[<option>]
+pip install git+https://github.com/CDDLeiden/QSPRpred.git@main#egg=qsprpred[<option>]
 ```

 The following options are available:
-- extra : include extra dependencies for PCM models and extra descriptor sets from packages other than RDKit
+
+- extra : include extra dependencies for PCM models and extra descriptor sets from
+packages other than RDKit
 - deep : include deep learning models (torch and chemprop)
-- pyboost : include pyboost model (requires cupy, `pip install cupy-cudaX`, replace X with your [cuda version](https://docs.cupy.dev/en/stable/install.html), you can obtain cude toolkit from Anaconda as well: `conda install cudatoolkit`)
-- full : include all optional dependecies (requires cupy, `pip install cupy-cudaX`, replace X with your [cuda version](https://docs.cupy.dev/en/stable/install.html))
+- pyboost : include pyboost model (requires cupy, `pip install cupy-cudaX`, replace X
+with your [cuda version](https://docs.cupy.dev/en/stable/install.html), you can obtain
+cude toolkit from Anaconda as well: `conda install cudatoolkit`)
+- full : include all optional dependecies (requires cupy, `pip install cupy-cudaX`,
+replace X with your [cuda version](https://docs.cupy.dev/en/stable/install.html))

 ### Note on PCM Modelling

-If you plan to optionally use QSPRPred to calculate protein descriptors for PCM, make sure to also install Clustal Omega. You can get it via `conda` (**for Linux and MacOS only**):
+If you plan to optionally use QSPRPred to calculate protein descriptors for PCM, make
+sure to also install Clustal Omega. You can get it via `conda` (**for Linux and MacOS
+only**):

 ```bash

 conda install -c bioconda clustalo
 ```
+
 or install MAFFT instead:

 ```bash
 conda install -c biocore mafft
 ```
-This is needed to provide multiple sequence alignments for the PCM descriptors. If Windows is your platform of choice, these tools will need to be installed manually or a custom implementation of the `MSAProvider` class will have to be made.
+
+This is needed to provide multiple sequence alignments for the PCM descriptors. If
+Windows is your platform of choice, these tools will need to be installed manually or a
+custom implementation of the `MSAProvider` class will have to be made.

 ## Use
-After installation, you will have access to various command line features and you can use the Python API directly (see [Documentation](https://cddleiden.github.io/QSPRPred/docs/)). For a quick start, you can also check out the [Jupyter notebook tutorials](./tutorials/README.md), which document the use of the Python API to build different types of models. The tutorials as well as the [documentation](https://cddleiden.github.io/QSPRPred/docs/use.html) are still work in progress, and we will be happy for any contributions where it is still lacking.
+
+After installation, you will have access to various command line features and you can
+use the Python API directly (
+see [Documentation](https://cddleiden.github.io/QSPRPred/docs/)). For a quick start, you
+can also check out the [Jupyter notebook tutorials](./tutorials/README.md), which
+document the use of the Python API to build different types of models. The tutorials as
+well as the documentation are still work in progress, and we will be happy for any
+contributions where it is still lacking.
+
+To use the commandline to train the same QSAR model as in the tutorial use (run from
+tutorial folder):
+
+```bash
+python -m qsprpred.data_CLI -i ./data/parkinsons_pivot.tsv -o qspr/data -pr GABAAalpha -pr NMDA -r true -sp random -sf 0.15 -fe Morgan
+python -m qsprpred.model_CLI -dp ./qspr/data/GABAAalpha_REGRESSION_df.pkl -o ./qspr/models -m PLS -o bayes -nt 5 -me -s
+```

 Workflow
 ========
 ![image](figures/QSPRpred_workflow.png)

 Current Development Team
 ========================
+
 - [H. van den Maagdenberg](https://github.com/HellevdM)
 - [M. Sicho](https://github.com/martin-sicho)
 - [L. Schoenmaker](https://github.com/LindeSchoenmaker)

qsprpred/data/chem/scaffolds.py

Lines changed: 50 additions & 56 deletions
@@ -2,7 +2,7 @@

 import pandas as pd
 from rdkit import Chem
-from rdkit.Chem import Mol
+from rdkit.Chem import Mol, ReplaceSubstructs
 from rdkit.Chem.Scaffolds import MurckoScaffold

 from qsprpred.data.processing.mol_processor import MolProcessorWithID
@@ -38,7 +38,8 @@ class Murcko(Scaffold):

     def __call__(self, mols, props, *args, **kwargs):
         """
-        Calculate the Murcko scaffold for a molecule.
+        Calculate the Murcko scaffold for a molecule as implemented
+        in RDKit.

         Args:
             mol: SMILES as `str` or an instance of `Mol`
@@ -58,49 +59,55 @@ def __str__(self):


 class BemisMurcko(Scaffold):
-    """
-    Reimplementation of Bemis-Murcko scaffolds based on a function described in the
-    discussion here: https://sourceforge.net/p/rdkit/mailman/message/37269507/
+    """Extension of rdkit's BM-like scaffold to make it more true to the paper.
+    In BM's paper, exo bonds on linkers or on rings get cutoff but two
+    electrons remain.
+
+    In the rdkit implementation, both atoms in the exo bond get included.
+    This means for BM C1CCC1=N and C1CCC1=O are the same, for rdkit they are
+    different.
+
+    When flattening the BM scaffold using MakeScaffoldGeneric() this leads to
+    distinct scaffolds, as C1CCC1=O is flattened to C1CCC1C and not C1CCC1.

-    This implementation allows more flexibility in terms of what substructures should be
-    kept, converted, and how.
+    In this approach, the two electrons are represented as SMILES "=*". This
+    is to make sure the automatic oxidation state assignment of sulfur does
+    not flatten C1CS1(=*)(=*) into C1CS1 when explicit hydrogen count is
+    provided.

-    Credit: Francois Berenger
-    Ref.: Bemis, G. W., & Murcko, M. A. (1996). "The properties of known drugs. 1.
+    Ref.:
+
+    Bemis, G. W., & Murcko, M. A. (1996). "The properties of known drugs. 1.
     Molecular frameworks." Journal of medicinal chemistry, 39(15), 2887-2893.

+    Related RDKit issue: https://github.com/rdkit/rdkit/discussions/6844
+
+    Credit: Original code provided by Wim Dehaen (@dehaenw)
     """

     def __init__(
         self,
-        convert_hetero=True,
-        force_single_bonds=True,
-        remove_terminal_atoms=True,
-        id_prop=None,
+        real_bemismurcko: bool = True,
+        use_csk: bool = False,
+        id_prop: bool | None = None,
     ):
         """
         Initialize the scaffold generator.

         Args:
-            convert_hetero (bool):
-                Convert hetero atoms to carbons.
-            force_single_bonds (bool):
-                Convert all scaffold's bonds to single ones.
-            remove_terminal_atoms (bool):
-                Remove all terminal atoms, keep only
-                ring linkers.
-            id_prop (str):
-                Name of the property that contains the molecule's unique identifier.
+            real_bemismurcko (bool): Use guidelines from Bemis murcko paper.
+                otherwise, use native rdkit implementation.
+            use_csk (bool): Make scaffold generic (convert all bonds to single
+                and all atoms to carbon). If real_bemismurcko is on, also
+                remove all flattened exo bonds.
         """
         super().__init__(id_prop=id_prop)
-        self.convertHetero = convert_hetero
-        self.forceSingleBonds = force_single_bonds
-        self.removeTerminalAtoms = remove_terminal_atoms
+        self.realBemisMurcko = real_bemismurcko
+        self.useCSK = use_csk

     @staticmethod
     def findTerminalAtoms(mol):
         res = []
-
         for a in mol.GetAtoms():
             if len(a.GetBonds()) == 1:
                 res.append(a)
@@ -119,36 +126,23 @@ def __call__(self, mols, props, *args, **kwargs):
         res = []
         for mol in mols:
             mol = Chem.MolFromSmiles(mol) if isinstance(mol, str) else mol
-            only_HA = Chem.rdmolops.RemoveHs(mol)
-            rw_mol = Chem.RWMol(only_HA)
-
-            # switch all HA to Carbon
-            if self.convertHetero:
-                for i in range(rw_mol.GetNumAtoms()):
-                    rw_mol.ReplaceAtom(i, Chem.Atom(6))
-
-            # switch all non single bonds to single
-            if self.forceSingleBonds:
-                non_single_bonds = []
-                for b in rw_mol.GetBonds():
-                    if b.GetBondType() != Chem.BondType.SINGLE:
-                        non_single_bonds.append(b)
-                for b in non_single_bonds:
-                    j = b.GetBeginAtomIdx()
-                    k = b.GetEndAtomIdx()
-                    rw_mol.RemoveBond(j, k)
-                    rw_mol.AddBond(j, k, Chem.BondType.SINGLE)
-
-            # as long as there are terminal atoms, remove them
-            if self.removeTerminalAtoms:
-                terminal_atoms = self.findTerminalAtoms(rw_mol)
-                while terminal_atoms:
-                    for a in terminal_atoms:
-                        for b in a.GetBonds():
-                            rw_mol.RemoveBond(b.GetBeginAtomIdx(), b.GetEndAtomIdx())
-                        rw_mol.RemoveAtom(a.GetIdx())
-                    terminal_atoms = self.findTerminalAtoms(rw_mol)
-            res.append(Chem.MolToSmiles(rw_mol.GetMol()))
+            Chem.RemoveStereochemistry(mol)  # important for canonization !
+            scaff = MurckoScaffold.GetScaffoldForMol(mol)
+
+            if self.realBemisMurcko:
+                scaff = ReplaceSubstructs(
+                    scaff,
+                    Chem.MolFromSmarts("[$([D1]=[*])]"),
+                    Chem.MolFromSmarts("[*]"),
+                    replaceAll=True,
+                )[0]
+
+            if self.useCSK:
+                scaff = MurckoScaffold.MakeScaffoldGeneric(scaff)
+                if self.realBemisMurcko:
+                    scaff = MurckoScaffold.GetScaffoldForMol(scaff)
+            Chem.SanitizeMol(scaff)
+            res.append(Chem.MolToSmiles(scaff))
         return pd.Series(res, index=props[self.idProp])

     def __str__(self):
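
The behaviour described in the new `BemisMurcko` docstring can be tried outside QSPRpred with plain RDKit calls. The following is a minimal sketch, not the package's code: `bm_scaffold_smiles` is a hypothetical helper name, it mirrors the steps of the added `__call__` body for a single SMILES string, and it assumes RDKit is installed (the import sits inside the function so the module itself loads without it).

```python
def bm_scaffold_smiles(smiles, real_bemismurcko=True, use_csk=False):
    """Hypothetical helper mirroring the steps of the new BemisMurcko.__call__."""
    # Imported lazily because RDKit is a third-party dependency.
    from rdkit import Chem
    from rdkit.Chem import ReplaceSubstructs
    from rdkit.Chem.Scaffolds import MurckoScaffold

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")

    # Drop stereo information before computing the scaffold.
    Chem.RemoveStereochemistry(mol)
    scaff = MurckoScaffold.GetScaffoldForMol(mol)

    if real_bemismurcko:
        # Replace terminal atoms attached through a double bond (exo bonds)
        # with a dummy atom "*", so that =O and =N become indistinguishable.
        scaff = ReplaceSubstructs(
            scaff,
            Chem.MolFromSmarts("[$([D1]=[*])]"),
            Chem.MolFromSmarts("[*]"),
            replaceAll=True,
        )[0]

    if use_csk:
        # Generic scaffold: all atoms to carbon, all bonds to single.
        scaff = MurckoScaffold.MakeScaffoldGeneric(scaff)
        if real_bemismurcko:
            # A second pass strips the flattened exo atoms again.
            scaff = MurckoScaffold.GetScaffoldForMol(scaff)

    Chem.SanitizeMol(scaff)
    return Chem.MolToSmiles(scaff)
```

With the default flags, `bm_scaffold_smiles("O=C1CCC1")` and `bm_scaffold_smiles("N=C1CCC1")` should return the same string, which is the case the docstring describes; with `real_bemismurcko=False` the two scaffolds stay distinct.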

qsprpred/data/chem/tests.py

Lines changed: 3 additions & 0 deletions
@@ -21,6 +21,9 @@ def setUp(self):
         [
             ("Murcko", Murcko()),
             ("BemisMurcko", BemisMurcko()),
+            ("BemisMurckoCSK", BemisMurcko(True, True)),
+            ("BemisMurckoJustCSK", BemisMurcko(False, True)),
+            ("BemisMurckoOff", BemisMurcko(False, False)),
         ]
     )
     def testScaffoldAdd(self, _, scaffold):
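
The three added cases pass the flags positionally, so it can help to spell out which `(real_bemismurcko, use_csk)` pair each name stands for. The sketch below is only an illustration using the standard library: the `CASES` mapping and the `uncovered_combinations` helper are hypothetical and not part of the test suite, and the pair for `BemisMurcko()` is taken from the defaults in the new `__init__` signature.

```python
from itertools import product

# Named cases from the parameterized list, mapped to
# (real_bemismurcko, use_csk). BemisMurcko() uses the defaults (True, False).
CASES = {
    "BemisMurcko": (True, False),
    "BemisMurckoCSK": (True, True),
    "BemisMurckoJustCSK": (False, True),
    "BemisMurckoOff": (False, False),
}


def uncovered_combinations(cases):
    """Return the flag combinations that no named case exercises."""
    all_combos = set(product([True, False], repeat=2))
    return sorted(all_combos - set(cases.values()))


print(uncovered_combinations(CASES))  # prints []
```

Together with the pre-existing default case, the added entries cover all four flag combinations.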

qsprpred/data/sampling/tests.py

Lines changed: 5 additions & 1 deletion
@@ -110,7 +110,11 @@ def testTemporalSplit(self, multitask):
     @parameterized.expand(
         [
             (False, Murcko(), None),
-            (False, BemisMurcko(), ["ScaffoldSplit_000", "ScaffoldSplit_001"]),
+            (
+                False,
+                BemisMurcko(use_csk=True),
+                ["ScaffoldSplit_000", "ScaffoldSplit_001"],
+            ),
             (True, Murcko(), None),
         ]
     )

qsprpred/extra/data/utils/testing/path_mixins.py

Lines changed: 12 additions & 2 deletions
@@ -1,4 +1,5 @@
 import os
+import platform
 import tempfile
 from typing import Callable

@@ -27,6 +28,7 @@
 )
 from qsprpred.extra.data.tables.pcm import PCMDataSet
 from qsprpred.extra.data.utils.msa_calculator import ClustalMSA
+from qsprpred.logs import logger
 from qsprpred.utils.testing.path_mixins import DataSetsPathMixIn


@@ -44,9 +46,8 @@ def getAllDescriptors(cls) -> list[DescriptorSet]:
         Returns:
             list: list of `MoleculeDescriptorSet` objects
         """
-        return [
+        ret = [
             Mordred(),
-            Mold2(),
             CDKFP(size=2048, search_depth=7),
             CDKExtendedFP(),
             CDKEStateFP(),
@@ -62,6 +63,15 @@ def getAllDescriptors(cls) -> list[DescriptorSet]:
             PaDEL(),
             ExtendedValenceSignature(1),
         ]
+        if platform.system() != "Darwin":
+            ret.append(Mold2())
+        else:
+            # not supported on macOS
+            logger.warning(
+                "Mold2 is not supported on macOS. "
+                "Skipping Mold2 descriptor set in tests."
+            )
+        return ret

     @classmethod
     def getAllProteinDescriptors(cls) -> list[ProteinDescriptorSet]:
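
The platform guard added to `getAllDescriptors` can be shown in isolation. This is a minimal standard-library sketch under stated assumptions: `select_descriptor_names` is a hypothetical helper, it returns plain strings instead of descriptor objects, the name list is an illustrative subset, and the optional `system` argument exists only so the branch can be exercised on any machine.

```python
import logging
import platform

logger = logging.getLogger(__name__)


def select_descriptor_names(system=None):
    """Return descriptor set names, skipping Mold2 on macOS (hypothetical helper)."""
    # platform.system() reports "Darwin" on macOS.
    system = platform.system() if system is None else system
    names = ["Mordred", "CDKFP", "PaDEL"]  # illustrative subset only
    if system != "Darwin":
        names.append("Mold2")
    else:
        logger.warning(
            "Mold2 is not supported on macOS. Skipping Mold2 descriptor set."
        )
    return names
```

As in the diff, the optional entry is appended after the fixed list is built, so on supported platforms Mold2 ends up last in the list instead of in its previous second position.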
