Integrate changes from contrib branch #6

Merged 6 commits on Jan 19, 2024
76 changes: 62 additions & 14 deletions README.md
@@ -6,11 +6,30 @@ QSPRpred
<img src='figures/QSPRpred_logo.jpg' width=10% align=right>
<p align=left width=70%>

QSPRpred is an open-source software library for building **Quantitative Structure Property Relationship (QSPR)** models, developed by Gerard van Westen's Computational Drug Discovery group. It provides a unified interface for building QSPR models based on different types of descriptors and machine learning algorithms. We developed this package to support our research, recognizing the necessity to reduce repetition in our model building workflow and improve the reproducibility and reusability of our models. In making this package available here, we hope that it may be of use to other researchers as well. QSPRpred is still in active development, and we welcome contributions and feedback from the community.

QSPRpred is designed to be modular and extensible, so that new functionality can be easily added. A command line interface is available for basic use cases to quickly explore varying scenarios. For more advanced use cases, the Python API offers extra flexibility and control, allowing more complex workflows and additional features.

Internally, QSPRpred relies heavily on the <a href="https://www.rdkit.org">RDKit</a> and <a href="https://scikit-learn.org/stable/">scikit-learn</a> libraries. Furthermore, for scikit-learn model saving and loading, QSPRpred uses <a href="https://github.com/OlivierBeq/ml2json">ml2json</a> for safer and interpretable model serialization. QSPRpred is also interoperable with <a href="https://github.com/OlivierBeq/Papyrus-scripts">Papyrus</a>, a large scale curated dataset aimed at bioactivity predictions, for data collection. Models developed with QSPRpred are compatible with the group's *de novo* drug design package <a href="https://github.com/CDDLeiden/DrugEx/">DrugEx</a>.
QSPRpred is an open-source software library for building **Quantitative Structure
Property Relationship (QSPR)** models, developed by Gerard van Westen's Computational Drug
Discovery group. It provides a unified interface for building QSPR models based on
different types of descriptors and machine learning algorithms. We developed this
package to support our research, recognizing the necessity to reduce repetition in our
model building workflow and improve the reproducibility and reusability of our models.
In making this package available here, we hope that it may be of use to other
researchers as well. QSPRpred is still in active development, and we welcome
contributions and feedback from the community.

QSPRpred is designed to be modular and extensible, so that new functionality can be
easily added. A command line interface is available for basic use cases to quickly
explore varying scenarios. For more advanced use cases, the Python API offers extra
flexibility and control, allowing more complex workflows and additional features.

Internally, QSPRpred relies heavily on the <a href="https://www.rdkit.org">RDKit</a>
and <a href="https://scikit-learn.org/stable/">scikit-learn</a> libraries. Furthermore,
for scikit-learn model saving and loading, QSPRpred
uses <a href="https://github.com/OlivierBeq/ml2json">ml2json</a> for safer and
interpretable model serialization. QSPRpred is also interoperable
with <a href="https://github.com/OlivierBeq/Papyrus-scripts">Papyrus</a>, a large scale
curated dataset aimed at bioactivity predictions, for data collection. Models developed
with QSPRpred are compatible with the group's *de novo* drug design
package <a href="https://github.com/CDDLeiden/DrugEx/">DrugEx</a>.


Quick Start
@@ -21,45 +40,74 @@ Quick Start
QSPRpred can be installed with pip like so (with python >= 3.10):

```bash
pip install git+https://github.com/CDDLeiden/QSPRPred.git@main
pip install git+https://github.com/CDDLeiden/QSPRpred.git@main
```

Note that this will install the basic dependencies, but not the optional dependencies. If you want to use the optional dependencies, you can install the package with an option:
Note that this will install the basic dependencies, but not the optional dependencies.
If you want to use the optional dependencies, you can install the package with an
option:

```bash
pip install git+https://github.com/CDDLeiden/QSPRPred.git@main#egg=qsprpred[<option>]
pip install git+https://github.com/CDDLeiden/QSPRpred.git@main#egg=qsprpred[<option>]
```

The following options are available:
- extra : include extra dependencies for PCM models and extra descriptor sets from packages other than RDKit

- extra : include extra dependencies for PCM models and extra descriptor sets from
packages other than RDKit
- deep : include deep learning models (torch and chemprop)
- pyboost : include pyboost model (requires cupy, `pip install cupy-cudaX`, replace X with your [cuda version](https://docs.cupy.dev/en/stable/install.html), you can obtain the CUDA toolkit from Anaconda as well: `conda install cudatoolkit`)
- full : include all optional dependencies (requires cupy, `pip install cupy-cudaX`, replace X with your [cuda version](https://docs.cupy.dev/en/stable/install.html))
- pyboost : include pyboost model (requires cupy, `pip install cupy-cudaX`, replace X
with your [cuda version](https://docs.cupy.dev/en/stable/install.html), you can obtain
the CUDA toolkit from Anaconda as well: `conda install cudatoolkit`)
- full : include all optional dependencies (requires cupy, `pip install cupy-cudaX`,
replace X with your [cuda version](https://docs.cupy.dev/en/stable/install.html))

### Note on PCM Modelling

If you plan to optionally use QSPRPred to calculate protein descriptors for PCM, make sure to also install Clustal Omega. You can get it via `conda` (**for Linux and MacOS only**):
If you plan to optionally use QSPRPred to calculate protein descriptors for PCM, make
sure to also install Clustal Omega. You can get it via `conda` (**for Linux and MacOS
only**):

```bash

conda install -c bioconda clustalo
```

or install MAFFT instead:

```bash
conda install -c biocore mafft
```
This is needed to provide multiple sequence alignments for the PCM descriptors. If Windows is your platform of choice, these tools will need to be installed manually or a custom implementation of the `MSAProvider` class will have to be made.

This is needed to provide multiple sequence alignments for the PCM descriptors. If
Windows is your platform of choice, these tools will need to be installed manually or a
custom implementation of the `MSAProvider` class will have to be made.

## Use
After installation, you will have access to various command line features and you can use the Python API directly (see [Documentation](https://cddleiden.github.io/QSPRPred/docs/)). For a quick start, you can also check out the [Jupyter notebook tutorials](./tutorials/README.md), which document the use of the Python API to build different types of models. The tutorials as well as the [documentation](https://cddleiden.github.io/QSPRPred/docs/use.html) are still a work in progress, and we welcome contributions wherever they are lacking.

After installation, you will have access to various command line features and you can
use the Python API directly (
see [Documentation](https://cddleiden.github.io/QSPRPred/docs/)). For a quick start, you
can also check out the [Jupyter notebook tutorials](./tutorials/README.md), which
document the use of the Python API to build different types of models. The tutorials as
well as the documentation are still a work in progress, and we welcome contributions
wherever they are lacking.

To train the same QSAR model as in the tutorial from the command line, run (from the
tutorial folder):

```bash
python -m qsprpred.data_CLI -i ./data/parkinsons_pivot.tsv -o qspr/data -pr GABAAalpha -pr NMDA -r true -sp random -sf 0.15 -fe Morgan
python -m qsprpred.model_CLI -dp ./qspr/data/GABAAalpha_REGRESSION_df.pkl -o ./qspr/models -m PLS -o bayes -nt 5 -me -s
```

Workflow
========
![image](figures/QSPRpred_workflow.png)

Current Development Team
========================

- [H. van den Maagdenberg](https://github.com/HellevdM)
- [M. Sicho](https://github.com/martin-sicho)
- [L. Schoenmaker](https://github.com/LindeSchoenmaker)
106 changes: 50 additions & 56 deletions qsprpred/data/chem/scaffolds.py
@@ -2,7 +2,7 @@

import pandas as pd
from rdkit import Chem
from rdkit.Chem import Mol
from rdkit.Chem import Mol, ReplaceSubstructs
from rdkit.Chem.Scaffolds import MurckoScaffold

from qsprpred.data.processing.mol_processor import MolProcessorWithID
@@ -38,7 +38,8 @@ class Murcko(Scaffold):

def __call__(self, mols, props, *args, **kwargs):
"""
Calculate the Murcko scaffold for a molecule.
Calculate the Murcko scaffold for a molecule as implemented
in RDKit.

Args:
mol: SMILES as `str` or an instance of `Mol`
@@ -58,49 +59,55 @@ def __str__(self):


class BemisMurcko(Scaffold):
"""
Reimplementation of Bemis-Murcko scaffolds based on a function described in the
discussion here: https://sourceforge.net/p/rdkit/mailman/message/37269507/
"""Extension of rdkit's BM-like scaffold to make it more true to the paper.
In BM's paper, exo bonds on linkers or on rings get cut off, but two
electrons remain.

In the rdkit implementation, both atoms in the exo bond get included.
This means for BM C1CCC1=N and C1CCC1=O are the same, for rdkit they are
different.

When flattening the BM scaffold using MakeScaffoldGeneric() this leads to
distinct scaffolds, as C1CCC1=O is flattened to C1CCC1C and not C1CCC1.

This implementation allows more flexibility in terms of what substructures should be
kept, converted, and how.
In this approach, the two electrons are represented as SMILES "=*". This
is to make sure the automatic oxidation state assignment of sulfur does
not flatten C1CS1(=*)(=*) into C1CS1 when explicit hydrogen count is
provided.

Credit: Francois Berenger
Ref.: Bemis, G. W., & Murcko, M. A. (1996). "The properties of known drugs. 1.
Ref.:

Bemis, G. W., & Murcko, M. A. (1996). "The properties of known drugs. 1.
Molecular frameworks." Journal of medicinal chemistry, 39(15), 2887-2893.

Related RDKit issue: https://github.com/rdkit/rdkit/discussions/6844

Credit: Original code provided by Wim Dehaen (@dehaenw)
"""

def __init__(
self,
convert_hetero=True,
force_single_bonds=True,
remove_terminal_atoms=True,
id_prop=None,
real_bemismurcko: bool = True,
use_csk: bool = False,
id_prop: str | None = None,
):
"""
Initialize the scaffold generator.

Args:
convert_hetero (bool):
Convert hetero atoms to carbons.
force_single_bonds (bool):
Convert all scaffold's bonds to single ones.
remove_terminal_atoms (bool):
Remove all terminal atoms, keep only
ring linkers.
id_prop (str):
Name of the property that contains the molecule's unique identifier.
real_bemismurcko (bool): Follow the guidelines of the Bemis-Murcko paper;
otherwise, use the native RDKit implementation.
use_csk (bool): Make scaffold generic (convert all bonds to single
and all atoms to carbon). If real_bemismurcko is on, also
remove all flattened exo bonds.
"""
super().__init__(id_prop=id_prop)
self.convertHetero = convert_hetero
self.forceSingleBonds = force_single_bonds
self.removeTerminalAtoms = remove_terminal_atoms
self.realBemisMurcko = real_bemismurcko
self.useCSK = use_csk

@staticmethod
def findTerminalAtoms(mol):
res = []

for a in mol.GetAtoms():
if len(a.GetBonds()) == 1:
res.append(a)
@@ -119,36 +126,23 @@ def __call__(self, mols, props, *args, **kwargs):
res = []
for mol in mols:
mol = Chem.MolFromSmiles(mol) if isinstance(mol, str) else mol
only_HA = Chem.rdmolops.RemoveHs(mol)
rw_mol = Chem.RWMol(only_HA)

# switch all HA to Carbon
if self.convertHetero:
for i in range(rw_mol.GetNumAtoms()):
rw_mol.ReplaceAtom(i, Chem.Atom(6))

# switch all non single bonds to single
if self.forceSingleBonds:
non_single_bonds = []
for b in rw_mol.GetBonds():
if b.GetBondType() != Chem.BondType.SINGLE:
non_single_bonds.append(b)
for b in non_single_bonds:
j = b.GetBeginAtomIdx()
k = b.GetEndAtomIdx()
rw_mol.RemoveBond(j, k)
rw_mol.AddBond(j, k, Chem.BondType.SINGLE)

# as long as there are terminal atoms, remove them
if self.removeTerminalAtoms:
terminal_atoms = self.findTerminalAtoms(rw_mol)
while terminal_atoms:
for a in terminal_atoms:
for b in a.GetBonds():
rw_mol.RemoveBond(b.GetBeginAtomIdx(), b.GetEndAtomIdx())
rw_mol.RemoveAtom(a.GetIdx())
terminal_atoms = self.findTerminalAtoms(rw_mol)
res.append(Chem.MolToSmiles(rw_mol.GetMol()))
Chem.RemoveStereochemistry(mol) # important for canonization !
scaff = MurckoScaffold.GetScaffoldForMol(mol)

if self.realBemisMurcko:
scaff = ReplaceSubstructs(
scaff,
Chem.MolFromSmarts("[$([D1]=[*])]"),
Chem.MolFromSmarts("[*]"),
replaceAll=True,
)[0]

if self.useCSK:
scaff = MurckoScaffold.MakeScaffoldGeneric(scaff)
if self.realBemisMurcko:
scaff = MurckoScaffold.GetScaffoldForMol(scaff)
Chem.SanitizeMol(scaff)
res.append(Chem.MolToSmiles(scaff))
return pd.Series(res, index=props[self.idProp])

def __str__(self):
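
For context on the removed `remove_terminal_atoms` branch: it strips terminal atoms repeatedly until only rings and the linkers between them remain. The following is a minimal pure-Python sketch of that pruning loop, not the QSPRpred or RDKit code. The adjacency-dict representation and the name `prune_terminal_atoms` are assumptions made for illustration, and unlike the removed code the sketch also drops atoms that end up with no bonds at all.

```python
def prune_terminal_atoms(adjacency: dict[int, set[int]]) -> dict[int, set[int]]:
    """Repeatedly remove atoms with at most one neighbour.

    `adjacency` maps an atom index to the set of indices it is bonded to
    (hypothetical representation; bond orders and elements are ignored).
    """
    # Work on a copy so the caller's graph is left untouched.
    graph = {atom: set(neighbours) for atom, neighbours in adjacency.items()}
    while True:
        terminal = [atom for atom, nbrs in graph.items() if len(nbrs) <= 1]
        if not terminal:
            return graph
        for atom in terminal:
            # Detach the atom from any neighbour that is still present.
            for nbr in graph.pop(atom):
                if nbr in graph:
                    graph[nbr].discard(atom)


# A three-membered ring (0, 1, 2) carrying a two-atom side chain (3, 4).
ring_with_chain = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0, 4}, 4: {3}}
core = prune_terminal_atoms(ring_with_chain)
print(sorted(core))
```

Removing one terminal atom can expose a new one (atom 3 only becomes terminal after atom 4 is gone), which is why the loop runs until no terminal atoms are left. A purely acyclic graph therefore prunes down to nothing.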
3 changes: 3 additions & 0 deletions qsprpred/data/chem/tests.py
@@ -21,6 +21,9 @@ def setUp(self):
[
("Murcko", Murcko()),
("BemisMurcko", BemisMurcko()),
("BemisMurckoCSK", BemisMurcko(True, True)),
("BemisMurckoJustCSK", BemisMurcko(False, True)),
("BemisMurckoOff", BemisMurcko(False, False)),
]
)
def testScaffoldAdd(self, _, scaffold):
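
The new test cases pass the two flags positionally, so it may help to spell out which combinations they cover. The sketch below uses a hypothetical stand-in class, `ScaffoldFlags`, that only mirrors the order and defaults of the first two `BemisMurcko` arguments shown in this PR (`real_bemismurcko=True`, `use_csk=False`); it does not import QSPRpred.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ScaffoldFlags:
    """Hypothetical stand-in mirroring the first two BemisMurcko arguments."""

    real_bemismurcko: bool = True
    use_csk: bool = False


# The parameterized cases from the hunk above, written the same way.
cases = {
    "BemisMurcko": ScaffoldFlags(),
    "BemisMurckoCSK": ScaffoldFlags(True, True),
    "BemisMurckoJustCSK": ScaffoldFlags(False, True),
    "BemisMurckoOff": ScaffoldFlags(False, False),
}

# Compare the covered flag pairs against every possible pair.
covered = {(c.real_bemismurcko, c.use_csk) for c in cases.values()}
missing = set(product([True, False], repeat=2)) - covered
print(missing)
```

Under those assumptions `missing` is empty: together with the existing default case, the three added cases cover all four flag combinations.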
6 changes: 5 additions & 1 deletion qsprpred/data/sampling/tests.py
@@ -110,7 +110,11 @@ def testTemporalSplit(self, multitask):
@parameterized.expand(
[
(False, Murcko(), None),
(False, BemisMurcko(), ["ScaffoldSplit_000", "ScaffoldSplit_001"]),
(
False,
BemisMurcko(use_csk=True),
["ScaffoldSplit_000", "ScaffoldSplit_001"],
),
(True, Murcko(), None),
]
)
14 changes: 12 additions & 2 deletions qsprpred/extra/data/utils/testing/path_mixins.py
@@ -1,4 +1,5 @@
import os
import platform
import tempfile
from typing import Callable

@@ -27,6 +28,7 @@
)
from qsprpred.extra.data.tables.pcm import PCMDataSet
from qsprpred.extra.data.utils.msa_calculator import ClustalMSA
from qsprpred.logs import logger
from qsprpred.utils.testing.path_mixins import DataSetsPathMixIn


@@ -44,9 +46,8 @@ def getAllDescriptors(cls) -> list[DescriptorSet]:
Returns:
list: list of `MoleculeDescriptorSet` objects
"""
return [
ret = [
Mordred(),
Mold2(),
CDKFP(size=2048, search_depth=7),
CDKExtendedFP(),
CDKEStateFP(),
@@ -62,6 +63,15 @@ def getAllDescriptors(cls) -> list[DescriptorSet]:
PaDEL(),
ExtendedValenceSignature(1),
]
if platform.system() != "Darwin":
ret.append(Mold2())
else:
# not supported on macOS
logger.warning(
"Mold2 is not supported on macOS. "
"Skipping Mold2 descriptor set in tests."
)
return ret

@classmethod
def getAllProteinDescriptors(cls) -> list[ProteinDescriptorSet]:
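
The platform guard added to `getAllDescriptors` can be exercised in isolation. The sketch below is a standard-library-only illustration of the same pattern, assuming a shortened list of plain strings in place of the real descriptor set objects; the function name `descriptor_names` and its `system` parameter are hypothetical and exist only so that both branches can be reached on any machine.

```python
import logging
import platform
from typing import Optional

logger = logging.getLogger(__name__)


def descriptor_names(system: Optional[str] = None) -> list[str]:
    """Return descriptor set names, skipping Mold2 on macOS."""
    # Fall back to the running platform when no name is injected.
    system = platform.system() if system is None else system
    names = ["Mordred", "PaDEL"]  # shortened list, for illustration only
    if system != "Darwin":
        names.append("Mold2")
    else:
        # platform.system() reports macOS as "Darwin".
        logger.warning(
            "Mold2 is not supported on macOS. "
            "Skipping Mold2 descriptor set in tests."
        )
    return names


print(descriptor_names("Linux"))
print(descriptor_names("Darwin"))
```

As in the PR, the conditional entry is appended after the fixed ones, so `Mold2` moves from its earlier position to the end of the list on platforms where it is included.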