CDDLeiden · martin-sicho · Jan 19, 2024 · Oct 19, 2023 · Oct 27, 2023 · Nov 22, 2023
diff --git a/README.md b/README.md
@@ -6,11 +6,30 @@ QSPRpred
 <img src='figures/QSPRpred_logo.jpg' width=10% align=right>
 <p align=left width=70%>
 
-QSPRpred is open-source software libary for building **Quantitative Structure Property Relationship (QSPR)** model developed by Gerard van Westen's Computational Drug Discovery group. It provides a unified interface for building QSPR models based on different types of descriptors and machine learning algorithms. We developed this package to support our research, recognizing the necessity to reduce repetition in our model building workflow and improve the reproducibility and reusability of our models. In making this package available here, we hope that it may be of use to other researchers as well. QSPRpred is still in active development, and we welcome contributions and feedback from the community.
-
-QSPRpred is designed to be modular and extensible, so that new functionality can be easily added. A command line interface is available for basic use cases to quickly, explore varying scenarios. For more advanced use cases, the Python API offers extra flexibility and control, allowing more complex workflows and additional features. 
-
-Internally, QSPRpred relies heavily on the <a href="https://www.rdkit.org">RDKit</a> and <a href="https://scikit-learn.org/stable/">scikit-learn</a> libraries. Furthermore, for scikit-learn model saving and loading, QSPRpred uses <a href="https://github.com/OlivierBeq/ml2json">ml2json</a> for safer and interpretable model serialization. QSPRpred is also interoperable with <a href="https://github.com/OlivierBeq/Papyrus-scripts">Papyrus</a>, a large scale curated dataset aimed at bioactivity predictions, for data collection. Models developed with QSPRpred are compatible with the group's *de novo* drug design package <a href="https://github.com/CDDLeiden/DrugEx/">DrugEx</a>.
+QSPRpred is an open-source software library for building **Quantitative Structure Property
+Relationship (QSPR)** models, developed by Gerard van Westen's Computational Drug
+Discovery group. It provides a unified interface for building QSPR models based on
+different types of descriptors and machine learning algorithms. We developed this
+package to support our research, recognizing the necessity to reduce repetition in our
+model building workflow and improve the reproducibility and reusability of our models.
+In making this package available here, we hope that it may be of use to other
+researchers as well. QSPRpred is still in active development, and we welcome
+contributions and feedback from the community.
+
+QSPRpred is designed to be modular and extensible, so that new functionality can be
+easily added. A command line interface is available for basic use cases to quickly
+explore varying scenarios. For more advanced use cases, the Python API offers extra
+flexibility and control, allowing more complex workflows and additional features.
+
+Internally, QSPRpred relies heavily on the <a href="https://www.rdkit.org">RDKit</a>
+and <a href="https://scikit-learn.org/stable/">scikit-learn</a> libraries. Furthermore,
+for scikit-learn model saving and loading, QSPRpred
+uses <a href="https://github.com/OlivierBeq/ml2json">ml2json</a> for safer and
+interpretable model serialization. QSPRpred is also interoperable
+with <a href="https://github.com/OlivierBeq/Papyrus-scripts">Papyrus</a>, a large scale
+curated dataset aimed at bioactivity predictions, for data collection. Models developed
+with QSPRpred are compatible with the group's *de novo* drug design
+package <a href="https://github.com/CDDLeiden/DrugEx/">DrugEx</a>.
 
 
 Quick Start
@@ -21,45 +40,74 @@ Quick Start
 QSPRpred can be installed with pip like so (with python >= 3.10):
 
 ```bash
-pip install git+https://github.com/CDDLeiden/QSPRPred.git@main
+pip install git+https://github.com/CDDLeiden/QSPRpred.git@main
 ```
 
-Note that this will install the basic dependencies, but not the optional dependencies. If you want to use the optional dependencies, you can install the package with an option:
+Note that this will install the basic dependencies, but not the optional dependencies.
+If you want to use the optional dependencies, you can install the package with an
+option:
 
 ```bash
-pip install git+https://github.com/CDDLeiden/QSPRPred.git@main#egg=qsprpred[<option>]
+pip install git+https://github.com/CDDLeiden/QSPRpred.git@main#egg=qsprpred[<option>]
 ```
 
 The following options are available:
-- extra : include extra dependencies for PCM models and extra descriptor sets from packages other than RDKit
+
+- extra : include extra dependencies for PCM models and extra descriptor sets from
+  packages other than RDKit
 - deep : include deep learning models (torch and chemprop)
-- pyboost : include pyboost model (requires cupy, `pip install cupy-cudaX`, replace X with your [cuda version](https://docs.cupy.dev/en/stable/install.html), you can obtain cude toolkit from Anaconda as well: `conda install cudatoolkit`)
-- full : include all optional dependecies (requires cupy, `pip install cupy-cudaX`, replace X with your [cuda version](https://docs.cupy.dev/en/stable/install.html))
+- pyboost : include pyboost model (requires cupy, `pip install cupy-cudaX`, replace X
+  with your [CUDA version](https://docs.cupy.dev/en/stable/install.html); you can obtain
+  the CUDA toolkit from Anaconda as well: `conda install cudatoolkit`)
+- full : include all optional dependencies (requires cupy, `pip install cupy-cudaX`,
+  replace X with your [CUDA version](https://docs.cupy.dev/en/stable/install.html))
 
 ### Note on PCM Modelling
 
-If you plan to optionally use QSPRPred to calculate protein descriptors for PCM, make sure to also install Clustal Omega. You can get it via `conda` (**for Linux and MacOS only**):
+If you plan to optionally use QSPRpred to calculate protein descriptors for PCM, make
+sure to also install Clustal Omega. You can get it via `conda` (**for Linux and macOS
+only**):
 
 ```bash
 
 conda install -c bioconda clustalo
 ```
+
 or install MAFFT instead:
 
 ```bash
 conda install -c biocore mafft
 ```
-This is needed to provide multiple sequence alignments for the PCM descriptors. If Windows is your platform of choice, these tools will need to be installed manually or a custom implementation of the `MSAProvider` class will have to be made.
+
+This is needed to provide multiple sequence alignments for the PCM descriptors. If
+Windows is your platform of choice, these tools will need to be installed manually or a
+custom implementation of the `MSAProvider` class will have to be made.
 
 ## Use
-After installation, you will have access to various command line features and you can use the Python API directly (see [Documentation](https://cddleiden.github.io/QSPRPred/docs/)). For a quick start, you can also check out the [Jupyter notebook tutorials](./tutorials/README.md), which document the use of the Python API to build different types of models. The tutorials as well as the [documentation](https://cddleiden.github.io/QSPRPred/docs/use.html) are still work in progress, and we will be happy for any contributions where it is still lacking.
+
+After installation, you will have access to various command line features and you can
+use the Python API directly (
+see [Documentation](https://cddleiden.github.io/QSPRPred/docs/)). For a quick start, you
+can also check out the [Jupyter notebook tutorials](./tutorials/README.md), which
+document the use of the Python API to build different types of models. The tutorials as
+well as the documentation are still a work in progress, and we welcome any
+contributions where they are still lacking.
+
+To train the same QSAR model as in the tutorial from the command line, run the
+following from the tutorial folder:
+
+```bash
+python -m qsprpred.data_CLI -i ./data/parkinsons_pivot.tsv -o qspr/data -pr GABAAalpha -pr NMDA -r true -sp random -sf 0.15 -fe Morgan
+python -m qsprpred.model_CLI -dp ./qspr/data/GABAAalpha_REGRESSION_df.pkl -o ./qspr/models -m PLS -o bayes -nt 5 -me -s
+```
 
 Workflow
 ========
 ![image](figures/QSPRpred_workflow.png)
 
 Current Development Team
 ========================
+
 - [H. van den Maagdenberg](https://github.com/HellevdM)
 - [M. Sicho](https://github.com/martin-sicho)
 - [L. Schoenmaker](https://github.com/LindeSchoenmaker)

diff --git a/qsprpred/data/chem/scaffolds.py b/qsprpred/data/chem/scaffolds.py
@@ -2,7 +2,7 @@
 
 import pandas as pd
 from rdkit import Chem
-from rdkit.Chem import Mol
+from rdkit.Chem import Mol, ReplaceSubstructs
 from rdkit.Chem.Scaffolds import MurckoScaffold
 
 from qsprpred.data.processing.mol_processor import MolProcessorWithID
@@ -38,7 +38,8 @@ class Murcko(Scaffold):
 
     def __call__(self, mols, props, *args, **kwargs):
         """
-        Calculate the Murcko scaffold for a molecule.
+        Calculate the Murcko scaffold for a molecule as implemented
+        in RDKit.
 
         Args:
             mol: SMILES as `str` or an instance of `Mol`
@@ -58,49 +59,55 @@ def __str__(self):
 
 
 class BemisMurcko(Scaffold):
-    """
-    Reimplementation of Bemis-Murcko scaffolds based on a function described in the
-    discussion here: https://sourceforge.net/p/rdkit/mailman/message/37269507/
+    """Extension of rdkit's BM-like scaffold to make it more true to the paper.
+    In BM's paper, exo bonds on linkers or on rings get cut off but two
+    electrons remain.
+
+    In the rdkit implementation, both atoms in the exo bond get included.
+    This means for BM C1CCC1=N and C1CCC1=O are the same, for rdkit they are
+    different.
+
+    When flattening the BM scaffold using MakeScaffoldGeneric() this leads to
+    distinct scaffolds, as C1CCC1=O is flattened to C1CCC1C and not C1CCC1.
 
-    This implementation allows more flexibility in terms of what substructures should be
-    kept, converted, and how.
+    In this approach, the two electrons are represented as SMILES "=*". This
+    is to make sure the automatic oxidation state assignment of sulfur does
+    not flatten C1CS1(=*)(=*) into C1CS1 when explicit hydrogen count is
+    provided.
 
-    Credit: Francois Berenger
-    Ref.: Bemis, G. W., & Murcko, M. A. (1996). "The properties of known drugs. 1.
+    Ref.:
+
+    Bemis, G. W., & Murcko, M. A. (1996). "The properties of known drugs. 1.
     Molecular frameworks." Journal of medicinal chemistry, 39(15), 2887-2893.
 
+    Related RDKit issue: https://github.com/rdkit/rdkit/discussions/6844
+
+    Credit: Original code provided by Wim Dehaen (@dehaenw)
     """
 
     def __init__(
         self,
-        convert_hetero=True,
-        force_single_bonds=True,
-        remove_terminal_atoms=True,
-        id_prop=None,
+        real_bemismurcko: bool = True,
+        use_csk: bool = False,
+        id_prop: str | None = None,
     ):
         """
         Initialize the scaffold generator.
 
         Args:
-            convert_hetero (bool):
-                Convert hetero atoms to carbons.
-            force_single_bonds (bool):
-                Convert all scaffold's bonds to single ones.
-            remove_terminal_atoms (bool):
-                Remove all terminal atoms, keep only
-                ring linkers.
-            id_prop (str):
-                Name of the property that contains the molecule's unique identifier.
+            real_bemismurcko (bool): Use guidelines from the Bemis-Murcko paper;
+                otherwise, use the native RDKit implementation.
+            use_csk (bool): Make scaffold generic (convert all bonds to single
+                and all atoms to carbon). If real_bemismurcko is on, also
+                remove all flattened exo bonds.
         """
         super().__init__(id_prop=id_prop)
-        self.convertHetero = convert_hetero
-        self.forceSingleBonds = force_single_bonds
-        self.removeTerminalAtoms = remove_terminal_atoms
+        self.realBemisMurcko = real_bemismurcko
+        self.useCSK = use_csk
 
     @staticmethod
     def findTerminalAtoms(mol):
         res = []
-
         for a in mol.GetAtoms():
             if len(a.GetBonds()) == 1:
                 res.append(a)
@@ -119,36 +126,23 @@ def __call__(self, mols, props, *args, **kwargs):
         res = []
         for mol in mols:
             mol = Chem.MolFromSmiles(mol) if isinstance(mol, str) else mol
-            only_HA = Chem.rdmolops.RemoveHs(mol)
-            rw_mol = Chem.RWMol(only_HA)
-
-            # switch all HA to Carbon
-            if self.convertHetero:
-                for i in range(rw_mol.GetNumAtoms()):
-                    rw_mol.ReplaceAtom(i, Chem.Atom(6))
-
-            # switch all non single bonds to single
-            if self.forceSingleBonds:
-                non_single_bonds = []
-                for b in rw_mol.GetBonds():
-                    if b.GetBondType() != Chem.BondType.SINGLE:
-                        non_single_bonds.append(b)
-                for b in non_single_bonds:
-                    j = b.GetBeginAtomIdx()
-                    k = b.GetEndAtomIdx()
-                    rw_mol.RemoveBond(j, k)
-                    rw_mol.AddBond(j, k, Chem.BondType.SINGLE)
-
-            # as long as there are terminal atoms, remove them
-            if self.removeTerminalAtoms:
-                terminal_atoms = self.findTerminalAtoms(rw_mol)
-                while terminal_atoms:
-                    for a in terminal_atoms:
-                        for b in a.GetBonds():
-                            rw_mol.RemoveBond(b.GetBeginAtomIdx(), b.GetEndAtomIdx())
-                        rw_mol.RemoveAtom(a.GetIdx())
-                    terminal_atoms = self.findTerminalAtoms(rw_mol)
-                res.append(Chem.MolToSmiles(rw_mol.GetMol()))
+            Chem.RemoveStereochemistry(mol)  # important for canonicalization
+            scaff = MurckoScaffold.GetScaffoldForMol(mol)
+
+            if self.realBemisMurcko:
+                scaff = ReplaceSubstructs(
+                    scaff,
+                    Chem.MolFromSmarts("[$([D1]=[*])]"),
+                    Chem.MolFromSmarts("[*]"),
+                    replaceAll=True,
+                )[0]
+
+            if self.useCSK:
+                scaff = MurckoScaffold.MakeScaffoldGeneric(scaff)
+                if self.realBemisMurcko:
+                    scaff = MurckoScaffold.GetScaffoldForMol(scaff)
+            Chem.SanitizeMol(scaff)
+            res.append(Chem.MolToSmiles(scaff))
         return pd.Series(res, index=props[self.idProp])
 
     def __str__(self):
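The new `BemisMurcko.__call__` logic can be tried outside the class with a small standalone sketch. The helper name `bm_scaffold` is hypothetical and not part of QSPRpred; it only mirrors the steps in the hunk above, assuming RDKit is installed, and it copies the input molecule so the caller's object is not modified.

```python
from rdkit import Chem
from rdkit.Chem import ReplaceSubstructs
from rdkit.Chem.Scaffolds import MurckoScaffold


def bm_scaffold(smiles, real_bemismurcko=True, use_csk=False):
    """Hypothetical helper mirroring the steps of BemisMurcko.__call__."""
    mol = Chem.Mol(Chem.MolFromSmiles(smiles))  # work on a copy
    Chem.RemoveStereochemistry(mol)  # stereo would otherwise split scaffolds
    scaff = MurckoScaffold.GetScaffoldForMol(mol)

    if real_bemismurcko:
        # Replace every terminal, double-bonded exo atom with a dummy atom
        # so that the two electrons are kept but the element is forgotten.
        scaff = ReplaceSubstructs(
            scaff,
            Chem.MolFromSmarts("[$([D1]=[*])]"),
            Chem.MolFromSmarts("[*]"),
            replaceAll=True,
        )[0]

    if use_csk:
        # Flatten to a cyclic skeleton: all atoms carbon, all bonds single.
        scaff = MurckoScaffold.MakeScaffoldGeneric(scaff)
        if real_bemismurcko:
            # Strip the flattened exo atoms left over from the step above.
            scaff = MurckoScaffold.GetScaffoldForMol(scaff)

    Chem.SanitizeMol(scaff)
    return Chem.MolToSmiles(scaff)


if __name__ == "__main__":
    for smi in ("O=C1CCC1", "N=C1CCC1"):
        print(smi, bm_scaffold(smi), bm_scaffold(smi, real_bemismurcko=False))
```

With `real_bemismurcko=True` the two example molecules from the docstring should map to the same scaffold, whereas the native RDKit behaviour keeps them distinct.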

diff --git a/qsprpred/data/chem/tests.py b/qsprpred/data/chem/tests.py
@@ -21,6 +21,9 @@ def setUp(self):
         [
             ("Murcko", Murcko()),
             ("BemisMurcko", BemisMurcko()),
+            ("BemisMurckoCSK", BemisMurcko(True, True)),
+            ("BemisMurckoJustCSK", BemisMurcko(False, True)),
+            ("BemisMurckoOff", BemisMurcko(False, False)),
         ]
     )
     def testScaffoldAdd(self, _, scaffold):

diff --git a/qsprpred/data/sampling/tests.py b/qsprpred/data/sampling/tests.py
@@ -110,7 +110,11 @@ def testTemporalSplit(self, multitask):
     @parameterized.expand(
         [
             (False, Murcko(), None),
-            (False, BemisMurcko(), ["ScaffoldSplit_000", "ScaffoldSplit_001"]),
+            (
+                False,
+                BemisMurcko(use_csk=True),
+                ["ScaffoldSplit_000", "ScaffoldSplit_001"],
+            ),
             (True, Murcko(), None),
         ]
     )

diff --git a/qsprpred/extra/data/utils/testing/path_mixins.py b/qsprpred/extra/data/utils/testing/path_mixins.py
@@ -1,4 +1,5 @@
 import os
+import platform
 import tempfile
 from typing import Callable
 
@@ -27,6 +28,7 @@
 )
 from qsprpred.extra.data.tables.pcm import PCMDataSet
 from qsprpred.extra.data.utils.msa_calculator import ClustalMSA
+from qsprpred.logs import logger
 from qsprpred.utils.testing.path_mixins import DataSetsPathMixIn
 
 
@@ -44,9 +46,8 @@ def getAllDescriptors(cls) -> list[DescriptorSet]:
         Returns:
             list: list of `MoleculeDescriptorSet` objects
         """
-        return [
+        ret = [
             Mordred(),
-            Mold2(),
             CDKFP(size=2048, search_depth=7),
             CDKExtendedFP(),
             CDKEStateFP(),
@@ -62,6 +63,15 @@ def getAllDescriptors(cls) -> list[DescriptorSet]:
             PaDEL(),
             ExtendedValenceSignature(1),
         ]
+        if platform.system() != "Darwin":
+            ret.append(Mold2())
+        else:
+            # not supported on macOS
+            logger.warning(
+                "Mold2 is not supported on macOS. "
+                "Skipping Mold2 descriptor set in tests."
+            )
+        return ret
 
     @classmethod
     def getAllProteinDescriptors(cls) -> list[ProteinDescriptorSet]:
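The platform check added to `getAllDescriptors` follows a common pattern: build the list unconditionally, then append platform-specific entries behind a `platform.system()` test. A minimal sketch of that pattern is shown here with plain strings instead of descriptor set objects; the function name `select_descriptor_names` and its `system` parameter are assumptions made for illustration so that the branch can be exercised on any machine.

```python
import logging
import platform
from typing import Optional

logger = logging.getLogger(__name__)


def select_descriptor_names(system: Optional[str] = None) -> list:
    """Return descriptor names, leaving out those unsupported on `system`.

    `system` defaults to the running platform; passing it explicitly is a
    convenience for testing and is not something the original code does.
    """
    system = platform.system() if system is None else system
    names = ["Mordred", "PaDEL"]  # stand-ins for the descriptor set objects
    if system != "Darwin":
        names.append("Mold2")
    else:
        # platform.system() reports "Darwin" on macOS
        logger.warning("Mold2 is not supported on macOS; skipping it.")
    return names


if __name__ == "__main__":
    print(select_descriptor_names())
```

Injecting the platform name as an argument is one way to keep both branches covered by tests regardless of where the suite runs.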