This repository contains KNIME workflows used to standardise chemical structures

PharminfoVienna/Chemical-Structure-Standardisation


This repository contains a structure standardisation workflow that was used in several papers of the Pharmacoinformatics research group.


The folders contain workflows with different versions of the standardiser. The publication should always indicate which version was used.

The folder Python script contains the script that runs in the Python node within the workflow; it can also be used on its own. In the future we hope to provide a command line tool to standardise structures.

Support:


Jennifer Hemmerich, jennifer.hemmerich[at]univie.ac.at

Important Note:


The workflow requires an Anaconda environment with Python 3.6 or above and RDKit (2019 or later) installed.

Further instructions on how to set this up can be found here: https://docs.knime.com/latest/python_installation_guide/index.html

The workflow can also be retrieved from the KNIME Hub at https://kni.me/w/auOFJsQKZXJmSc_9. However, please use the support email or GitHub for any issues related to the workflow.

Dependencies


Python >= 3.6.X

RDKit >= 2019.X.X
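
The two requirements above can be checked from inside the Conda environment before the workflow is opened. The following is only a minimal sketch: the function names `python_ok`, `rdkit_release_ok` and `check_environment` are made up for this illustration and are not part of the workflow, and the check assumes that RDKit version strings start with the release year (for example `2019.03.1`).

```python
import importlib.util
import sys

MIN_PYTHON = (3, 6)      # from the Dependencies list
MIN_RDKIT_YEAR = 2019    # from the Dependencies list


def python_ok(version_info=sys.version_info):
    """Return True if the interpreter is at least Python 3.6."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def rdkit_release_ok(version_string):
    """Return True if a version string such as '2019.03.1' is new enough.

    Assumption: the first dot-separated field is the release year.
    """
    return int(version_string.split(".")[0]) >= MIN_RDKIT_YEAR


def check_environment():
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if not python_ok():
        problems.append("Python %d.%d or above is required" % MIN_PYTHON)
    if importlib.util.find_spec("rdkit") is None:
        problems.append("RDKit is not installed in this environment")
    else:
        import rdkit
        if not rdkit_release_ok(rdkit.__version__):
            problems.append("RDKit %s is older than 2019" % rdkit.__version__)
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

Running the script with the Python interpreter that KNIME is configured to use prints one warning per unmet requirement and nothing if both are satisfied.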

Usage:


  1. Download the workflow and load it into your KNIME installation.
  2. Double click on the Input Selection Metanode/Component and choose your SDF or CSV file. Run the node.
  3. Double click on the Molecule Format component and choose the appropriate molecule format (currently SMILES and SDF are supported).
  4. Double click on the Standardiser component and choose the appropriate settings; information on the options can be found in the help menu. --> The standardiser creates multiple columns to inform you about the standardisation process and possible problems with molecules. Please inspect them carefully to ensure that no problems occurred.
  5. Configure the SDF writer to get a dataset with the standardised molecules. You can also use the output directly for your own workflows.
  6. We recommend checking for duplicates by merging on the InChIKeys which are generated by the workflow.
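
Outside KNIME, the duplicate check from step 6 could look roughly like the sketch below. It is an illustration only: the column name `InChIKey`, the `Name` column and the helper names are assumptions, so adapt them to the columns the workflow actually writes. The rows are given inline here; for an exported CSV file they could come from `csv.DictReader`.

```python
from collections import defaultdict


def group_by_inchikey(rows, key_column="InChIKey"):
    """Map each InChIKey to the indices of the rows that carry it."""
    groups = defaultdict(list)
    for index, row in enumerate(rows):
        groups[row[key_column]].append(index)
    return dict(groups)


def find_duplicates(rows, key_column="InChIKey"):
    """Keep only the InChIKeys that occur in more than one row."""
    groups = group_by_inchikey(rows, key_column)
    return {key: indices for key, indices in groups.items() if len(indices) > 1}


# Hypothetical example rows; the column names are assumptions.
rows = [
    {"Name": "ethanol", "InChIKey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"Name": "ethyl alcohol", "InChIKey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"Name": "methanol", "InChIKey": "OKKJLVBELUTLKV-UHFFFAOYSA-N"},
]

print(find_duplicates(rows))
# {'LFQSCWFLJHTTHZ-UHFFFAOYSA-N': [0, 1]}
```

Grouping by key rather than dropping rows directly keeps all duplicates visible, so conflicting annotations or activity values can be reviewed before one row is chosen.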

For any issues or bugs please contact support or open an issue.

Standardisation Protocol:


In summary, the general procedure for standardising a molecule is as follows.

First, the molecule is checked for fragments. Then the following steps are run for each fragment:

  1. Break bonds to Group I or II metals.
  2. Apply standardisation rules.
  3. Neutralise charges by adding or removing protons.
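
The per-fragment steps above can be approximated with RDKit's `rdMolStandardize` module. This is a hedged sketch, not the script shipped in this repository: the function name `standardise_fragments` is hypothetical, and RDKit's `MetalDisconnector` uses its own metal definitions, which are broader than Group I and II only.

```python
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize


def standardise_fragments(smiles):
    """Return one standardised SMILES per fragment of the input.

    Illustrative only; the workflow's own Python node may differ.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("Could not parse SMILES: %r" % smiles)

    disconnector = rdMolStandardize.MetalDisconnector()
    normalizer = rdMolStandardize.Normalizer()
    uncharger = rdMolStandardize.Uncharger()

    results = []
    # Check for fragments first, then treat every fragment separately.
    for fragment in Chem.GetMolFrags(mol, asMols=True):
        # Step 1: break covalent bonds to metals (RDKit's default metal set).
        fragment = disconnector.Disconnect(fragment)
        # Step 2: apply the normalisation (standardisation) rules.
        fragment = normalizer.normalize(fragment)
        # Step 3: neutralise charges by adding or removing protons.
        fragment = uncharger.uncharge(fragment)
        results.append(Chem.MolToSmiles(fragment))
    return results


if __name__ == "__main__":
    print(standardise_fragments("C[N+](C)(C)CCOC(=O)N.[Cl-]"))
```

Charges that cannot be removed by adding or removing a proton, such as the quaternary nitrogen in carbachol, are expected to remain after the last step.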

Depending on the selected options, the following actions are carried out:

Keep all Molecules? --> All molecules (standardised and non-standardised) are kept. If an entry contains more than one molecule, it is split into separate rows (these molecules can be identified in the Molecule_index column).

Remove Molecules with inorganic Atoms (Organic: H, C, N, O, F, P, S, Cl, Br, I) --> If inorganic atoms (e.g. B, Sn, ...) are found, the molecule is sent to the second output.

Remove Mixtures --> Mixtures are sent to the second output. If they are not removed, they are split into separate rows and a flag is added in the Mixture column.
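
The checks behind these options can be sketched with RDKit as follows. The element list is taken from the option text above; everything else is an assumption made for illustration, in particular the function names and the simplification that any entry with more than one fragment counts as a mixture (the workflow may use a stricter definition).

```python
from rdkit import Chem

# Elements listed as "organic" in the option description above.
ORGANIC_SYMBOLS = {"H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"}


def has_only_organic_atoms(mol):
    """Return True if every atom is one of the listed organic elements."""
    return all(atom.GetSymbol() in ORGANIC_SYMBOLS for atom in mol.GetAtoms())


def classify(smiles):
    """Return the flags that the options above would act on.

    Assumption: more than one fragment is treated as a mixture.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("Could not parse SMILES: %r" % smiles)
    fragments = Chem.GetMolFrags(mol, asMols=True)
    return {
        "organic_only": has_only_organic_atoms(mol),
        "fragment_count": len(fragments),
        "is_mixture": len(fragments) > 1,
        "fragments": [Chem.MolToSmiles(f) for f in fragments],
    }


if __name__ == "__main__":
    for smiles in ["CCO", "OB(O)c1ccccc1", "CCO.CS(=O)C"]:
        print(smiles, classify(smiles))
```

With "Remove Molecules with inorganic Atoms" selected, entries with `organic_only` set to `False` (here the boronic acid) would go to the second output; with "Remove Mixtures" selected, the same applies to entries flagged as mixtures.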

Additionally, within the workflow the stereochemistry can be standardised or removed.

Background


The representation of chemical structures is not unified across formats and tasks. For example, the same functional group can be represented in many different ways.

Beyond this, even a simple molecule such as ethanol has several valid SMILES representations (OCC, C(O)C, CCO). Although canonical SMILES exist, they are not standardised between tools, as each tool uses its own algorithm to canonicalise the SMILES string. To a computer program these structures look different, although we know they are not. This makes duplicate removal very tedious for big datasets. Hence we aim to always choose the same representation, so that duplicate structures can be detected automatically in large datasets. Furthermore, a molecule entry could contain fragments, mixtures, salts or solvents, which would create artifacts during screening or descriptor generation. For example, the entry for carbachol contains a chloride, and sunitinib DMSO contains DMSO:

    C[N+](C)(C)CCOC(=O)N.[Cl-]
    CCN(CC)CCNC(=O)C1=C(NC(=C1C)C=C2C3=C(C=CC(=C3)F)NC2=O)C.CS(=O)C

If we kept these entries unchanged, we could accidentally dock or predict only the chloride, missing the carbachol that fits into our pocket, or calculate descriptors for the chloride instead of the carbachol we want to use. For chloride, our programs would probably warn us, as it is not an organic molecule, but for sunitinib there is a higher chance of using DMSO by mistake, as it is an organic compound.
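
Both points can be reproduced with a few lines of RDKit. The sketch below is an illustration rather than part of the workflow, and the helper name `fragments_by_size` is hypothetical. It first shows that the three ethanol SMILES collapse to one canonical SMILES and one InChIKey within a single toolkit, and then lists the fragments of the two example entries by heavy-atom count.

```python
from rdkit import Chem

# Three valid SMILES strings for ethanol.
ethanol_variants = ["OCC", "C(O)C", "CCO"]
canonical_smiles = {Chem.MolToSmiles(Chem.MolFromSmiles(s)) for s in ethanol_variants}
inchikeys = {Chem.MolToInchiKey(Chem.MolFromSmiles(s)) for s in ethanol_variants}
print(canonical_smiles)  # {'CCO'}
print(len(inchikeys))    # 1


def fragments_by_size(smiles):
    """Return (heavy atom count, SMILES) per fragment, largest first."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("Could not parse SMILES: %r" % smiles)
    fragments = Chem.GetMolFrags(mol, asMols=True)
    return sorted(
        ((f.GetNumHeavyAtoms(), Chem.MolToSmiles(f)) for f in fragments),
        reverse=True,
    )


carbachol_chloride = "C[N+](C)(C)CCOC(=O)N.[Cl-]"
sunitinib_dmso = "CCN(CC)CCNC(=O)C1=C(NC(=C1C)C=C2C3=C(C=CC(=C3)F)NC2=O)C.CS(=O)C"

for entry in (carbachol_chloride, sunitinib_dmso):
    print(fragments_by_size(entry))
```

Sorting by heavy-atom count makes it easy to see which fragment is the compound of interest, but size alone is only a heuristic and does not replace inspecting the entries.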

For our standardisation we chose to use a standardisation protocol as proposed by Francis Atkinson from the EMBL-EBI. The tool was developed for the e-Tox Project (https://wwwdev.ebi.ac.uk/chembl/extra/francis/standardiser/) and is available through the Python Package Index as a Python package. However, since summer 2019 RDKit (https://www.rdkit.org/ and https://www.rdkit.org/docs/source/rdkit.Chem.MolStandardize.rdMolStandardize.html) contains the functionality of another Python library for standardisation, MolVS (https://github.com/mcs07/MolVS/blob/master/docs/source/index.rst). Thus, we upgraded the node to rely only on RDKit, and the standardiser library is no longer needed.

Citation


This code is © J. Hemmerich, 2020, and it is made available under the GPL license enclosed with the software.

If you use this software for an academic publication then you are obliged to provide proper attribution. Please use the following citation:

@software{Hemmerich2020,
	title = {{KNIME} Structure Standardisation Workflow},
	url = {https://github.com/PharminfoVienna/Chemical-Structure-Standardisation},
	version = {0.2.0},
	publisher = {Department of Pharmaceutical Chemistry, University of Vienna},
	author = {Hemmerich, Jennifer},
	date = {2020-01-31},
	note = {https://kni.me/w/{auOFJsQKZXJmSc}\_9}
}
