Work done by each step

Greg Landrum edited this page Dec 4, 2019 · 14 revisions

standardize_molblock()

If the molecule causes the exclusion flag to be set, this function does nothing. See the page on the exclusion flag for more details.

  1. Standardize unknown stereochemistry (Handled by the RDKit Mol file parser)
    1. Fix wiggly bonds on sp3 carbons - sets atoms and bonds marked as unknown stereo to no stereo
    2. Fix wiggly bonds on double bonds - sets the double bond to a crossed bond
  2. Clear S group data from the mol file
  3. Kekulize the structure
  4. Remove H atoms (See the page on explicit Hs for more details)
  5. Normalization:
    1. Fix hypervalent nitro groups
    2. Fix KO to K+ O- and NaO to Na+ O- (Also add Li+ to this)
    3. Correct amides with N=COH
    4. Standardise sulphoxides to charge separated form
    5. Standardize diazonium N (atom :2 here: [*:1]-[N;X2:2]#[N;X1:3]>>[*:1]) to N+
    6. Ensure quaternary N is charged
    7. Ensure trivalent O ([*:1]=[O;X2;v3;+0:2]-[#6:3]) is charged
    8. Ensure trivalent S ([O:1]=[S;D2;+0:2]-[#6:3]) is charged
    9. Ensure halogen with no neighbors ([F,Cl,Br,I;X0;+0:1]) is charged
  6. The molecule is neutralized, if possible. See the page on neutralization rules for more details.
  7. Remove stereo from tartrate to simplify salt matching
  8. Normalise (straighten) triple bonds and allenes
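
The wiggly-bond handling in step 1 can be illustrated on a plain list of bonds. This is only a simplified sketch, not the RDKit Mol file parser's implementation: the function name `sketch_fix_wiggly_bonds` and the tuple layout are hypothetical, and it assumes the V2000 stereo codes (4 for an "either" single bond, 3 for a "cis or trans" double bond) and that a wiggly bond affects a double bond only when it touches one of the double bond's atoms.

```python
from typing import List, Tuple

# Hypothetical bond record: (begin atom, end atom, bond order, V2000 stereo code)
Bond = Tuple[int, int, int, int]

WIGGLY = 4   # V2000 stereo code for an "either" (wiggly) single bond
CROSSED = 3  # V2000 stereo code for a "cis or trans" (crossed) double bond


def sketch_fix_wiggly_bonds(bonds: List[Bond]) -> List[Bond]:
    """Clear wiggly single bonds and mark adjacent double bonds as crossed."""
    # Atoms at which a wiggly single bond starts.
    wiggly_starts = {
        begin
        for (begin, _end, order, stereo) in bonds
        if order == 1 and stereo == WIGGLY
    }
    fixed = []
    for begin, end, order, stereo in bonds:
        if order == 1 and stereo == WIGGLY:
            # Unknown stereo becomes "no stereo".
            stereo = 0
        elif order == 2 and (begin in wiggly_starts or end in wiggly_starts):
            # A wiggly bond starts at this double bond: mark it as crossed.
            stereo = CROSSED
        fixed.append((begin, end, order, stereo))
    return fixed
```

Ordinary wedge and hash bonds (stereo codes 1 and 6) pass through unchanged in this sketch.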

get_parent_molblock()

  1. Sets all isotopes to 0 and removes Hs. This extra H removal step is necessary because previous H removal will have skipped D or T atoms (see the page on explicit Hs for more details)
  2. Solvents (defined in a list) are removed unless this step removes all fragments, in which case the molecule is not modified.
  3. Salts (defined in a list) are removed unless this step removes all fragments, in which case the molecule is not modified.
  4. Duplicate fragments (duplication detected after neutralizing and removing Hs from fragments) are removed. Duplicates are detected using canonical SMILES, so fragments that are tautomers of each other will not be removed in this step.
  5. The remaining molecule is neutralized, if possible. See the page on neutralization rules for more details.
  6. The exclusion flag is checked (see the page on the exclusion flag for more details). If the flag is set, the molecule remaining after solvent stripping is returned. Two examples show where this makes a difference. For ranitidine bismuth citrate (CHEMBL2111286), the Bi ion is stripped as a salt and ranitidine is the parent. For CuCl2.2H2O, the waters (solvent) and chloride ions (salts) are removed, but the Cu+2 that remains sets the exclusion flag, so the parent is CuCl2.
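
The control flow of steps 2, 3, 4 and 6 can be sketched on a list of fragment SMILES. This is an illustration under stated assumptions, not the pipeline's code: the function names, the tiny solvent and salt sets, and the `is_excluded` callback are all hypothetical, the fragments are assumed to be canonical SMILES already, and isotope handling and neutralization (steps 1 and 5) are left out.

```python
from typing import Callable, Iterable, List, Set

# Hypothetical, deliberately tiny stand-ins for the real solvent and salt lists.
HYPOTHETICAL_SOLVENTS: Set[str] = {"O"}
HYPOTHETICAL_SALTS: Set[str] = {"[Cl-]", "[Na+]"}


def strip_unless_empty(fragments: List[str], to_remove: Set[str]) -> List[str]:
    """Drop listed fragments, but keep the input unchanged if nothing would remain."""
    kept = [frag for frag in fragments if frag not in to_remove]
    return kept if kept else list(fragments)


def sketch_parent(
    fragments: Iterable[str],
    is_excluded: Callable[[List[str]], bool],
    solvents: Set[str] = HYPOTHETICAL_SOLVENTS,
    salts: Set[str] = HYPOTHETICAL_SALTS,
) -> List[str]:
    """Return the fragments that would make up the parent in this sketch."""
    desolvated = strip_unless_empty(list(fragments), solvents)  # step 2
    desalted = strip_unless_empty(desolvated, salts)            # step 3
    unique = list(dict.fromkeys(desalted))                      # step 4, order kept
    # Step 6: if the exclusion flag is set on what remains,
    # fall back to the molecule left after solvent stripping.
    if is_excluded(unique):
        return desolvated
    return unique
```

With an `is_excluded` callback that flags copper, the fragments of CuCl2.2H2O come back as the copper ion plus both chlorides, matching the CuCl2 example above.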

CheckMolecule()

Implemented checks

  1. Number of atoms < 1, i.e. empty CTAB
  2. Polymer
  3. V3000 mol file
  4. 3D coordinates in mol file
  5. 3D flag set on 2D molecule
  6. Illegal bond type
  7. Illegal bond stereo
  8. Multiple stereobonds on stereoatom
  9. Overlapping atoms (atoms with identical coordinates)
  10. Zero coordinates (all atoms have zero coordinates) - can happen when the mol file is created from SMILES
  11. Stereobond in ring
  12. Stereobond between stereo centres
  13. Crossed bonds in ring
  14. Radicals that don’t fit known stable radical patterns (allowed are nitric oxide and aminoxyl)
  15. StereoCenters MOL/InChI/RDKit mismatch
  16. StereoCenters MOL_RDKit/InChI mismatch
  17. StereoCenters MOL_InChI/RDKit mismatch
  18. StereoCenters InChI_RDKit/MOL mismatch
  19. InChI warning:Accepted unusual valence(s)
  20. InChI warning:Empty structure
  21. InChI ambiguous stereo
  22. Any other InChI error/warning
  23. Illegal input (mol block could not be parsed)
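
As an illustration of the coordinate-based checks (4, 9 and 10), the following minimal sketch works on a list of atom coordinates. The function name and the returned message strings are hypothetical and are not the checker's actual messages or penalties. Each check is evaluated independently here, which may differ from how the real checks interact.

```python
from typing import List, Tuple

Coord = Tuple[float, float, float]


def sketch_coordinate_checks(coords: List[Coord]) -> List[str]:
    """Return hypothetical issue labels for a list of (x, y, z) atom coordinates."""
    issues = []
    # Check 10: every atom sits at the origin.
    if coords and all(c == 0.0 for xyz in coords for c in xyz):
        issues.append("zero coordinates")
    # Check 9: at least two atoms share identical coordinates.
    if len(set(coords)) < len(coords):
        issues.append("overlapping atoms")
    # Check 4: any non-zero z coordinate means the mol file is 3D.
    if any(z != 0.0 for _x, _y, z in coords):
        issues.append("3D coordinates")
    return issues
```

Exact comparison is used because mol file coordinates are fixed-precision text. A tolerance would be a reasonable alternative if the coordinates came from a floating-point calculation.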

A note on the StereoCenters counts:

  • Mol: number of atoms where a wedged bond starts
  • InChI: number of tetrahedral stereocenters
  • RDKit: number of atomic stereocenters remaining after calling Chem.AssignStereochemistry()
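
The Mol count can be sketched directly from the bond block of a V2000 mol file. This is a minimal illustration and not the checker's own code: the function name is hypothetical, and it assumes that "wedged" means V2000 bond stereo codes 1 (up) and 6 (down), so wiggly bonds (code 4) are not counted.

```python
WEDGE_CODES = {1, 6}  # assumption: up and down wedges only, not "either" (4)


def count_wedge_start_atoms(molblock: str) -> int:
    """Count atoms at which at least one wedged bond starts (V2000 only)."""
    lines = molblock.splitlines()
    counts = lines[3]  # counts line: atom count in columns 1-3, bond count in 4-6
    n_atoms = int(counts[0:3])
    n_bonds = int(counts[3:6])
    first_bond = 4 + n_atoms  # bond block follows the header and the atom block
    starts = set()
    for line in lines[first_bond:first_bond + n_bonds]:
        # Bond line columns: begin atom 1-3, end atom 4-6, type 7-9, stereo 10-12.
        stereo_field = line[9:12].strip()
        stereo = int(stereo_field) if stereo_field else 0
        if stereo in WEDGE_CODES:
            starts.add(int(line[0:3]))  # the wedge starts at the first atom
    return len(starts)
```

Fixed-width slicing is used instead of splitting on whitespace because adjacent V2000 fields run together once atom or bond counts reach three digits.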