
Work done by each step

Greg Landrum edited this page Feb 7, 2019 · 14 revisions

StandardiseMolecule()

Current

  1. Clear S Group data from the mol file
  2. Kekulise structures
  3. Identify and fix bad valences (where possible)
  4. Fix KO to K+ O- and NaO to Na+ O- (Li+ should also be added to this)
  5. Standardise NO2 groups to the charge-separated form [N+](=O)[O-]
  6. Change NH+ Cl- to N and HCl
  7. Remove stereo from tartrate to simplify salt matching
  8. Fix wiggly bonds on sp3 carbons: set atoms and bonds marked as unknown stereo to no stereo
  9. Remove explicit hydrogens from molecules (except for certain atom types)
  10. Normalise (straighten) triple bonds and allenes
  11. Standardise sulphoxides to the charge-separated form (taking care that this affects only sulphoxides, not sulphonamides or sulphones)
  12. “Fix wiggly bonds” on double bonds: set to crossed bond
  13. If the formal charge on the molecule is not zero, protonate and/or deprotonate to neutralise it (this may not be possible for some compounds with quaternary nitrogens and multiple carboxyl, sulphate or phosphate groups)
  14. Correct amides drawn as N=COH
  15. Correct bond angles where possible

New (proposed)

  1. Clear S Group data from the mol file
  2. RDKit standardisation:
    1. Identify and fix bad valences (where possible)
    2. Kekulise structures
    3. Standardise NO2 groups to the charge-separated form [N+](=O)[O-]
    4. Fix wiggly bonds on sp3 carbons: set atoms and bonds marked as unknown stereo to no stereo
    5. “Fix wiggly bonds” on double bonds: set to crossed bond
    6. Remove explicit hydrogens from molecules (except for certain atom types)
  3. Fix KO to K+ O- and NaO to Na+ O- (Li+ should also be added to this)
  4. Change NH+ Cl- to N and HCl
  5. Remove stereo from tartrate to simplify salt matching
  6. Normalise (straighten) triple bonds and allenes
  7. Standardise sulphoxides to the charge-separated form (taking care that this affects only sulphoxides, not sulphonamides or sulphones)
  8. If the formal charge on the molecule is not zero, protonate and/or deprotonate to neutralise it (this may not be possible for some compounds with quaternary nitrogens and multiple carboxyl, sulphate or phosphate groups)
  9. Correct amides drawn as N=COH
  10. Correct bond angles where possible
  11. Add chiral Hs

GetParentMolecule()

Current

  1. Identify salts or isotopes for processing (currently this is done from the InChI, e.g. InChI like '%.%' or InChI like '%/i%', but it can be achieved in other ways).
  2. Identify metal-containing compounds whose structures are to be excluded from ChEMBL (using a metal list and a set of rules; details to be supplied). These molecules do not need to be salt stripped. However, water molecules and isotopes are removed from them to make a parent, so that they can be grouped as parents and salts
  3. For all other compounds, salt stripping is performed to form a parent molecule
  4. Salts are removed according to the ChEMBL salt list
  5. Solvents are then removed according to the solvents list
  6. Isotopes are removed
  7. If no salts are removed, the molecule is defined as a mixture, but isotopes are still removed
  8. Check for “empty” salts, i.e. cases where both components are salts. If so, the salt is recreated but any isotopes or solvents are removed
  9. Formal charges before and after salt stripping are checked and the parent is neutralised if necessary. This deals with carboxylic acid salts and with HCl salts where the molecules are drawn as XNH+ and Cl- rather than XN and HCl
  10. Allow specific compounds to be flagged for salt stripping even though they meet the criteria for exclusion from the process. Currently the one specific example is ranitidine bismuth citrate (CHEMBL2111286).
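
As an illustration only, steps 1, 3 to 5 and 8 above could be sketched in plain Python as follows. This is not the ChEMBL implementation: the function names and the tiny `SALTS` and `SOLVENTS` sets are hypothetical, and exact string matching on dot-separated SMILES fragments stands in for the structure matching a cheminformatics toolkit would do. Metal rules, isotope removal and neutralisation are not shown.

```python
# Hypothetical, deliberately tiny lookup tables. The real ChEMBL salt and
# solvent lists are far larger and are matched on structures, not strings.
SALTS = {"Cl", "[Cl-]", "Br", "[Na+]", "[K+]"}
SOLVENTS = {"O", "CO", "CCO"}


def needs_parent_processing(inchi):
    """Step 1: mirror the SQL patterns '%.%' and '%/i%' as substring tests."""
    has_several_components = "." in inchi  # '.' separates formula components
    has_isotope_layer = "/i" in inchi      # '/i' starts the isotopic layer
    return has_several_components or has_isotope_layer


def get_parent_smiles(smiles):
    """Steps 3-5 and 8: strip salt and solvent fragments from a SMILES.

    Assumes each dot-separated fragment is one complete component written
    exactly as it appears in the lookup tables.
    """
    fragments = smiles.split(".")
    parent = [f for f in fragments if f not in SALTS and f not in SOLVENTS]
    if parent:
        return ".".join(parent)
    # "Empty" salt: every fragment is a salt or a solvent, so the salt is
    # recreated with the solvent fragments dropped.
    salt_only = [f for f in fragments if f not in SOLVENTS]
    # Assumption (not from the page): a record made only of solvent
    # fragments is returned unchanged.
    return ".".join(salt_only) if salt_only else smiles
```

With these toy lists, `get_parent_smiles("CCN.Cl")` gives `"CCN"`, while `get_parent_smiles("[Na+].[Cl-].O")` gives `"[Na+].[Cl-]"`, because both remaining components are salts and only the water is dropped.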

New (proposed)
