There are two utilities that deal with duplicate molecules, common_names
and unique_molecules
.
If all you care about is eliminating duplicates as quickly and efficiently
as possible, use unique_molecules
.
The purpose of common_names
is to allow identification of which
molecules are duplicates.
The help message for common_names
is
Gather all the names duplicate structures together
-a compare graph forms - add 2nd -a option to include H count
-c exclude chirality information
-x exclude directional bonds
-l strip to largest fragment
-I remove isotopes before storing
-T ... standard element transformation options, enter '-T help'
-D <separator> separator for when storing duplicate entries
-f single pass operation, smiles output only
-y write first name and count of smiles only
-s <size> maximum number of molecules to process
-r <number> report progress every <number> molecules processed
-X ... miscellaneous and obscure options
-i <type> specify input file type
-S <name> specify name for output
-o <type> specify output file type
-E ... standard element options
-A <qualifier> Aromaticity, enter "-A help" for options
-g <qualifier> chemical standardisations, enter "-g help" for usage
-v verbose output
Molecules can be compared for equality in different ways. Several command line options govern what components of the input molecules go into the comparison.
Convert molecules to graph form, remove chirality, isotopes and
charges, all bonds become single bonds, and compare the unique
smiles of the resulting molecules. In this way, benzene and cyclohexane
are identical. Add a second -a
option and the hydrogen count of
the starting molecule is included with the unique smiles, so
benzene and cyclohexane would no longer be the same.
The molecular graph is useful for doing tautomer comparisons, but may in some cases be too loose.
Remove chirality before comparing. Be cognisant of the accuracy of chirality information you may have.
Remove cis-trans bonds before comparing. This is highly recommended since this information is frequently not accurate.
Strip counterions before comparing.
Remove any isotopic information before comparing.
common_names
generates an output file that has molecules found to
be duplicates grouped together. So if the input contained
C methane
C CH4
the output file would contain
C methane:CH4
where the :
character is set by the -D option.
If you are just writing smiles, this can be a faster, but memory requirements may be larger. Definitely worth starting here, and if it works, great.
Sets an initial size for certain internal data structures. It is not
used when using the -f
option, but when processing multiple input
files, it is required. Set to the number of molecules in your
input file(s).
Report progress every <number> molecules processed. This can be helpful when processing large datasets.
Only works in the absence of the -f
option. Rather than concatenate
all names found, just write the first one and a count. So, our
methane example would be written as
C methane 2
As duplicate structures are identified, write them to <fname>.
Apply one or more reactions to molecules before doing the comparison. Not sure if this is a good idea or not, but today only old style reaction files are supported.
Frequently it is desirable to consider the heavy halogens to be equivalent. Use the -T option to enable this.
-T 'I=Cl' -T 'Br=Cl'
enables such a transformation.
Specify a file name stem for the output.
Several options are available via the -X option.
-X num
In the -S
output file, include a count of the number of instances of each
structure.
-X proto=<fname> Write the results as textproto form to <fname>. This might look like
smiles: "[N+]1(=CN(C)C=C1)CCCC" key: "CCCC[n+]1c[n](C)cc1" id: "CHEMBL3182180" id: "CHEMBL3184676" id: "CHEMBL3559949"
Where id
is a repeated field, holding the id's of all the molecules matching
the key.
The other options are common across different tools.