@@ -7,11 +7,14 @@ smilite is a Python module to download and analyze SMILE strings (Simplified Mol
  Now supports both Python 3.x and Python 2.x.

  ####Sections
- <p><a href="#installation">Installation</a><br>
- <p><a href="#documentation">Documentation</a><br>
- <p><a href="#examples">Command Line Scripts Examples</a><br>
- <p><a href="#contact">Contact</a><br>
- <p><a href="#changelog">Changelog</a><br>
+ &bull; <a href="#installation">Installation</a><br>
+ &bull; <a href="#documentation">Documentation</a><br>
+ &bull; <a href="#examples">Command Line Scripts Examples</a><br>
+ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#45; <a href="#gen_zincid">gen_zincid_smile_csv.py (downloading SMILES)</a><br>
+ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#45; <a href="#comp_smile">comp_smile_strings.py (checking for duplicates within 1 file)</a><br>
+ &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&#45; <a href="#comp_2_smile">comp_2_smile_files.py (checking for duplicates across 2 files)</a><br>
+ &bull; <a href="#contact">Contact</a><br>
+ &bull; <a href="#changelog">Changelog</a><br>


  <br>
@@ -42,47 +45,92 @@ Documentation
  After you installed the smilite module, you can import it in Python via `import smilite`.
  The current functions include:

- <pre>def get_zinc_smile(zinc_id):
-     Gets the corresponding SMILE string for a ZINC ID query from
-     the ZINC online database. Requires an internet connection.
-     Keyword arguments:
-         zinc_id (str): A valid ZINC ID, e.g. 'ZINC00029323'
-     Returns the SMILE string for the corresponding ZINC ID.
-         E.g., 'COc1cccc(c1)NC(=O)c2cccnc2'</pre>
+ <div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"><pre style="margin: 0; line-height: 125%"><span style="color: #008800; font-weight: bold">def</span> <span style="color: #0066BB; font-weight: bold">get_zinc_smile</span>(zinc_id):
+     <span style="color: #DD4422">"""</span>
+ <span style="color: #DD4422">    Gets the corresponding SMILE string for a ZINC ID query from</span>
+ <span style="color: #DD4422">    the ZINC online database. Requires an internet connection.</span>
+
+ <span style="color: #DD4422">    Keyword arguments:</span>
+ <span style="color: #DD4422">        zinc_id (str): A valid ZINC ID, e.g. 'ZINC00029323'</span>
+
+ <span style="color: #DD4422">    Returns the SMILE string for the corresponding ZINC ID.</span>
+ <span style="color: #DD4422">        E.g., 'COc1cccc(c1)NC(=O)c2cccnc2'</span>
+
+ <span style="color: #DD4422">    """</span>
+ </pre></div>
+
+ <div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"><pre style="margin: 0; line-height: 125%"><span style="color: #008800; font-weight: bold">def</span> <span style="color: #0066BB; font-weight: bold">simplify_smile</span>(smile_str):
+     <span style="color: #DD4422">"""</span>
+ <span style="color: #DD4422">    Simplifies a SMILE string by removing hydrogen atoms (H),</span>
+ <span style="color: #DD4422">    chiral specifications ('@'), charges (+ / -), '#'-characters,</span>
+ <span style="color: #DD4422">    and square brackets ('[', ']').</span>
+
+ <span style="color: #DD4422">    Keyword Arguments:</span>
+ <span style="color: #DD4422">        smile_str (str): A smile string, e.g., C[C@H](CCC(=O)NCCS(=O)(=O)[O-])</span>
+ <span style="color: #DD4422">    </span>
+ <span style="color: #DD4422">    Returns a simplified SMILE string, e.g., CC(CCC(=O)NCCS(=O)(=O)O)</span>
+
+ <span style="color: #DD4422">    """</span>
+ </pre></div>
+
+

- <pre>def generate_zincid_smile_csv(zincid_list, out_file):
-     Generates a CSV file of ZINC_ID,SMILE_string entries by querying the ZINC online
-     database.
-     Keyword arguments:
-         zincid_list (str): Path to a UTF-8 or ASCII formatted file
-             that contains 1 ZINC_ID per row. E.g.,
-             ZINC0000123456
-             ZINC0000234567
-             [...]
-         out_file (str): Path to a new output CSV file that will be written.
-         print_progress_bar (bool): Prints a progress bar to the screen if True.</pre>
-
- <pre>def check_duplicate_smiles(zincid_list, out_file, compare_simplified_smiles=False,
-         print_progress_bar=False):
-     Scans a ZINC_ID,SMILE_string CSV file for duplicate SMILE strings.
-     Keyword arguments:
-         zincid_list (str): Path to a UTF-8 or ASCII formatted file that
-             contains 1 ZINC_ID per row.
-             E.g.,
-             ZINC12345678,Cc1ccc(cc1C)OCCOc2c(cc(cc2I)/C=N/n3cnnc3)OC
-             ZINC01234567,C[C@H]1CCCC[NH+]1CC#CC(c2ccccc2)(c3ccccc3)O
-             [...]
-         out_file (str): Path to a new output CSV file that will be written.
-         compare_simplified_smiles (bool): If true, SMILE strings will be simplified
-             for the comparison.</pre>
+ <div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"><pre style="margin: 0; line-height: 125%"><span style="color: #008800; font-weight: bold">def</span> <span style="color: #0066BB; font-weight: bold">generate_zincid_smile_csv</span>(zincid_list, out_file, print_progress_bar<span style="color: #333333">=</span><span style="color: #007020">False</span>):
+     <span style="color: #DD4422">"""</span>
+ <span style="color: #DD4422">    Generates a CSV file of ZINC_ID,SMILE_string entries by querying the ZINC online</span>
+ <span style="color: #DD4422">    database.</span>
+
+ <span style="color: #DD4422">    Keyword arguments:</span>
+ <span style="color: #DD4422">        zincid_list (str): Path to a UTF-8 or ASCII formatted file</span>
+ <span style="color: #DD4422">            that contains 1 ZINC_ID per row. E.g.,</span>
+ <span style="color: #DD4422">            ZINC0000123456</span>
+ <span style="color: #DD4422">            ZINC0000234567</span>
+ <span style="color: #DD4422">            [...]</span>
+ <span style="color: #DD4422">        out_file (str): Path to a new output CSV file that will be written.</span>
+ <span style="color: #DD4422">        print_progress_bar (bool): Prints a progress bar to the screen if True.</span>
+
+ <span style="color: #DD4422">    """</span>
+ </pre></div>
+
+
+ <div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"><pre style="margin: 0; line-height: 125%"><span style="color: #008800; font-weight: bold">def</span> <span style="color: #0066BB; font-weight: bold">check_duplicate_smiles</span>(zincid_list, out_file, compare_simplified_smiles<span style="color: #333333">=</span><span style="color: #007020">False</span>):
+     <span style="color: #DD4422">"""</span>
+ <span style="color: #DD4422">    Scans a ZINC_ID,SMILE_string CSV file for duplicate SMILE strings.</span>
+
+ <span style="color: #DD4422">    Keyword arguments:</span>
+ <span style="color: #DD4422">        zincid_list (str): Path to a UTF-8 or ASCII formatted file that</span>
+ <span style="color: #DD4422">            contains 1 ZINC_ID + 1 SMILE String per row.</span>
+ <span style="color: #DD4422">            E.g.,</span>
+ <span style="color: #DD4422">            ZINC12345678,Cc1ccc(cc1C)OCCOc2c(cc(cc2I)/C=N/n3cnnc3)OC</span>
+ <span style="color: #DD4422">            ZINC01234567,C[C@H]1CCCC[NH+]1CC#CC(c2ccccc2)(c3ccccc3)O</span>
+ <span style="color: #DD4422">            [...]</span>
+ <span style="color: #DD4422">        out_file (str): Path to a new output CSV file that will be written.</span>
+ <span style="color: #DD4422">        compare_simplified_smiles (bool): If true, SMILE strings will be simplified</span>
+ <span style="color: #DD4422">            for the comparison.</span>
+ <span style="color: #DD4422">    </span>
+ <span style="color: #DD4422">    """</span>
+ </pre></div>
+
+ <div style="background: #ffffff; overflow:auto;width:auto;border:solid gray;border-width:.1em .1em .1em .8em;padding:.2em .6em;"><pre style="margin: 0; line-height: 125%"><span style="color: #008800; font-weight: bold">def</span> <span style="color: #0066BB; font-weight: bold">comp_two_files</span>(zincid_list1, zincid_list2, out_file, compare_simplified_smiles<span style="color: #333333">=</span><span style="color: #007020">False</span>):
+     <span style="color: #DD4422">"""</span>
+ <span style="color: #DD4422">    Compares SMILE strings across two ZINC_ID files for duplicates</span>
+ <span style="color: #DD4422">    (does not check for duplicates within each file).</span>
+
+ <span style="color: #DD4422">    Keyword arguments:</span>
+ <span style="color: #DD4422">        zincid_list1 (str): Path to a UTF-8 or ASCII formatted file that</span>
+ <span style="color: #DD4422">            contains 1 ZINC_ID + 1 SMILE String per row.</span>
+ <span style="color: #DD4422">            E.g.,</span>
+ <span style="color: #DD4422">            ZINC12345678,Cc1ccc(cc1C)OCCOc2c(cc(cc2I)/C=N/n3cnnc3)OC</span>
+ <span style="color: #DD4422">            ZINC01234567,C[C@H]1CCCC[NH+]1CC#CC(c2ccccc2)(c3ccccc3)O</span>
+ <span style="color: #DD4422">            [...]</span>
+ <span style="color: #DD4422">        zincid_list2 (str): Second ZINC_ID list file, similarly</span>
+ <span style="color: #DD4422">        out_file (str): Path to a new output CSV file that will be written.</span>
+ <span style="color: #DD4422">        compare_simplified_smiles (bool): If true, SMILE strings will be simplified</span>
+ <span style="color: #DD4422">            for the comparison.</span>
+ <span style="color: #DD4422">    </span>
+ <span style="color: #DD4422">    """</span>
+ </pre></div>

- <pre>def simplify_smile(smile_str):
-     Simplifies a SMILE string by removing hydrogen atoms (H),
-     chiral specifications ('@'), charges (+ / -), '#'-characters,
-     and square brackets ('[', ']').
-     Keyword Arguments:
-         smile_str (str): A smile string, e.g., C[C@H](CCC(=O)NCCS(=O)(=O)[O-])
-     Returns a simplified SMILE string, e.g., CC(CCC(=O)NCCS(=O)(=O)O)</pre>


  <br>
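
The `simplify_smile` docstring in this hunk describes the simplification as plain character removal. A minimal sketch of that idea follows; `simplify_smile_sketch` is a hypothetical stand-in written for illustration, not smilite's actual implementation, and it assumes that dropping every `H`, `@`, `+`, `-`, `#`, `[` and `]` character is all that is required.

```python
# Hypothetical re-implementation for illustration only; the real
# smilite.simplify_smile may differ in its details.
REMOVED_CHARS = "H@+-#[]"


def simplify_smile_sketch(smile_str):
    """Return smile_str without hydrogens, chirality marks, charges,
    '#' characters, and square brackets (character-level removal)."""
    return "".join(ch for ch in smile_str if ch not in REMOVED_CHARS)


# Example string taken from the docstring above.
print(simplify_smile_sketch("C[C@H](CCC(=O)NCCS(=O)(=O)[O-])"))
# prints: CC(CCC(=O)NCCS(=O)(=O)O)
```

Removal at the character level is lossy: in this sketch it would also strip the `H` from a two-letter symbol such as `Hg`, so the result is suited to comparing strings rather than to describing a molecule.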
@@ -99,7 +147,9 @@ If you downloaded the smilite package from [https://pypi.python.org/pypi/smilite
  <br>
  <br>

- ###gen_zincid_smile_csv.py
+ <p><a name="gen_zincid"></a></p>
+
+ ###gen_zincid_smile_csv.py (downloading SMILES)

  Generates a ZINC_ID,SMILE_STR csv file from an input file of
  ZINC IDs. The input file should consist of 1 column with 1 ZINC ID per row.
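
The input and output formats described here can be sketched without network access. In the sketch below, `write_zincid_smile_csv` and `fake_db` are hypothetical names introduced for illustration; the lookup is passed in as a callable, so `smilite.get_zinc_smile` (which needs an internet connection) could be supplied in real use. This is an assumption about how such a script might be structured, not the script's actual code.

```python
import csv
import io


def write_zincid_smile_csv(id_lines, out_handle, lookup):
    """Write one 'ZINC_ID,SMILE' row per non-empty input line.

    lookup is any callable that maps a ZINC ID to a SMILE string.
    """
    writer = csv.writer(out_handle, lineterminator="\n")
    for line in id_lines:
        zinc_id = line.strip()
        if zinc_id:  # skip blank rows
            writer.writerow([zinc_id, lookup(zinc_id)])


# Offline stand-in for the online database, using the example values
# quoted in the get_zinc_smile docstring.
fake_db = {"ZINC00029323": "COc1cccc(c1)NC(=O)c2cccnc2"}

out = io.StringIO()
write_zincid_smile_csv(["ZINC00029323\n", "\n"], out, fake_db.get)
print(out.getvalue(), end="")
```

Passing the lookup in as a parameter keeps the file handling testable on its own, independent of whether the ZINC database is reachable.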
@@ -129,10 +179,11 @@ Downloading SMILES

  <br>
  <br>
+ <p><a name="comp_smile"></a></p>

- ###comp_smile_strings.py
+ ###comp_smile_strings.py (checking for duplicates within 1 file)

- Compares SMILE strings in a 2 column CSV file (ZINC_ID,SMILE_string) to identify duplicates. Generates a new CSV file with ZINC IDs of identified
+ Compares SMILE strings within a 2 column CSV file (ZINC_ID,SMILE_string) to identify duplicates. Generates a new CSV file with ZINC IDs of identified
  duplicates listed in a 3rd-nth column(s).

  **Usage:**
@@ -173,6 +224,56 @@ Where
  ![](https://raw.github.com/rasbt/smilite/master/images/comp_simple_smiles.png)
  [comp_simple_smiles.csv](https://raw.github.com/rasbt/smilite/master/examples/comp_simple_smiles.csv)

+ <br>
+ <br>
+ <p><a name="comp_2_smile"></a></p>
+
+ ###comp_2_smile_files.py (checking for duplicates across 2 files)
+
+ Compares SMILE strings between 2 input CSV files, where each file consists of rows with 2 columns (ZINC_ID,SMILE_string), to identify duplicate SMILE strings across both files.
+ Generates a new CSV file with ZINC IDs of identified duplicates listed in the 4th-nth column(s).
+
+
+ **Usage:**
+ `[shell]>> python3 comp_2_smile_files.py in1.csv in2.csv out.csv [simplify]`
+
+ **Example:**
+ `[shell]>> python3 comp_2_smile_files.py ../examples/zid_smiles2.csv ../examples/zid_smiles3.csv ../examples/comp_2_files.csv`
+
+
+ <br>
+
+ **Input example file 1:**
+ ![](https://raw.github.com/rasbt/smilite/master/images/zid_smiles2.png)
+ [zid_smiles2.csv](https://raw.github.com/rasbt/smilite/master/examples/zid_smiles2.csv)
+
+ <br>
+
+ **Input example file 2:**
+ ![](https://raw.github.com/rasbt/smilite/master/images/zid_smiles3.png)
+ [zid_smiles3.csv](https://raw.github.com/rasbt/smilite/master/examples/zid_smiles3.csv)
+
+ <br>
+
+ **Output example file format:**
+ ![](https://raw.github.com/rasbt/smilite/master/images/comp_2_files.png)
+ [comp_2_files.csv](https://raw.github.com/rasbt/smilite/master/examples/comp_2_files.csv)
+
+ <br>
+
+ Where:
+ - 1st column: name of the origin file
+ - 2nd column: ZINC ID
+ - 3rd column: SMILE string
+ - 4th-nth column: ZINC IDs of duplicates
+
+
+
+
+
+
+
+
  <br>
  <br>

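
The output layout listed under "Where:" (origin file, ZINC ID, SMILE string, then the ZINC IDs of duplicates) can be illustrated with a small in-memory sketch. The function names, the `file_1`/`file_2` labels, and the toy IDs and SMILE strings below are all hypothetical. The sketch also assumes that every input row is written out, with no extra columns when nothing matches; the actual script may handle unmatched rows differently.

```python
import csv
import io
from collections import defaultdict


def read_rows(handle):
    """Return (zinc_id, smile) tuples from a 2-column CSV handle."""
    return [(row[0], row[1]) for row in csv.reader(handle) if len(row) >= 2]


def compare_two_sketch(rows1, rows2, name1="file_1", name2="file_2"):
    """Build rows of: origin, ZINC ID, SMILE, IDs of matches in the other file.

    Only cross-file matches are reported; duplicates within one file
    are not looked for.
    """
    def index(rows):
        by_smile = defaultdict(list)
        for zinc_id, smile in rows:
            by_smile[smile].append(zinc_id)
        return by_smile

    idx1, idx2 = index(rows1), index(rows2)
    out = []
    for name, rows, other in ((name1, rows1, idx2), (name2, rows2, idx1)):
        for zinc_id, smile in rows:
            out.append([name, zinc_id, smile] + other.get(smile, []))
    return out


# Toy data, made up for this example.
rows_a = read_rows(io.StringIO("A1,CCO\nA2,CCN\n"))
rows_b = read_rows(io.StringIO("B1,CCO\nB2,c1ccccc1\n"))
for row in compare_two_sketch(rows_a, rows_b):
    print(",".join(row))
```

To mimic the optional `simplify` argument, the SMILE strings could be passed through a simplification function before indexing, so that the comparison runs on the simplified form.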
@@ -194,6 +295,11 @@ or Twitter: [@rasbt](https://twitter.com/rasbt)
  Changelog
  ==========

+ **VERSION 1.3.0**
+
+ - added script and module function to compare SMILE strings across 2 files.
+
+
  **VERSION 1.2.0**

  - added Python 2.x support