# A_splitrandom.py
"""Random and Unpatterned Variation
This module handles dictionary entries with random and unpatterned variations,
ensuring accurate processing of complex English expressions within the parsing pipeline.
It identifies, segments, constructs, and integrates new entries, enhancing the overall
effectiveness of the package's data refinement process.
All subsequent Python modules only work with two types of dictionary entries:
1. entries with a single English expression in the entry head
e.g.
*an A for effort Fig. acknowledgement for having tried
to do something, even if it was not successful. (*Typically:
get ~; give someone ~.) _ The plan didn’t work, but
I’ll give you an A for effort for trying.
2. entries with multiple English expressions in the entry head
e.g.
bail someone out of jail and bail someone out† 1. Lit. to
deposit a sum of money that allows someone to get out of
jail while waiting for a trial. _John was in jail. I had to go
down to the police station to bail him out. _ I need some
cash to bail out a friend! 2. Fig. to help someone who is
having difficulties. _ When my brother went broke, I had
to bail him out with a loan.
In both cases (as shown above), the English expressions are always located at the very
beginning of the entry (the entry head), followed by one or more senses/definitions, each of
which comes with one or more dictionary examples.
Over 90% of the entries in the book fall into one of the two types mentioned above.
A third type (not yet implemented) has multiple English expressions that can show up anywhere in
the entry body, and those expressions have multiple senses that may belong to one expression
but not to the others.
e.g.
give someone a lift 1. and give someone a ride Fig. to
provide transportation for someone. _ I’ve got to get into
town. Can you give me a lift? 2. Fig. to raise someone’s
spirits; to make a person feel better. _ It was a good conversation,
and her kind words really gave me a lift.
In the example above:
- there are 2 expressions: give someone a lift & give someone a ride
- only the first expression has two senses; the second expression has only one sense
- that means if we were to parse this entry using C_readit.py as if it were a type 2 entry,
the results would be confusing and misleading to the end user, because both phrases would
receive both senses:
{
"phrase": "give someone a lift",
"definition": "1. to provide transportation for someone. _ I’ve got to get into town.
Can you give me a lift? 2. to raise someone’s spirits;
to make a person feel better. _ It was a good conversation,
and her kind words really gave me a lift.",
}
{
"phrase": "give someone a ride",
"definition": "1. to provide transportation for someone. _ I’ve got to get into town.
Can you give me a lift? 2. to raise someone’s spirits;
to make a person feel better. _ It was a good conversation,
and her kind words really gave me a lift.",
}
The code below parses such entries and generates new dictionary entries, each of which is either
type 1 or type 2, so that the subsequent Python modules can process them normally.
Thought process:
step 1: locate all entries in clean-output.docx that are type 3 and capture their respective ranges
(count = 128)
step 2: cut each entry into smaller pieces. Each piece should contain an entry head
(with one or more English expressions) together with the senses that follow it.
Using the example above, the entry would be broken down into two pieces:
[
give someone a lift 1.
]
[
give someone a ride Fig. to
provide transportation for someone. _ I’ve got to get into
town. Can you give me a lift? 2. Fig. to raise someone’s
spirits; to make a person feel better. _ It was a good conversation,
and her kind words really gave me a lift.
]
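A minimal sketch of the splitting idea in step 2, using plain (text, is_bold) tuples as a
hypothetical stand-in for docx runs. The function name split_on_sense_and and the sample tokens
are assumptions made for illustration; the module itself works on real runs via runtype() and
also handles a "term" run sitting between the sense number and "and", which is omitted here.

```python
import re

# Hypothetical stand-in for docx runs: (text, is_bold) tuples.
SENSE_NUMBER = re.compile("[1-9][.]")  # a bold sense number such as "1."


def split_on_sense_and(tokens):
    # Split the token list wherever "and" directly follows a bold sense number.
    pieces, current = [], []
    for pos, (text, is_bold) in enumerate(tokens):
        prev = tokens[pos - 1] if pos > 0 else None
        follows_sense = (
            prev is not None and prev[1] and SENSE_NUMBER.search(prev[0].strip())
        )
        if text.strip() == "and" and follows_sense:
            pieces.append(current)  # close the current piece; the "and" is dropped
            current = []
        else:
            current.append((text, is_bold))
    pieces.append(current)
    return pieces


tokens = [
    ("give someone a lift", True),
    ("1.", True),
    ("and", False),
    ("give someone a ride", True),
    ("Fig. to provide transportation for someone.", False),
    ("2.", True),
    ("Fig. to raise someone's spirits.", False),
]
pieces = split_on_sense_and(tokens)
```

Here pieces[0] holds the first head with its dangling sense number, and pieces[1] holds the
second head followed by every remaining sense, mirroring the two pieces shown above.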
step 3: construct new entries by rearranging alt/runs to give each entry the relevant
definition/examples.
Rule to follow: the first entry head inherits all subsequent senses; each of the remaining
heads only gets the single sense associated with it.
Using the example above, the entry is converted into the following two entries:
[
give someone a lift 1. Fig. to provide transportation for
someone. _ I’ve got to get into town. Can you give me a lift?
2. Fig. to raise someone’s spirits; to make a person feel better.
_ It was a good conversation, and her kind words really gave me a lift.
]
[
give someone a ride Fig. to provide transportation for someone.
_ I’ve got to get into town. Can you give me a lift?
]
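The rule in step 3 can be sketched on simplified data as follows. This is only an illustration
under stated assumptions: each token is a (label, text) tuple, the labels "head", "definition",
"example" and "new_sense" are simplified stand-ins for what runtype() returns, and
redistribute_senses is a hypothetical name that does not exist in the module.

```python
def redistribute_senses(pieces):
    # pieces: list of token lists; each token is a (label, text) tuple.
    merged = list(pieces[0])  # the first head collects every sense
    trimmed = []
    for piece in pieces[1:]:
        labels = [label for label, _ in piece]
        # the body is everything after the leading head tokens
        body_start = next(
            (i for i, label in enumerate(labels) if label != "head"), len(piece)
        )
        merged.extend(piece[body_start:])
        # later heads keep only their first sense
        cut = labels.index("new_sense") if "new_sense" in labels else len(piece)
        trimmed.append(piece[:cut])
    return [merged] + trimmed


pieces = [
    [("head", "give someone a lift"), ("new_sense", "1.")],
    [
        ("head", "give someone a ride"),
        ("definition", "Fig. to provide transportation for someone."),
        ("example", "Can you give me a lift?"),
        ("new_sense", "2."),
        ("definition", "Fig. to raise someone's spirits."),
    ],
]
entries = redistribute_senses(pieces)
```

With this input, entries[0] carries both senses under "give someone a lift", while entries[1]
keeps only the transportation sense under "give someone a ride".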
step 4: write the newly generated entries to clean-output.docx and capture their new ranges
- (start, end) for each entry (265 new entries are generated)
step 5: remove captured_ranges from ranges.pickle, and then add new_ranges to it.
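The bookkeeping in step 5 amounts to a list update on (start, end) tuples. A small sketch,
assuming plain tuples and made-up range values; update_ranges is a hypothetical helper name,
and the pickle load/dump around it is left out.

```python
def update_ranges(ranges, captured_ranges, new_ranges):
    # Drop the ranges of the type 3 entries, then append the ranges of the
    # newly written entries, skipping any that are already present.
    captured = set(captured_ranges)
    kept = [r for r in ranges if r not in captured]
    seen = set(kept)
    for r in new_ranges:
        if r not in seen:
            kept.append(r)
            seen.add(r)
    return kept


# Illustrative values only, not taken from the real document.
ranges = [(0, 4), (5, 9), (10, 14)]
updated = update_ranges(ranges, [(5, 9)], [(15, 17), (18, 19)])
```

Skipping ranges that are already present keeps the list free of duplicates if the module is
run more than once.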
"""
import pickle
import re
import docx
from docx.shared import Pt # for run.font.size = Pt(r.font.size.pt)
from Z_module import copy_docx, runtype
# load entry ranges []
with open("files/ranges.pickle", "rb") as file:
    ranges = pickle.load(file)
# open up clean-output.docx
doc = docx.Document("files/clean-output.docx")
lines = doc.paragraphs
# step 1
captured_ranges = []
for s, e in ranges:
    entry_runs = []  # capture all entry runs
    for i in range(s, e + 1):
        entry_runs.extend(lines[i].runs)
    for i, r in enumerate(entry_runs):
        # examine entry runs - look for '[1-9]\. and'
        try:
            if (
                r.bold
                and re.search(r"[1-9]\.", r.text.strip())
                and (
                    runtype(i, entry_runs[i + 1]) == "and"
                    or runtype(i, entry_runs[i + 2]) == "and"
                    # or statement for (3620, 3628), (74394, 74400)
                )
            ):
                captured_ranges.append((s, e))
                break
        except IndexError:
            continue
# save captured entries to a docx file - for better readability
copy_docx(captured_ranges, "entries_with_random_variation")
# step 2
entry_pieces = []  # [([[alt1], [alt2]], [[runs1], [runs2]]), (), (), ...]
for s, e in captured_ranges:
    entry_alt = []
    entry_runs = []
    tmp_alt = []
    tmp_runs = []
    split_alt = []
    split_runs = []
    for i in range(s, e + 1):
        runs = lines[i].runs
        for ri, run in enumerate(runs):
            entry_alt.append(runtype(ri, run, mode="MUL"))
            entry_runs.append(run)
        # keep the original formatting - add new line at the end of each paragraph
        entry_alt.append("new_line")
        entry_runs.append("new_line")
    # break up entry_alt & entry_runs into smaller lists where each is an independent entry
    for c, (a, r) in enumerate(zip(entry_alt, entry_runs)):
        if a == "new_line":
            tmp_alt.append(a)
            tmp_runs.append(r)
            continue
        try:
            if runtype(c, r) == "and" and (
                (
                    entry_runs[c - 1].bold
                    and re.search(r"[1-9]\.", entry_runs[c - 1].text.strip())
                )
                or (
                    # this condition is for [42891, 42899]
                    runtype(c, entry_runs[c - 1]) == "term"
                    and entry_runs[c - 2].bold
                    and re.search(r"[1-9]\.", entry_runs[c - 2].text.strip())
                )
            ):
                # the "and" run closes the current piece and is itself dropped
                split_alt.append(tmp_alt)
                split_runs.append(tmp_runs)
                tmp_alt = []
                tmp_runs = []
            else:
                tmp_alt.append(a)
                tmp_runs.append(r)  # capture run as is - not in text format
        except IndexError:
            continue
    split_alt.append(tmp_alt)
    split_runs.append(tmp_runs)
    # for c, (la, lr) in enumerate(zip(split_alt, split_runs), start=1):
    #     print(f"{c}\n{la}\n", "-" * 50, f"\n{lr}")
    # print("*" * 50)
    entry_pieces.append((split_alt, split_runs))
# step 3


def headless_entry(entry_alt, entry_runs):
    """Return the entry body (definition/examples) after removing the entry head.

    Args:
        entry_alt (list): list of entry alternatives
        entry_runs (list): list of entry runs
    """
    # find indexes for definition|example|term|article
    indexes = []
    items_to_find = ["definition", "term", "example", "article"]
    for item in items_to_find:
        try:
            index = entry_alt.index(item)
            indexes.append(index)
        except ValueError:
            # Handle the case where the item is not found in entry_alt
            pass
    # sort indexes and remove duplicates
    sorted_indexes = sorted(set(indexes))
    min_index = (
        sorted_indexes[0] if sorted_indexes[0] != 0 else sorted_indexes[1]
    )  # smallest non-zero index is the start of the entry body
    return entry_alt[min_index:], entry_runs[min_index:]


def single_sense(entry_alt, entry_runs):
    """Only retain the first sense/definition in an entry with multiple senses/definitions.

    Args:
        entry_alt (list): list of alternatives
        entry_runs (list): list of runs
    """
    # get every new_sense index; list.index() only returns the first match,
    # which would leave the second branch below unreachable
    sorted_indexes = [i for i, alt in enumerate(entry_alt) if alt == "new_sense"]
    if sorted_indexes and sorted_indexes[0] != 0:
        return entry_alt[: sorted_indexes[0]], entry_runs[: sorted_indexes[0]]
    if len(sorted_indexes) > 1 and sorted_indexes[0] == 0:
        return entry_alt[: sorted_indexes[1]], entry_runs[: sorted_indexes[1]]
    return entry_alt, entry_runs


for alts, runs in entry_pieces:
    for indx in range(1, len(alts)):
        # add all senses to the very first entry
        additional_alt, additional_runs = headless_entry(alts[indx], runs[indx])
        alts[0].extend(additional_alt)
        runs[0].extend(additional_runs)
        # make sure all senses have a single definition - starting from index #1
        alts[indx], runs[indx] = single_sense(alts[indx], runs[indx])
# the debug print below needs step 2 to capture run.text (instead of the run object) to work
# for alts, runs in entry_pieces:
#     print("Entry")
#     for i in range(len(alts)):
#         print("".join(runs[i]))
#         print("-" * 50)
#     print("\n", "*" * 50)
# step 4
# write newly constructed entries to clean-output.docx


def runs_to_lines(lst_of_runs):
    """Break a list of runs on the "new_line" marker.

    Args:
        lst_of_runs (list): list of items that contain the string "new_line"
            that should be used to break up the list.
    """
    lines = []
    tmp_line = []
    for indx, itm in enumerate(lst_of_runs):
        # ignore a leading or trailing "new_line" marker
        if indx == 0 and itm == "new_line":
            continue
        if indx == (len(lst_of_runs) - 1) and itm == "new_line":
            break
        if itm == "new_line":
            lines.append(tmp_line)
            tmp_line = []
        else:
            tmp_line.append(itm)
    if tmp_line:
        lines.append(tmp_line)
    return lines


# open up clean-output.docx
doc = docx.Document("files/clean-output.docx")
lines = doc.paragraphs
# I've hardcoded the total number of lines in the original clean-output.docx to avoid
# writing the new entries into the document again every time this module runs.
line_number = 89728
FIRST_TIME = True  # check if this was the first time we ran this module or not
if len(lines) - 1 > line_number:
    FIRST_TIME = False
new_ranges = []
for alts, runs in entry_pieces:
    for new_entry in runs:
        # break [new_entry] into lines e.g. [[line1], [line2], ...]
        new_entry_lines = runs_to_lines(new_entry)
        entry_line_numbers = []  # new entry line numbers in the docx file
        for line in new_entry_lines:
            paragraph = doc.add_paragraph()  # initiate blank line
            paragraph.paragraph_format.space_before = Pt(1)
            paragraph.paragraph_format.space_after = Pt(1)
            for r in line:
                # add each run to the new paragraph/line in doc
                run = paragraph.add_run(r.text)
                # apply original run style
                if r.bold:
                    run.bold = True
                if r.italic:
                    run.italic = True
                run.font.name = r.font.name
                run.font.size = Pt(r.font.size.pt)
            # Increment line number counter after adding the paragraph
            line_number += 1
            entry_line_numbers.append(line_number)
        new_ranges.append(
            (entry_line_numbers[0], entry_line_numbers[-1])
        )  # store as tuple (start, end)
# only save on the first run so the new entries are not written twice
if FIRST_TIME:
    doc.save("files/clean-output.docx")
# step 5
with open("files/ranges.pickle", "rb") as file:
    ranges = pickle.load(file)
# remove captured_ranges from ranges.pickle
ranges = [r for r in ranges if r not in captured_ranges]
# add new_ranges
ranges.extend(item for item in new_ranges if item not in ranges)
with open("files/ranges.pickle", "wb") as file:
    pickle.dump(ranges, file)