-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathB_breakitup.py
163 lines (130 loc) · 5.8 KB
/
B_breakitup.py
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
""" Separate single phrase entries from multiple phrase entries
Description:
This script goes through ranges.pickle and determines if each dictionary entry contains
a single or multiple idiomatic expressions.
Example of a dictionary entry with a single expression:
abandon oneself to something to yield to the comforts or
delights of something. _ The children abandoned themselves
to the delights of the warm summer day.
Example of a dictionary entry with multiple expressions:
able to do something blindfolded and able to do something
standing on one’s head Fig. able to do something
very easily, possibly without even looking. (Able to can be
replaced with can.) _ Bill boasted that he could pass his
driver’s test blindfolded.
Input:
- ranges.pickle: This pickle file contains a list of tuples, where each tuple
consists of two integers representing the start and end lines of dictionary
entries in 'clean-output.docx'.
Output:
- ranges_SNGL.pickle: Ranges for entries with single expressions (list of tuples).
- ranges_MULT.pickle: Ranges for entries with multiple expressions (list of tuples).
- single_phrase_entries.txt (optional): Entries with single expressions (text file).
- single_phrase_entries.docx (optional): Entries with single expressions in DOCX format
for better readability.
- multiple_phrase_entries.txt (optional): Entries with multiple expressions (text file).
- multiple_phrase_entries.docx (optional): Entries with multiple expressions in DOCX format
for better readability.
Thought Process:
- A dictionary entry consists of three core parts: entry head, definition, and example,
each with unique font formatting.
- We differentiate between single and multiple idiomatic expressions based on characteristics
of the entry head.
- To analyze each entry in 'clean-output.docx', we break it down into 'runs' and use the
'runtype()' function to identify each part.
- This approach helps us identify the entry head.
- If the entry head includes 'and' in 'Minion-Regular' font with a text size of 7,
it indicates multiple expressions.
- If the entry head contains a semicolon ';', it also suggests the presence of multiple expressions.
Runtime:
- Creating ranges_SNGL.pickle and ranges_MULT.pickle: Completed in 19 seconds.
- Creating single_phrase_entries.txt: Completed in 2 minutes and 17 seconds (optional).
- Creating single_phrase_entries.docx: Completed in 9 minutes and 45 seconds (optional).
- Creating multiple_phrase_entries.txt: Completed 5 seconds (optional).
- Creating multiple_phrase_entries.docx: Completed 27 seconds (optional).
Usage:
Please run this script from the command line (CMD)
Example:
python B_breakitup.py
"""
import pickle
import docx
from tqdm import tqdm
from Z_module import cleanup, copy_docx, runtype
# load [ranges]
with open("files/ranges.pickle", "rb") as file:
ranges = pickle.load(file)
# open up clean-output.docx
doc = docx.Document("files/clean-output.docx")
lines = doc.paragraphs
ranges_SNGL = [] # a list of ranges - entries with single phrase
ranges_MULT = [] # a list of ranges - entries with multiple phrases
# go through each entry (start, end) and determine if it has a single phrase or multiple phrases
for s, e in tqdm(ranges):
if (s, e) in [(26246, 26255)]:
# hardcoding this one as it did not follow expected pattern
ranges_MULT.append((s, e))
continue
multiple_phrases = False
# create two lists for runs/alt pairs for each entry
line_alt = []
line_runs = []
for i in range(s, e + 1):
runs = lines[i].runs
for ri, run in enumerate(runs):
line_alt.append(runtype(ri, run))
line_runs.append(run.text)
# remove all items in both lists from the 1st 'definition' and beyond - only keep the entry head
line_alt, line_runs = cleanup(line_alt, line_runs)
for a, r in zip(line_alt, line_runs):
if (a == "and") or (a == "variable" and r.strip().endswith(";")):
multiple_phrases = True
if multiple_phrases:
# 2422 multiple phrase entries
ranges_MULT.append((s, e))
else:
ranges_SNGL.append((s, e))
# Output File #1
# pickle ranges_SNGL
with open("files/ranges_SNGL.pickle", "wb") as file:
pickle.dump(ranges_SNGL, file)
# Output File #2
# pickle ranges_MULT
with open("files/ranges_MULT.pickle", "wb") as file:
pickle.dump(ranges_MULT, file)
# Output File #3 (optional)
# save single phrase entries to a text file
print("creating single_phrase_entries.txt")
text_file_1 = str()
for s, e in tqdm(ranges_SNGL):
# add range
text_file_1 = text_file_1 + "\n" + f"[{s}, {e}]"
# add entry lines from clean-output.docx
for i in range(s, e + 1):
text_file_1 = text_file_1 + "\n" + lines[i].text
# add a line break after each entry
text_file_1 = text_file_1 + "\n" + "*" * 50
# write string to desk
with open("files/single_phrase_entries.txt", "w", encoding="UTF-8") as myfile:
myfile.write(text_file_1)
# Output File #4 (optional)
# save single phrase entries to a docx file - for better readability
copy_docx(ranges_SNGL, "single_phrase_entries")
# Output File #5 (optional)
# save multiple phrase entries to a text file
print("creating multiple_phrase_entries.txt")
text_file_2 = str()
for s, e in tqdm(ranges_MULT):
# add range
text_file_2 = text_file_2 + "\n" + f"[{s}, {e}]"
# add entry lines from clean-output.docx
for i in range(s, e + 1):
text_file_2 = text_file_2 + "\n" + lines[i].text
# add a line break after each entry
text_file_2 = text_file_2 + "\n" + "*" * 50
# write string to desk
with open("files/multiple_phrase_entries.txt", "w", encoding="UTF-8") as myfile:
myfile.write(text_file_2)
# Output File #6 (optional)
# save multiple phrase entries to a docx file - for better readability
copy_docx(ranges_MULT, "multiple_phrase_entries")