r""" Script for Validating the Search Algorithm
Description:
This script is designed to validate the functionality of the `get_potential_matches()` and
`look_closer()` functions in 'L_algorithm.py' by testing them against a large dataset of examples.
The purpose is to verify whether the results obtained from the search algorithm correctly match the
expected dictionary range for each example.
This validation process is critical for ensuring the accuracy and reliability of the search algorithm.
The script works as follows:
1. It loads a pre-parsed list of examples, where each example includes its expected dictionary
range and the associated English sentence.
2. The script processes each example in parallel, using multiprocessing to speed up execution.
3. For each example, it calls the `get_potential_matches()` function to find potential dictionary
entries and then uses `look_closer()` to refine the matches.
4. It checks if the correct dictionary range for the example exists in the results obtained
from the search algorithm. If the range is found, the example is considered a successful match.
5. The results for each example - the example itself, the expected range, the obtained results,
the count of matches, and whether it's a successful match - are recorded (see the sketch below).
6. The final output is saved to a pickle file, which will serve as input for 'N_report.py'.
This report generation script will further analyze the data and present it in a
human-readable format to demonstrate the accuracy of the search algorithm.
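Each record appended to the output has roughly the following shape (the values here are
illustrative; the exact fields are built in process_example() below):
    {
        "example": "The plan didn’t work, but I’ll give you an A for effort for trying.",
        "example_range": [0, 3],
        "results": [[0, 3]],
        "results_count": 1,
        "match": True,
        "match_index": 1,
    }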
Note: My MacBook Air struggles to go through the 18395 entries in one go in order to create a
validation, so I've broken the task down into 19 smaller pieces. I use run_validation.sh to start
a new validation and validation_progress.pickle to keep track of the progress. For every new
validation, you need to delete validation_progress.pickle.
Input:
- examples.pickle
Output:
- validation_data.pickle
- validation_progress.pickle
Runtime:
The runtime for the validation process, when executed on a single CPU core,
is estimated to be around 30 hours (10 hours on a MacBook Air utilizing 8 cores
and running the validation through run_validation.sh).
Usage:
Please run this script from the command line (CMD)
Example:
python M_validate.py
Validation history:
- validation_data_29_06_2023_21_55
This is the first validation I ran; it contained all entries in phrases.json - nothing chopped off.
After running it, I decided to remove the 318 most frequently occurring incorrect entries
from 'phrases.json' and ran another validation.
- validation_data_03_07_2023_14_14
- validation_data_26_02_2024_5_42
Ran this validation after updating the algorithm and switching from using record
to combinations - no entries were removed.
- validation_data_28_02_2024_9_01
Made a minor change to the regex in the algorithm: changed \b to \W to include words that end
with an apostrophe.
- validation_data_01_03_2024_8_20
reason:
The previous change made some problems appear that weren't there before, so I needed to
update the regex and span in the look_closer() function.
results:
Slightly better than validation_data_26_02_2024_5_42;
only a one percent increase in algorithm accuracy.
- validation_data_12_03_2024_02_09
reason:
I've added the Python module A_splitrandom to the workflow, and it adds 265 new entries to
phrases.json. I've also changed how examples get parsed in K_getexamples, so there are more
examples/entries.
results:
Accuracy is still at 91%.
- validation_data_13_03_2024
reason:
I've decreased the distance factor calculated by max_words_between() in L_algorithm to 2
(3 was the original number) to see how that would affect the accuracy of the results.
results:
Accuracy went down to 88%, with no significant improvement in any aspect, so I'm reverting the
distance back to 3.
"""
import datetime
import itertools
import os
import pickle
import sys
from multiprocessing import Pool
import nltk
from nltk.stem.wordnet import WordNetLemmatizer
from tqdm import tqdm
from englishidioms.L_algorithm import get_potential_matches, look_closer
# Manually point NLTK to my top-level nltk_data folder that includes WordNet
# https://www.nltk.org/data.html
nltk_data_dir = os.path.join(os.path.dirname(__file__), "englishidioms", "nltk_data")
nltk.data.path.append(nltk_data_dir)
def process_example(example):
"""
Process an example for validation.
Given an example containing a known dictionary range and an English sentence,
this function performs the following tasks:
1. Executes the search algorithm by calling `get_potential_matches()` and `look_closer()`
functions to identify potential dictionary entries and refined idiomatic expression matches.
2. Compares the known dictionary range with the ranges obtained from the search results.
3. Records the validation results, indicating whether the known range correctly matches
the search results.
Args:
example (list): A list containing the known dictionary range and an English sentence example
e.g. [[0, 3], ['The plan didn’t work, but I’ll give you an A for effort for trying.']]
Returns:
results (list): A list of dictionaries, each containing information about the validation
result for a specific example.
"""
example_range = example[0]
results = []
for ex in example[1]:
# Run the example against the search algorithm
# Tokenize the sentence into words
words = nltk.word_tokenize(ex.lower())
# Tag the words with their respective parts of speech
pos_tags = nltk.pos_tag(words)
# Lemmatize each word based on its POS tag
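# Only POS tags whose first letter is already a valid WordNet POS ('a', 'n' or 'v') are
# passed to the lemmatizer; every other tag falls back to the lemmatizer's default (noun)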
sentence_lemma = [
(
WordNetLemmatizer().lemmatize(word, pos[0].lower())
if pos[0].lower() in ["a", "n", "v"]
else WordNetLemmatizer().lemmatize(word)
)
for word, pos in pos_tags
]
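# get_potential_matches() narrows the dictionary down to candidate entries for this sentence;
# look_closer() then re-scores those candidates and returns a refined, ranked list of matches,
# each carrying the dictionary 'range' of the entry it matched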
potential_matches = get_potential_matches(ex.lower(), sentence_lemma)
lc = look_closer(potential_matches, ex.lower(), sentence_lemma)
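# Keep the dictionary ranges of the top ten matches; itertools.groupby() only collapses
# consecutive duplicate ranges, so the ranking order is preserved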
matches = list(
k for k, _ in itertools.groupby([item[0]["range"] for item in lc[:10]])
)
results.append(
{
"example": ex,
"example_range": example_range,
"results": matches,
"results_count": len(matches),
"match": example_range in matches, # True or False
"match_index": (
matches.index(example_range) + 1
if example_range in matches
else "NA"
),
}
)
return results
if __name__ == "__main__":
# Load up all the examples - a list of lists, e.g.
# [[[0, 3], ['The plan didn’t work, but I’ll give you an A for effort for trying.']], ...]
with open("files/examples.pickle", "rb") as file:
examples = pickle.load(file)
# The MacBook Air struggles to go through the 18395 entries in one go
# breaking the task into smaller parts
num_parts = 18
part_length = len(examples) // num_parts
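# Note: slicing with a step of part_length leaves one extra, shorter part at the end when
# len(examples) is not evenly divisible by num_parts (18395 entries split into 18 parts gives
# 19 slices, matching the '19 smaller pieces' mentioned in the module docstring)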
parts = [
examples[i : i + part_length] for i in range(0, len(examples), part_length)
]
# Generate a timestamp & file name
date_time = datetime.datetime.now()
file_name = f'validation_data_{date_time.strftime("%d_%m_%Y")}'
# file_name = "validation_data_13_03_2024"
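# If part of today's validation has already been processed, load the partial output so the
# results of this run are appended to it rather than overwriting it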
if os.path.exists(f"files/{file_name}.pickle"):
with open(f"files/{file_name}.pickle", "rb") as file:
output = pickle.load(file)
else:
output = []
if os.path.exists("files/validation_progress.pickle"):
with open("files/validation_progress.pickle", "rb") as file:
validation_progress = pickle.load(file)
else:
validation_progress = 0
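# validation_progress is the index of the next part to process; delete
# 'validation_progress.pickle' to start a fresh validation run (see the module docstring)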
if validation_progress > len(parts) - 1:
sys.exit(f"All {len(parts)} parts have been processed.")
print(f"Part {validation_progress + 1} out of {len(parts)}")
# Set the number of worker processes to the desired level of parallelism
num_processes = 8
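# Pool.imap() yields results lazily and in input order, so tqdm can show per-example
# progress as the worker processes finish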
with Pool(num_processes) as pool:
# Map the processing function to each example in parallel
results = list(
tqdm(
pool.imap(process_example, parts[validation_progress]),
total=len(parts[validation_progress]),
)
)
# Collect the results from all processes
for res in results:
output.extend(res)
# Save the output to a pickle file for 'N_report.py'
with open(f"files/{file_name}.pickle", "wb") as file:
pickle.dump(output, file)
# Update the progress counter and save it to 'validation_progress.pickle'
validation_progress += 1
with open("files/validation_progress.pickle", "wb") as file:
pickle.dump(validation_progress, file)
print(f"Part {validation_progress} completed.")