Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
Using Python version 3.6.8 (default, Dec 29 2018 19:04:46)
SparkSession available as 'spark'.
Delphi APIs (version 0.1.0-spark3.1-EXPERIMENTAL) available as 'delphi'.
>>> # Loads the target data, then defines a table for it
... spark.read \
... .option("header", True) \
... .csv("./testdata/raha/rayyan.csv") \
... .write \
... .saveAsTable("rayyan")
>>>
>>> delphi.misc \
... .options({"db_name": "default", "table_name": "rayyan", "row_id": "id"}) \
... .flatten() \
... .write \
... .saveAsTable("rayyan_flatten")
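>>> # flatten() unpivots the table into one (id, attribute, value) row per
... # cell, as the show() on rayyan_flatten below confirms. A minimal sketch
... # of the same unpivot in plain Spark SQL, for two columns only (an
... # illustration, not how delphi implements it):
... flattened_sketch = spark.sql("""
...     SELECT id, stack(2,
...         'article_title', article_title,
...         'article_language', article_language) AS (attribute, value)
...     FROM rayyan""")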
>>>
>>> spark.table("rayyan").show(1)
+------+--------------------+----------------+--------------+--------------------+--------------------+---------------+--------------+-------------------+------------------+-----------------+
| id| article_title|article_language| journal_title|jounral_abbreviation| journal_issn|article_jvolumn|article_jissue|article_jcreated_at|article_pagination| author_list|
+------+--------------------+----------------+--------------+--------------------+--------------------+---------------+--------------+-------------------+------------------+-----------------+
|235295|Late repair of in...| eng|Proc R Soc Med|Proceedings of th...|0035-9157 (Print)...| 64| 12| 1/1/71| 1187-9|"{""A. G. Parks""|
+------+--------------------+----------------+--------------+--------------------+--------------------+---------------+--------------+-------------------+------------------+-----------------+
only showing top 1 row
>>> spark.table("rayyan_flatten").show(1)
+------+-------------+--------------------+
| id| attribute| value|
+------+-------------+--------------------+
|235295|article_title|Late repair of in...|
+------+-------------+--------------------+
only showing top 1 row
>>>
>>> # Loads the ground-truth data, then defines a table for it
... spark.read \
... .option("header", True) \
... .csv("./testdata/raha/rayyan_clean.csv") \
... .write \
... .saveAsTable("rayyan_clean")
>>>
>>> spark.table("rayyan_flatten") \
... .join(spark.table("rayyan_clean"), ["id", "attribute"], "inner") \
... .where("not(value <=> correct_val)") \
... .write \
... .saveAsTable("error_cells_ground_truth")
>>>
>>> spark.table("rayyan_clean").show(1)
+------+-------------+--------------------+
| id| attribute| correct_val|
+------+-------------+--------------------+
|235295|article_title|Late repair of in...|
+------+-------------+--------------------+
only showing top 1 row
>>> spark.table("error_cells_ground_truth").show(1)
+------+---------------+-----+-----------+
| id| attribute|value|correct_val|
+------+---------------+-----+-----------+
|498345|article_jvolumn| null| -1|
+------+---------------+-----+-----------+
only showing top 1 row
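>>> # In the run() below, setErrorCells() hands the ground-truth error cells
... # to the repairer, so the error detection phase just loads them instead
... # of detecting errors itself (see the "[Error Detection Phase]" logs).
... # "model.hp.no_progress_loss" appears to control hyperopt's early
... # stopping (an assumption based on the "hyperopt: #eval=..." logs), and
... # setDiscreteThreshold(400) presumably caps how many distinct values a
... # column may have to be modeled as discrete (also an assumption).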
>>> # Detects error cells, then repairs them
... repaired_df = delphi.repair \
... .setDbName("default") \
... .setTableName("rayyan") \
... .setRowId("id") \
... .setErrorCells("error_cells_ground_truth") \
... .setDiscreteThreshold(400) \
... .option("model.hp.no_progress_loss", "100") \
... .run()
2021-08-20 21:29:40.726:[Error Detection Phase] Error cells provided by `error_cells_20210820212940`
2021-08-20 21:29:41.239:[Error Detection Phase] 897 noisy cells found (8.97%)
2021-08-20 21:29:41.239:Elapsed time (name: error detection) is 0.5977351665496826(s)
2021-08-20 21:29:42.522:Target repairable columns are article_jcreated_at,article_language,article_jissue,article_jvolumn in noisy columns (article_jcreated_at,journal_title,article_language,article_title,article_jissue,article_jvolumn,author_list,journal_issn,article_pagination)
2021-08-20 21:29:42.522:[Error Detection Phase] Analyzing cell domains to fix error cells...
2021-08-20 21:29:45.245:Elapsed time (name: cell domain analysis) is 2.722935914993286(s)
2021-08-20 21:29:45.976:[Error Detection Phase] 0 noisy cells fixed and 897 error cells (8.97%) remaining...
2021-08-20 21:29:48.226:[Repair Model Training Phase] Building 4 models to repair the cells in article_jcreated_at,article_language,article_jissue,article_jvolumn
2021-08-20 21:29:49.450:Building 1/4 model... type=classfier y=article_jcreated_at features=article_title,article_language,journal_title,jounral_abbreviation,journal_issn,article_jvolumn,article_jissue,article_pagination,author_list #rows=193 #class=106
2021-08-20 21:37:00.834:hyperopt: #eval=359/100000000
2021-08-20 21:37:02.954:Finishes building 'article_jcreated_at' model... score=0.013801562117544972 elapsed=433.5030448436737s
2021-08-20 21:37:03.132:Building 2/4 model... type=classfier y=article_language features=article_title,journal_title,jounral_abbreviation,journal_issn,article_jvolumn,article_jissue,article_jcreated_at,article_pagination,author_list #rows=641 #class=33
2021-08-20 21:40:39.980:hyperopt: #eval=262/100000000
2021-08-20 21:40:41.009:Finishes building 'article_language' model... score=0.13325384613385818 elapsed=217.8760130405426s
2021-08-20 21:40:41.186:Building 3/4 model... type=classfier y=article_jissue features=article_title,article_language,journal_title,jounral_abbreviation,journal_issn,article_jvolumn,article_jcreated_at,article_pagination,author_list #rows=947 #class=42
2021-08-20 21:55:14.069:hyperopt: #eval=484/100000000
2021-08-20 21:55:15.929:Finishes building 'article_jissue' model... score=0.0537545889732808 elapsed=874.7432188987732s
2021-08-20 21:55:16.109:Building 4/4 model... type=classfier y=article_jvolumn features=article_title,article_language,journal_title,jounral_abbreviation,journal_issn,article_jissue,article_jcreated_at,article_pagination,author_list #rows=976 #class=188
2021-08-20 22:11:08.753:hyperopt: #eval=159/100000000
2021-08-20 22:11:15.956:Finishes building 'article_jvolumn' model... score=0.017940547637166914 elapsed=959.8459448814392s
2021-08-20 22:11:15.956:Elapsed time (name: repair model training) is 2488.9358837604523(s)
2021-08-20 22:11:18.752:[Repairing Phase] Computing 797 repair updates in 744 rows...
2021-08-20 22:11:24.558:Elapsed time (name: repairing) is 8.599979877471924(s)
2021-08-20 22:11:24.790:!!!Total Processing time is 2504.1765060424805(s)!!!
>>>
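>>> # repaired_df holds one row per repaired cell, with the columns
... # (id, attribute, current_value, repaired) that appear in the joined
... # output below.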
>>> # Computes performance numbers (precision & recall)
... # - Precision: the fraction of correct repairs, i.e., repairs that match
... # the ground truth, over the total number of repairs performed
... # - Recall: correct repairs over the total number of errors
... pdf = repaired_df.join(
... spark.table("rayyan_clean").where("attribute != 'Score'"),
... ["id", "attribute"], "inner")
>>> rdf = repaired_df.join(
... spark.table("error_cells_ground_truth").where("attribute != 'Score'"),
... ["id", "attribute"], "right_outer")
>>>
>>> # Compares predicted values with the correct ones
... pdf.orderBy("attribute").show()
+------+-------------------+-------------+--------+-----------+
| id| attribute|current_value|repaired|correct_val|
+------+-------------------+-------------+--------+-----------+
|258486|article_jcreated_at| 1/1/10| 5/1/90| 1/10/01|
|405141|article_jcreated_at| 1/1/13| 1/1/78| 1/13/01|
|263027|article_jcreated_at| 1/1/13| 1/1/96| 1/13/01|
|102893|article_jcreated_at| 11/1/10| 1/1/93| 1/10/11|
|278396|article_jcreated_at| 1/1/13| 5/1/90| 1/13/01|
|111970|article_jcreated_at| 6/1/14| 9/23/14| 1/14/06|
|282039|article_jcreated_at| 1/1/08| 8/13/10| 1/8/01|
|131820|article_jcreated_at| 12/1/13| 5/1/90| 1/13/12|
|292018|article_jcreated_at| 8/1/02| 1/1/93| 1/2/08|
|164517|article_jcreated_at| 1/1/02| 5/22/14| 1/2/01|
|293974|article_jcreated_at| 1/1/14| 1/1/93| 1/14/01|
|176160|article_jcreated_at| 1/1/11| 1/1/96| 1/11/01|
|294719|article_jcreated_at| 11/1/13| 1/1/78| 1/13/11|
|184949|article_jcreated_at| 1/1/10| 1/1/96| 1/10/01|
|299635|article_jcreated_at| 5/1/11| 1/1/78| 1/11/05|
| 22708|article_jcreated_at| 1/1/14| 1/1/00| 1/14/01|
|385346|article_jcreated_at| 10/1/13| 1/1/93| 1/13/10|
|302969|article_jcreated_at| 1/1/07| 1/1/78| 1/7/01|
|245822|article_jcreated_at| 1/1/02| 1/1/78| 1/2/01|
|305933|article_jcreated_at| 1/1/14| 1/1/93| 1/14/01|
+------+-------------------+-------------+--------+-----------+
only showing top 20 rows
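>>> # None of the repaired article_jcreated_at values above match
... # correct_val: the ground truth appears to encode the date components
... # in a different order than the source (e.g., 1/1/10 vs. 1/10/01),
... # a format the model never produces, so precision and recall below
... # both come out as 0.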
>>>
>>> precision = pdf.where("repaired <=> correct_val").count() / pdf.count()
>>> recall = rdf.where("repaired <=> correct_val").count() / rdf.count()
>>> f1 = (2.0 * precision * recall) / (precision + recall)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ZeroDivisionError: float division by zero
>>>
>>> print("Precision={} Recall={} F1={}".format(precision, recall, f1))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'f1' is not defined
>>> print("Precision={} Recall={} F1=".format(precision, recall))
Precision=0.0 Recall=0.0 F1=
>>>
>>> f1 = (2.0 * precision * recall) / (precision + recall + 0.0001)
>>>
>>> print("Precision={} Recall={} F1={}".format(precision, recall, f1))
Precision=0.0 Recall=0.0 F1=0.0
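>>>
>>> # A cleaner guard than the epsilon above: define F1 as 0 when both
... # precision and recall are 0, so the reported score stays exact.
... f1 = 2.0 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0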