Python 3.6.8 |Anaconda, Inc.| (default, Dec 29 2018, 19:04:46)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.2
      /_/
Using Python version 3.6.8 (default, Dec 29 2018 19:04:46)
SparkSession available as 'spark'.
Delphi APIs (version 0.1.0-spark3.1-EXPERIMENTAL) available as 'delphi'.
>>> # Loads the target data, then defines a table for it
... spark.read \
... .option("header", True) \
... .csv("./testdata/raha/rayyan.csv") \
... .write \
... .saveAsTable("rayyan")
>>>
>>> delphi.misc \
... .options({"db_name": "default", "table_name": "rayyan", "row_id": "id"}) \
... .flatten() \
... .write \
... .saveAsTable("rayyan_flatten")
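>>> # flatten() unpivots the table into one (id, attribute, value) row per
... # cell, as the show() on rayyan_flatten below confirms. A minimal sketch
... # of the same unpivot in plain Spark SQL, for two columns only (an
... # illustration, not how delphi implements it):
... flattened_sketch = spark.sql("""
...     SELECT id, stack(2,
...         'article_title', article_title,
...         'article_language', article_language) AS (attribute, value)
...     FROM rayyan""")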
>>>
>>> spark.table("rayyan").show(1)
+------+--------------------+----------------+--------------+--------------------+--------------------+---------------+--------------+-------------------+------------------+-----------------+
| id| article_title|article_language| journal_title|jounral_abbreviation| journal_issn|article_jvolumn|article_jissue|article_jcreated_at|article_pagination| author_list|
+------+--------------------+----------------+--------------+--------------------+--------------------+---------------+--------------+-------------------+------------------+-----------------+
|235295|Late repair of in...| eng|Proc R Soc Med|Proceedings of th...|0035-9157 (Print)...| 64| 12| 1/1/71| 1187-9|"{""A. G. Parks""|
+------+--------------------+----------------+--------------+--------------------+--------------------+---------------+--------------+-------------------+------------------+-----------------+
only showing top 1 row
>>> spark.table("rayyan_flatten").show(1)
+------+-------------+--------------------+
| id| attribute| value|
+------+-------------+--------------------+
|235295|article_title|Late repair of in...|
+------+-------------+--------------------+
only showing top 1 row
>>>
>>> # Loads the ground-truth data, then defines a table for it
... spark.read \
... .option("header", True) \
... .csv("./testdata/raha/rayyan_clean.csv") \
... .write \
... .saveAsTable("rayyan_clean")
>>>
>>> spark.table("rayyan_flatten") \
... .join(spark.table("rayyan_clean"), ["id", "attribute"], "inner") \
... .where("not(value <=> correct_val)") \
... .write \
... .saveAsTable("error_cells_ground_truth")
>>>
>>> spark.table("rayyan_clean").show(1)
+------+-------------+--------------------+
| id| attribute| correct_val|
+------+-------------+--------------------+
|235295|article_title|Late repair of in...|
+------+-------------+--------------------+
only showing top 1 row
>>> spark.table("error_cells_ground_truth").show(1)
+------+---------------+-----+-----------+
| id| attribute|value|correct_val|
+------+---------------+-----+-----------+
|498345|article_jvolumn| null| -1|
+------+---------------+-----+-----------+
only showing top 1 row
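>>> # In the run() below, setErrorCells() hands the ground-truth error cells
... # to the repairer, so the error detection phase just loads them instead
... # of detecting errors itself (see the "[Error Detection Phase]" logs).
... # "model.hp.no_progress_loss" appears to control hyperopt's early
... # stopping (an assumption based on the "hyperopt: #eval=..." logs), and
... # setDiscreteThreshold(400) presumably caps how many distinct values a
... # column may have to be modeled as discrete (also an assumption).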
>>> # Detects error cells, then repairs them
... repaired_df = delphi.repair \
... .setDbName("default") \
... .setTableName("rayyan") \
... .setRowId("id") \
... .setErrorCells("error_cells_ground_truth") \
... .setDiscreteThreshold(400) \
... .option("model.hp.no_progress_loss", "100") \
... .run()
2021-08-20 21:29:40.726:[Error Detection Phase] Error cells provided by `error_cells_20210820212940`
2021-08-20 21:29:41.239:[Error Detection Phase] 897 noisy cells found (8.97%)
2021-08-20 21:29:41.239:Elapsed time (name: error detection) is 0.5977351665496826(s)
2021-08-20 21:29:42.522:Target repairable columns are article_jcreated_at,article_language,article_jissue,article_jvolumn in noisy columns (article_jcreated_at,journal_title,article_language,article_title,article_jissue,article_jvolumn,author_list,journal_issn,article_pagination)
2021-08-20 21:29:42.522:[Error Detection Phase] Analyzing cell domains to fix error cells...
2021-08-20 21:29:45.245:Elapsed time (name: cell domain analysis) is 2.722935914993286(s)
2021-08-20 21:29:45.976:[Error Detection Phase] 0 noisy cells fixed and 897 error cells (8.97%) remaining...
2021-08-20 21:29:48.226:[Repair Model Training Phase] Building 4 models to repair the cells in article_jcreated_at,article_language,article_jissue,article_jvolumn
2021-08-20 21:29:49.450:Building 1/4 model... type=classfier y=article_jcreated_at features=article_title,article_language,journal_title,jounral_abbreviation,journal_issn,article_jvolumn,article_jissue,article_pagination,author_list #rows=193 #class=106
2021-08-20 21:37:00.834:hyperopt: #eval=359/100000000
2021-08-20 21:37:02.954:Finishes building 'article_jcreated_at' model... score=0.013801562117544972 elapsed=433.5030448436737s
2021-08-20 21:37:03.132:Building 2/4 model... type=classfier y=article_language features=article_title,journal_title,jounral_abbreviation,journal_issn,article_jvolumn,article_jissue,article_jcreated_at,article_pagination,author_list #rows=641 #class=33
2021-08-20 21:40:39.980:hyperopt: #eval=262/100000000
2021-08-20 21:40:41.009:Finishes building 'article_language' model... score=0.13325384613385818 elapsed=217.8760130405426s
2021-08-20 21:40:41.186:Building 3/4 model... type=classfier y=article_jissue features=article_title,article_language,journal_title,jounral_abbreviation,journal_issn,article_jvolumn,article_jcreated_at,article_pagination,author_list #rows=947 #class=42
2021-08-20 21:55:14.069:hyperopt: #eval=484/100000000
2021-08-20 21:55:15.929:Finishes building 'article_jissue' model... score=0.0537545889732808 elapsed=874.7432188987732s
2021-08-20 21:55:16.109:Building 4/4 model... type=classfier y=article_jvolumn features=article_title,article_language,journal_title,jounral_abbreviation,journal_issn,article_jissue,article_jcreated_at,article_pagination,author_list #rows=976 #class=188
2021-08-20 22:11:08.753:hyperopt: #eval=159/100000000
2021-08-20 22:11:15.956:Finishes building 'article_jvolumn' model... score=0.017940547637166914 elapsed=959.8459448814392s
2021-08-20 22:11:15.956:Elapsed time (name: repair model training) is 2488.9358837604523(s)
2021-08-20 22:11:18.752:[Repairing Phase] Computing 797 repair updates in 744 rows...
2021-08-20 22:11:24.558:Elapsed time (name: repairing) is 8.599979877471924(s)
2021-08-20 22:11:24.790:!!!Total Processing time is 2504.1765060424805(s)!!!
>>>
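>>> # repaired_df holds one row per repaired cell, with the columns
... # (id, attribute, current_value, repaired) that appear in the joined
... # output below.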
>>> # Computes performance numbers (precision & recall)
... # - Precision: the fraction of correct repairs, i.e., repairs that match
... # the ground truth, over the total number of repairs performed
... # - Recall: correct repairs over the total number of errors
... pdf = repaired_df.join(
... spark.table("rayyan_clean").where("attribute != 'Score'"),
... ["id", "attribute"], "inner")
>>> rdf = repaired_df.join(
... spark.table("error_cells_ground_truth").where("attribute != 'Score'"),
... ["id", "attribute"], "right_outer")
>>>
>>> # Compares predicted values with the correct ones
... pdf.orderBy("attribute").show()
+------+-------------------+-------------+--------+-----------+
| id| attribute|current_value|repaired|correct_val|
+------+-------------------+-------------+--------+-----------+
|258486|article_jcreated_at| 1/1/10| 5/1/90| 1/10/01|
|405141|article_jcreated_at| 1/1/13| 1/1/78| 1/13/01|
|263027|article_jcreated_at| 1/1/13| 1/1/96| 1/13/01|
|102893|article_jcreated_at| 11/1/10| 1/1/93| 1/10/11|
|278396|article_jcreated_at| 1/1/13| 5/1/90| 1/13/01|
|111970|article_jcreated_at| 6/1/14| 9/23/14| 1/14/06|
|282039|article_jcreated_at| 1/1/08| 8/13/10| 1/8/01|
|131820|article_jcreated_at| 12/1/13| 5/1/90| 1/13/12|
|292018|article_jcreated_at| 8/1/02| 1/1/93| 1/2/08|
|164517|article_jcreated_at| 1/1/02| 5/22/14| 1/2/01|
|293974|article_jcreated_at| 1/1/14| 1/1/93| 1/14/01|
|176160|article_jcreated_at| 1/1/11| 1/1/96| 1/11/01|
|294719|article_jcreated_at| 11/1/13| 1/1/78| 1/13/11|
|184949|article_jcreated_at| 1/1/10| 1/1/96| 1/10/01|
|299635|article_jcreated_at| 5/1/11| 1/1/78| 1/11/05|
| 22708|article_jcreated_at| 1/1/14| 1/1/00| 1/14/01|
|385346|article_jcreated_at| 10/1/13| 1/1/93| 1/13/10|
|302969|article_jcreated_at| 1/1/07| 1/1/78| 1/7/01|
|245822|article_jcreated_at| 1/1/02| 1/1/78| 1/2/01|
|305933|article_jcreated_at| 1/1/14| 1/1/93| 1/14/01|
+------+-------------------+-------------+--------+-----------+
only showing top 20 rows
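>>> # None of the repaired article_jcreated_at values above match
... # correct_val: the ground truth appears to encode the date components
... # in a different order than the source (e.g., 1/1/10 vs. 1/10/01),
... # a format the model never produces, so precision and recall below
... # both come out as 0.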
>>>
>>> precision = pdf.where("repaired <=> correct_val").count() / pdf.count()
>>> recall = rdf.where("repaired <=> correct_val").count() / rdf.count()
>>> f1 = (2.0 * precision * recall) / (precision + recall)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ZeroDivisionError: float division by zero
>>>
>>> print("Precision={} Recall={} F1={}".format(precision, recall, f1))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'f1' is not defined
>>> print("Precision={} Recall={} F1=".format(precision, recall))
Precision=0.0 Recall=0.0 F1=
>>>
>>> f1 = (2.0 * precision * recall) / (precision + recall + 0.0001)
>>>
>>> print("Precision={} Recall={} F1={}".format(precision, recall, f1))
Precision=0.0 Recall=0.0 F1=0.0
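>>>
>>> # A cleaner guard than the epsilon above: define F1 as 0 when both
... # precision and recall are 0, so the reported score stays exact.
... f1 = 2.0 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0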