---
title: "ER-Evaluation: End-to-End Evaluation of Entity Resolution Systems"
tags:
  - Python
  - Entity Resolution
  - Evaluation
authors:
  - name: Olivier Binette
    orcid: 0000-0001-6009-5206
    corresponding: true
    affiliation: 1
  - name: Jerome P. Reiter
    orcid: 0000-0002-8374-3832
    corresponding: false
    affiliation: 1
affiliations:
  - name: Duke University, USA
    index: 1
date: May 6, 2023
bibliography: paper.bib
---

# Summary

Entity resolution (ER), also referred to as record linkage and deduplication, is the process of identifying and matching distinct representations of real-world entities across diverse data sources. It plays a crucial role in data management, cleaning, and integration, with applications such as assessing the accuracy of the decennial census, detecting fraud, linking patient data in healthcare, and extracting relationships in structured and unstructured data [@christen2012; @christophides2019; @papadakis2021; @binette2022a].

As ER techniques continue to evolve and improve, it is essential to have an efficient and comprehensive evaluation framework to measure their performance and compare different approaches. Despite the growth of ER research, there remains a need for a unified evaluation framework that can address the challenges associated with ER system evaluation, including accounting for sampling biases and managing class imbalances. Without such a framework, relying on naive clustering metrics and toy benchmark datasets generally yields over-optimistic results, which can cause performance rank reversals and poor system design [@wang2022; @binette2022b].

ER-Evaluation is a Python 3.7+ package designed to address these challenges by implementing all components of a principled evaluation framework for ER systems. It provides statistically principled estimators of key performance metrics and summary statistics, along with error analysis tools, data labeling tools, and data visualizations. The package is written in Python with a simple architecture, making it straightforward to port to other languages and frameworks when necessary.
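
For illustration, the sketch below shows how a predicted disambiguation and a benchmark sample can be represented as membership vectors (pandas Series mapping record identifiers to cluster identifiers); the estimator name and signature in the commented-out call are assumptions for illustration and may differ from the package's released API.

```python
# Minimal usage sketch; the estimator name below is an assumption for
# illustration and may not match the package's released API exactly.
import pandas as pd

# Predicted disambiguation: a membership vector mapping record IDs to cluster IDs.
prediction = pd.Series({"r1": "c1", "r2": "c1", "r3": "c2", "r4": "c2", "r5": "c3"})

# Benchmark sample of fully resolved ("true") entity clusters.
reference = pd.Series({"r1": "e1", "r2": "e1", "r3": "e1", "r4": "e2"})

# Hypothetical estimator call, with sampling weights accounting for how the
# benchmark clusters were selected:
# from er_evaluation.estimators import pairwise_precision_design_estimate
# estimate, std_error = pairwise_precision_design_estimate(
#     prediction, reference, weights="cluster_size"
# )
```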

Additionally, ER-Evaluation adopts a novel entity-centric approach that uses disambiguated entity clusters as the foundation for analysis. This contrasts with traditional evaluation methods based on labeling record pairs [@marchant2017]. The entity-centric approach streamlines the utilization of existing benchmark datasets and the labeling of new datasets without necessitating complex sampling schemes. Furthermore, it enables the reuse of benchmark datasets at all stages of the evaluation process, including for cluster-level error analysis.
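
To make the contrast concrete, the toy sketch below (with made-up record and entity identifiers) shows the same ground-truth information expressed as labeled record pairs and as complete entity clusters; the cluster form is the entity-centric representation that the package works with.

```python
import pandas as pd

# Pair-based labeling: individual record pairs annotated as match / non-match.
labeled_pairs = pd.DataFrame(
    [("r1", "r2", True), ("r1", "r3", True), ("r1", "r4", False)],
    columns=["record_1", "record_2", "is_match"],
)

# Entity-centric labeling: complete clusters for a sample of entities, from
# which all pairwise judgments within the sampled entities can be derived.
benchmark_clusters = pd.Series({"r1": "e1", "r2": "e1", "r3": "e1", "r4": "e2"})
```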

# Statement of need

Entity resolution is a clustering problem characterized by small and numerous clusters (up to millions or billions of clusters). Researchers commonly evaluate the performance of entity resolution systems by computing performance metrics (precision, recall, cluster metrics) on relatively small benchmark datasets. However, this process has been shown to yield biased and over-optimistic performance assessments in ER, potentially leading to performance rank reversals and poor system design [@wang2022; @binette2022b].
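
As a toy illustration of this common practice, the sketch below computes naive pairwise precision by restricting a predicted clustering to the records of a small benchmark; because the benchmark clusters are generally not a representative sample of all clusters, point estimates obtained this way can be substantially biased.

```python
from itertools import combinations

import pandas as pd

# Toy predicted clustering and a small benchmark of true clusters (made-up data).
prediction = pd.Series({"r1": "c1", "r2": "c1", "r3": "c1", "r4": "c2", "r5": "c2"})
benchmark = pd.Series({"r1": "e1", "r2": "e1", "r4": "e2", "r5": "e3"})

# Restrict the prediction to benchmark records and compare all record pairs.
restricted = prediction[benchmark.index]
pairs = list(combinations(benchmark.index, 2))
predicted_links = {(a, b) for a, b in pairs if restricted[a] == restricted[b]}
true_links = {(a, b) for a, b in pairs if benchmark[a] == benchmark[b]}

# Naive pairwise precision: proportion of predicted links that are true links.
naive_precision = len(predicted_links & true_links) / len(predicted_links)
print(naive_precision)  # 0.5 on this toy benchmark
```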

To address this issue, a new entity-centric methodology has been proposed in @binette2022b for obtaining accurate performance metric estimates based on small and potentially biased benchmark datasets. The ER-Evaluation package implements this methodology and numerous extensions to create a comprehensive, end-to-end evaluation framework. It aims to streamline the comparison of diverse ER techniques, assess their accuracy, and ultimately accelerate the development and adoption of high-performing ER systems. By integrating essential components such as data preprocessing, error analysis, performance estimation, and visualization functions, ER-Evaluation offers a user-friendly, modular, and extensible interface for researchers and practitioners.
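
The sketch below illustrates the general idea of a design-based correction, namely weighting each sampled benchmark cluster by the inverse of its (assumed) inclusion probability when aggregating cluster-level pair counts; it is a simplified illustration with hypothetical weights, not the exact estimators derived in @binette2022b or implemented in the package.

```python
from itertools import combinations

import pandas as pd

# Toy predicted clustering (made-up data).
prediction = pd.Series(
    {"r1": "c1", "r2": "c1", "r3": "c2", "r4": "c2", "r5": "c2", "r6": "c3"}
)

# Benchmark: a sample of true clusters, each with a hypothetical sampling weight
# (inverse inclusion probability) reflecting how the benchmark was constructed.
benchmark = {
    "e1": (["r1", "r2", "r3"], 2.0),
    "e2": (["r4", "r5"], 5.0),
}

numerator, denominator = 0.0, 0.0
for records, weight in benchmark.values():
    for a, b in combinations(records, 2):
        denominator += weight  # every true pair within the sampled cluster
        numerator += weight * (prediction[a] == prediction[b])  # correctly linked

# Weighted ratio estimate of pairwise recall.
recall_estimate = numerator / denominator
print(recall_estimate)  # about 0.64 with these toy weights
```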

The software is currently used by PatentsView.org to evaluate patent inventor name disambiguation [@binette2022c]. The original methodology is published in @binette2022b, and an extended methodology is under development for an upcoming article [@binette2023].

# Acknowledgements

We acknowledge financial support from the Natural Sciences and Engineering Research Council of Canada and from the Fonds de Recherche du Québec - Nature et Technologies.

# References