Datasets for DeepMatcher paper

The datasets listed on this page were used for the experimental study in Deep Learning for Entity Matching, published in SIGMOD 2018. Each data instance in each dataset is a labeled tuple pair, where one tuple in the pair comes from each of the two tables being matched, say table A and table B. We assume that both tables being matched have the same schema.

The table below summarizes all the datasets. Here's a brief description of some of the columns:

  • Size: Number of labeled tuple pairs in the dataset.
  • # Pos.: Number of positive instances, i.e., tuple pairs marked as a match in the dataset.
  • # Attr.: Number of attributes in the tables being matched (note that both tables have the same schema).

The "Browse" links point to the preprocessed versions of the datasets which were used for experiments. The "Download" links provide a compressed zip of the preprocessed data. The "Raw" links provide a compressed zip of the raw unprocessed data obtained from data source. The "Info" links point to the dataset details.

| Type | Dataset | Domain | Size | # Pos. | # Attr. | Links |
|---|---|---|---|---|---|---|
| Structured | BeerAdvo-RateBeer | beer | 450 | 68 | 4 | Browse \| Download \| Raw \| Info |
| | iTunes-Amazon1 | music | 539 | 132 | 8 | Browse \| Download \| Raw \| Info |
| | Fodors-Zagats | restaurant | 946 | 110 | 6 | Browse \| Download \| Raw \| Info |
| | DBLP-ACM1 | citation | 12,363 | 2,220 | 4 | Browse \| Download \| Raw \| Info |
| | DBLP-Scholar1 | citation | 28,707 | 5,347 | 4 | Browse \| Download \| Raw \| Info |
| | Amazon-Google | software | 11,460 | 1,167 | 3 | Browse \| Download \| Raw \| Info |
| | Walmart-Amazon1 | electronics | 10,242 | 962 | 5 | Browse \| Download \| Raw \| Info |
| Textual | Abt-Buy | product | 9,575 | 1,028 | 3 | Browse \| Download \| Raw \| Info |
| | Company | company | 112,632 | 28,200 | 1 | Browse \| Download \| Raw \| Info |
| Dirty | iTunes-Amazon2 | music | 539 | 132 | 8 | Browse \| Download \| Info |
| | DBLP-ACM2 | citation | 12,363 | 2,220 | 4 | Browse \| Download \| Info |
| | DBLP-Scholar2 | citation | 28,707 | 5,347 | 4 | Browse \| Download \| Info |
| | Walmart-Amazon2 | electronics | 10,242 | 962 | 5 | Browse \| Download \| Info |

Notes:

  • The tableA.csv and tableB.csv files in the provided experimental data may not directly correspond to the original tables being matched. You can think of tableA.csv as containing all the "left" tuples and tableB.csv as containing all the "right" tuples. This is done to distribute the data in a reasonably compact but readable form. A minimal loading sketch appears after these notes.
  • The dirty EM datasets were generated from the preprocessed versions of the corresponding structured EM datasets. As a result, there are no raw versions of the dirty datasets.
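
To make the layout concrete, here is a minimal sketch of inspecting a downloaded preprocessed dataset with pandas. It assumes the extracted archive contains tableA.csv, tableB.csv, and split files such as train.csv; the exact file and column names can vary by dataset, so check the downloaded files rather than treating the identifiers below as the definitive format.

```python
import pandas as pd

# Paths are placeholders for files inside an extracted preprocessed dataset archive;
# adjust them to the actual file names in the download.
table_a = pd.read_csv("tableA.csv")  # the "left" tuples
table_b = pd.read_csv("tableB.csv")  # the "right" tuples
train = pd.read_csv("train.csv")     # labeled tuple pairs used for training

print(table_a.shape, table_b.shape, train.shape)
print(train.columns.tolist())  # inspect how pairs and their labels are encoded
print(train.head())
```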

Batch download links:

Preprocessed Data

Raw Data

Dataset Details

Structured

BeerAdvo-RateBeer

This dataset contains beer data from BeerAdvocate and RateBeer and was obtained from here. It was created by students in the CS 784 data science class at UW-Madison, Fall 2015, as part of their class project. To create the dataset, the students:

  1. Crawled HTML pages from the two websites
  2. Extracted tuples from the HTML pages to create two tables, one per site
  3. Performed blocking on these tables (to remove obviously non-matched tuple pairs), producing a set of candidate tuple pairs
  4. Took a random sample of pairs from the above set and labeled the pairs in the sample as "match" / "non-match".

For the experiments in this work, we split the set of labeled tuple pairs into three subsets, i.e., train, validation, and test, with a 3:1:1 ratio.
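
The 3:1:1 split can be reproduced along the following lines with pandas. This is a sketch of the general procedure under an assumed input file name (labeled_pairs.csv); it is not the exact code or random seed used to produce the released splits.

```python
import pandas as pd

# Hypothetical input: all labeled tuple pairs for a dataset in a single CSV.
pairs = pd.read_csv("labeled_pairs.csv")

# Shuffle, then split 3:1:1 (i.e., 60% / 20% / 20%) into train / validation / test.
pairs = pairs.sample(frac=1.0, random_state=0).reset_index(drop=True)
n_train = int(0.6 * len(pairs))
n_valid = int(0.8 * len(pairs))

pairs.iloc[:n_train].to_csv("train.csv", index=False)
pairs.iloc[n_train:n_valid].to_csv("valid.csv", index=False)
pairs.iloc[n_valid:].to_csv("test.csv", index=False)
```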

iTunes-Amazon

This dataset contains music data from iTunes and Amazon and was obtained from here. This was also created by students in the CS 784 data science class at UW-Madison. The dataset was created in the same manner as BeerAdvo-RateBeer.

Fodors-Zagats

This dataset contains restaurant data from Fodors and from Zagat and was obtained from here. The original dataset obtained from the source contained two tables, one each for Fodors and Zagat, and a list of golden matches indicating which tuple pairs referred to the same restaurant. To create the version of the dataset used in our experiments, which contains both matches and non-matches, we used the following procedure (a sketch of the labeling step appears after the list):

  1. Given the two tables (tableA.csv & tableB.csv), perform dataset-specific blocking to obtain a candidate set C
  2. For each tuple pair in set C, if the pair is present in the golden matches file (gold.csv), mark the pair as a match. Else, mark the pair as a non-match.
  3. Randomly split the labeled candidate set C into 3 sets, i.e., train, validation, and test, with ratio 3:1:1.
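
Below is a minimal sketch of the labeling step (step 2). It assumes the candidate set and the golden matches are CSVs keyed by ltable_id and rtable_id columns; these column names are assumptions and may differ from the released files.

```python
import pandas as pd

# Hypothetical inputs: candidate pairs produced by blocking, and the golden matches.
candidates = pd.read_csv("candidates.csv")  # assumed columns: ltable_id, rtable_id, ...
gold = pd.read_csv("gold.csv")              # assumed columns: ltable_id, rtable_id

# A candidate pair is a match (label 1) if it appears in the gold file, else a non-match (0).
gold_pairs = set(zip(gold["ltable_id"], gold["rtable_id"]))
candidates["label"] = [
    int((l, r) in gold_pairs)
    for l, r in zip(candidates["ltable_id"], candidates["rtable_id"])
]
```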

DBLP-ACM

This dataset contains bibliographic data from DBLP and ACM and was obtained from here. The original dataset obtained from the source contained two tables, and a list of golden matches. To create the version of the dataset used in our experiments we used the same procedure as in the case of Fodors-Zagats.

DBLP-Scholar

This dataset contains bibliographic data from DBLP and Google Scholar and was obtained from here. The original dataset obtained from the source contained two tables, and a list of golden matches. To create the version of the dataset used in our experiments we used the same procedure as in the case of Fodors-Zagats.

Amazon-Google

This dataset contains product data from Amazon and Google and was obtained from here. The original dataset contained two tables, and a list of golden matches. Further, the original dataset contained one additional attribute, "description", which contained long blobs of text. This attribute was removed so that the data could be used as a structured dataset. To create the version of the dataset used in our experiments, we used the same procedure as in the case of Fodors-Zagats.

Walmart-Amazon

This dataset contains product data from Walmart and Amazon and was obtained from here. The original dataset contained two tables, and a list of golden matches. Further, the original dataset contained one additional attribute, "proddescrlong", which contained long blobs of text. This attribute was removed so that the data could be used as a structured dataset. To create the version of the dataset used in our experiments, we used the same procedure as in the case of Fodors-Zagats.

Textual

Abt-Buy

This dataset contains product data from Abt.com and Buy.com and was obtained from here. The original dataset contained two tables, and a list of golden matches. To create the version of the dataset used in our experiments we used the same procedure as in the case of Fodors-Zagats.

Company

This dataset consists of pairs (a,b), where a is the text of a Wikipedia page describing a company and b is the text of a company’s homepage. We created matching pairs in this dataset by crawling Wikipedia pages describing companies, then following company URLs in those pages to retrieve company homepages. To generate the non-matching pairs, for each matching pair (a,b), we fix a and form three negative pairs (a,b1), (a,b2), and (a,b3) where b1, b2, and b3 are the top-3 most similar pages other than b in the company homepage collection, calculated based on Okapi BM25 rankings.
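
The negative-pair construction can be sketched with the rank_bm25 package as follows. The variable names and toy data are placeholders, and the BM25 query used here (the homepage b itself) is an assumption; the description above does not pin down whether the ranking was computed against a or b.

```python
from rank_bm25 import BM25Okapi  # third-party BM25 implementation (pip install rank-bm25)

# Toy placeholder data; wiki_texts[i] and homepage_texts[i] describe the same company.
wiki_texts = ["acme corp is an anvil maker", "globex corporation exports widgets"]
homepage_texts = ["welcome to acme anvils", "globex official site widget exports"]

tokenized_homepages = [doc.split() for doc in homepage_texts]
bm25 = BM25Okapi(tokenized_homepages)

pairs = []  # (wikipedia_text, homepage_text, label) triples
for i, wiki in enumerate(wiki_texts):
    pairs.append((wiki, homepage_texts[i], 1))  # the crawled match is a positive pair

    # Rank all homepages by BM25 similarity to the true homepage b, then take the
    # top-3 pages other than b as negatives (the choice of query document is an assumption).
    scores = bm25.get_scores(homepage_texts[i].split())
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    for j in [k for k in ranked if k != i][:3]:
        pairs.append((wiki, homepage_texts[j], 0))
```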

In the case of company homepages and Wikipedia pages, the title of the document often contains the name of the company. Further, the document often begins by mentioning the company name. In general, matching web documents in the wild is a complex task, and such hints are unlikely to be present. To simulate a general web document matching scenario, we remove the first 20 tokens from the parsed HTML of each document and consider only the rest of the document for the purpose of matching. This essentially removes the document title, as well as any mentions of the company name at the beginning of the first paragraph.
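
A sketch of that preprocessing step, assuming simple whitespace tokenization of the already-parsed HTML text (the exact tokenizer used is not specified here):

```python
def drop_leading_tokens(text: str, n: int = 20) -> str:
    """Remove the first n whitespace-delimited tokens from a parsed document's text."""
    return " ".join(text.split()[n:])
```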

Dirty

iTunes-Amazon

This dataset contains music data from iTunes and Amazon and was obtained by modifying the structured iTunes-Amazon dataset to simulate dirty data. Specifically, for each attribute other than "title", we randomly moved each value to the attribute "title" in the same tuple with 50% probability. This simulates a common kind of dirty data seen in the wild while keeping the modifications simple.
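
A minimal sketch of that corruption step on a pandas DataFrame of tuples is shown below. It interprets "moved" as appending the value to the "title" attribute and clearing the original cell, and it assumes "id" and "title" column names; both are assumptions about the exact convention used for the released dirty datasets.

```python
import random

import pandas as pd

def make_dirty(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """With 50% probability, move each non-title value into the "title" attribute.

    "Move" is interpreted here as appending the value to "title" and clearing the
    original cell; the released dirty datasets may use a slightly different convention.
    """
    rng = random.Random(seed)
    out = df.copy()
    for col in out.columns:
        if col in ("id", "title"):  # assumed key/title column names
            continue
        for i in out.index:
            value = out.at[i, col]
            if pd.notna(value) and rng.random() < 0.5:
                out.at[i, "title"] = f"{out.at[i, 'title']} {value}"
                out.at[i, col] = None
    return out
```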

DBLP-ACM

This dataset contains bibliographic data from DBLP and ACM and was obtained by modifying the structured DBLP-ACM dataset to simulate dirty data. The procedure for generating this dataset is the same as that for dirty iTunes-Amazon.

DBLP-Scholar

This dataset contains bibliographic data from DBLP and Google Scholar and was obtained by modifying the structured DBLP-Scholar dataset to simulate dirty data. The procedure for generating this dataset is the same as that for dirty iTunes-Amazon.

Walmart-Amazon

This dataset contains product data from Walmart and Amazon and was obtained by modifying the structured Walmart-Amazon dataset to simulate dirty data. The procedure for generating this dataset is the same as that for dirty iTunes-Amazon.