Datasets for DeepMatcher paper

The datasets listed on this page were used for the experimental study in Deep Learning for Entity Matching, published in SIGMOD 2018. Each data instance in each dataset is a labeled tuple pair, where one tuple in the pair comes from each of the two tables being matched, say table A and table B. We assume that both tables being matched have the same schema.

The table below summarizes all the datasets. Here's a brief description of some of the columns:

  • Size: Number of labeled tuple pairs in the dataset.
  • # Pos.: Number of positive instances, i.e., tuple pairs marked as a match in the dataset.
  • # Attr.: Number of attributes in the tables being matched (note that both tables have the same schema).

The "Browse" links point to the preprocessed versions of the datasets which were used for experiments. The "Download" links provide a compressed zip of the preprocessed data. The "Raw" links provide a compressed zip of the raw unprocessed data obtained from data source. The "Info" links point to the dataset details.

| Type | Dataset | Domain | Size | # Pos. | # Attr. | Links |
|---|---|---|---|---|---|---|
| Structured | BeerAdvo-RateBeer | beer | 450 | 68 | 4 | Browse \| Download \| Raw \| Info |
| | iTunes-Amazon1 | music | 539 | 132 | 8 | Browse \| Download \| Raw \| Info |
| | Fodors-Zagats | restaurant | 946 | 110 | 6 | Browse \| Download \| Raw \| Info |
| | DBLP-ACM1 | citation | 12,363 | 2,220 | 4 | Browse \| Download \| Raw \| Info |
| | DBLP-Scholar1 | citation | 28,707 | 5,347 | 4 | Browse \| Download \| Raw \| Info |
| | Amazon-Google | software | 11,460 | 1,167 | 3 | Browse \| Download \| Raw \| Info |
| | Walmart-Amazon1 | electronics | 10,242 | 962 | 5 | Browse \| Download \| Raw \| Info |
| Textual | Abt-Buy | product | 9,575 | 1,028 | 3 | Browse \| Download \| Raw \| Info |
| | Company | company | 112,632 | 28,200 | 1 | Browse \| Download \| Raw \| Info |
| Dirty | iTunes-Amazon2 | music | 539 | 132 | 8 | Browse \| Download \| Info |
| | DBLP-ACM2 | citation | 12,363 | 2,220 | 4 | Browse \| Download \| Info |
| | DBLP-Scholar2 | citation | 28,707 | 5,347 | 4 | Browse \| Download \| Info |
| | Walmart-Amazon2 | electronics | 10,242 | 962 | 5 | Browse \| Download \| Info |

Notes:

  • The tableA.csv and tableB.csv files in the provided experimental data may not directly correspond to the original tables being matched. You can think of tableA.csv as containing all the "left" tuples and tableB.csv as containing all the "right" tuples. This is done to distribute the data in a reasonably compact but readable form. A minimal loading sketch appears after these notes.
  • The dirty EM datasets were generated from the preprocessed versions of the corresponding structured EM datasets. As a result, there are no raw versions of the dirty datasets.
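
To make the layout concrete, here is a minimal sketch of inspecting a downloaded preprocessed dataset with pandas. It assumes the extracted archive contains tableA.csv, tableB.csv, and split files such as train.csv; the exact file and column names can vary by dataset, so check the downloaded files rather than treating the identifiers below as the definitive format.

```python
import pandas as pd

# Paths are placeholders for files inside an extracted preprocessed dataset archive;
# adjust them to the actual file names in the download.
table_a = pd.read_csv("tableA.csv")  # the "left" tuples
table_b = pd.read_csv("tableB.csv")  # the "right" tuples
train = pd.read_csv("train.csv")     # labeled tuple pairs used for training

print(table_a.shape, table_b.shape, train.shape)
print(train.columns.tolist())  # inspect how pairs and their labels are encoded
print(train.head())
```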

Batch download links:

Preprocessed Data

Raw Data

Dataset Details

Structured

BeerAdvo-RateBeer

This dataset contains beer data from BeerAdvocate and RateBeer and was obtained from here. It was created by students in the CS 784 data science class at UW-Madison, Fall 2015, as part of their class project. To create the dataset, the students:

  1. Crawled HTML pages from the two websites
  2. Extracted tuples from the HTML pages to create two tables, one per site
  3. Performed blocking on these tables (to remove obviously non-matched tuple pairs), producing a set of candidate tuple pairs
  4. Took a random sample of pairs from the above set and labeled the pairs in the sample as "match" / "non-match".

For the experiments in this work, we split the set of labeled tuple pairs into three subsets, i.e., train, validation, and test, with a 3:1:1 ratio.
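
The 3:1:1 split can be reproduced along the following lines with pandas. This is a sketch of the general procedure under an assumed input file name (labeled_pairs.csv); it is not the exact code or random seed used to produce the released splits.

```python
import pandas as pd

# Hypothetical input: all labeled tuple pairs for a dataset in a single CSV.
pairs = pd.read_csv("labeled_pairs.csv")

# Shuffle, then split 3:1:1 (i.e., 60% / 20% / 20%) into train / validation / test.
pairs = pairs.sample(frac=1.0, random_state=0).reset_index(drop=True)
n_train = int(0.6 * len(pairs))
n_valid = int(0.8 * len(pairs))

pairs.iloc[:n_train].to_csv("train.csv", index=False)
pairs.iloc[n_train:n_valid].to_csv("valid.csv", index=False)
pairs.iloc[n_valid:].to_csv("test.csv", index=False)
```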

iTunes-Amazon

This dataset contains music data from iTunes and Amazon and was obtained from here. This was also created by students in the CS 784 data science class at UW-Madison. The dataset was created in the same manner as BeerAdvo-RateBeer.

Fodors-Zagats

This dataset contains restaurant data from Fodors and from Zagat and was obtained from here. The original dataset obtained from the source contained two tables, one each for Fodors and Zagat, and a list of golden matches indicating which tuple pairs referred to the same restaurant. To create the version of the dataset used in our experiments, which contains both matches and non-matches, we used the following procedure (a sketch of the labeling step appears after the list):

  1. Given the two tables (tableA.csv & tableB.csv), perform dataset-specific blocking to obtain a candidate set C
  2. For each tuple pair in set C, if the pair is present in the golden matches file (gold.csv), mark the pair as a match. Else, mark the pair as a non-match.
  3. Randomly split the labeled candidate set C into 3 sets, i.e., train, validation, and test, with ratio 3:1:1.
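
Below is a minimal sketch of the labeling step (step 2). It assumes the candidate set and the golden matches are CSVs keyed by ltable_id and rtable_id columns; these column names are assumptions and may differ from the released files.

```python
import pandas as pd

# Hypothetical inputs: candidate pairs produced by blocking, and the golden matches.
candidates = pd.read_csv("candidates.csv")  # assumed columns: ltable_id, rtable_id, ...
gold = pd.read_csv("gold.csv")              # assumed columns: ltable_id, rtable_id

# A candidate pair is a match (label 1) if it appears in the gold file, else a non-match (0).
gold_pairs = set(zip(gold["ltable_id"], gold["rtable_id"]))
candidates["label"] = [
    int((l, r) in gold_pairs)
    for l, r in zip(candidates["ltable_id"], candidates["rtable_id"])
]
```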

DBLP-ACM

This dataset contains bibliographic data from DBLP and ACM and was obtained from here. The original dataset obtained from the source contained two tables, and a list of golden matches. To create the version of the dataset used in our experiments we used the same procedure as in the case of Fodors-Zagats.

DBLP-Scholar

This dataset contains bibliographic data from DBLP and Google Scholar and was obtained from here. The original dataset obtained from the source contained two tables, and a list of golden matches. To create the version of the dataset used in our experiments we used the same procedure as in the case of Fodors-Zagats.

Amazon-Google

This dataset contains product data from Amazon and Google and was obtained from here. The original dataset contained two tables, and a list of golden matches. Further, the original dataset contained one additional attribute, "description", which contained long blobs of text. This attribute was removed so that the data could be used as a structured dataset. To create the version of the dataset used in our experiments, we used the same procedure as in the case of Fodors-Zagats.

Walmart-Amazon

This dataset contains product data from Walmart and Amazon and was obtained from here. The original dataset contained two tables, and a list of golden matches. Further, the original dataset contained one additional attribute, "proddescrlong", which contained long blobs of text. This attribute was removed so that the data could be used as a structured dataset. To create the version of the dataset used in our experiments, we used the same procedure as in the case of Fodors-Zagats.

Textual

Abt-Buy

This dataset contains product data from Abt.com and Buy.com and was obtained from here. The original dataset contained two tables, and a list of golden matches. To create the version of the dataset used in our experiments we used the same procedure as in the case of Fodors-Zagats.

Company

This dataset consists of pairs (a,b), where a is the text of a Wikipedia page describing a company and b is the text of a company’s homepage. We created matching pairs in this dataset by crawling Wikipedia pages describing companies, then following company URLs in those pages to retrieve company homepages. To generate the non-matching pairs, for each matching pair (a,b), we fix a and form three negative pairs (a,b1), (a,b2), and (a,b3) where b1, b2, and b3 are the top-3 most similar pages other than b in the company homepage collection, calculated based on Okapi BM25 rankings.
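
The negative-pair construction can be sketched with the rank_bm25 package as follows. The variable names and toy data are placeholders, and the BM25 query used here (the homepage b itself) is an assumption; the description above does not pin down whether the ranking was computed against a or b.

```python
from rank_bm25 import BM25Okapi  # third-party BM25 implementation (pip install rank-bm25)

# Toy placeholder data; wiki_texts[i] and homepage_texts[i] describe the same company.
wiki_texts = ["acme corp is an anvil maker", "globex corporation exports widgets"]
homepage_texts = ["welcome to acme anvils", "globex official site widget exports"]

tokenized_homepages = [doc.split() for doc in homepage_texts]
bm25 = BM25Okapi(tokenized_homepages)

pairs = []  # (wikipedia_text, homepage_text, label) triples
for i, wiki in enumerate(wiki_texts):
    pairs.append((wiki, homepage_texts[i], 1))  # the crawled match is a positive pair

    # Rank all homepages by BM25 similarity to the true homepage b, then take the
    # top-3 pages other than b as negatives (the choice of query document is an assumption).
    scores = bm25.get_scores(homepage_texts[i].split())
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    for j in [k for k in ranked if k != i][:3]:
        pairs.append((wiki, homepage_texts[j], 0))
```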

In the case of company homepages and Wikipedia pages, the title of the document often contains the name of the company. Further, the document often begins by mentioning the company name. In general, matching web documents in the wild is a complex task, and such hints are unlikely to be present. To simulate a general web document matching scenario, we remove the first 20 tokens from the parsed HTML of each document and consider only the rest of the document for the purpose of matching. This essentially removes the document title, as well as any mentions of the company name at the beginning of the first paragraph.
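
A sketch of that preprocessing step, assuming simple whitespace tokenization of the already-parsed HTML text (the exact tokenizer used is not specified here):

```python
def drop_leading_tokens(text: str, n: int = 20) -> str:
    """Remove the first n whitespace-delimited tokens from a parsed document's text."""
    return " ".join(text.split()[n:])
```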

Dirty

iTunes-Amazon

This dataset contains music data from iTunes and Amazon and was obtained by modifying the structured iTunes-Amazon dataset to simulate dirty data. Specifically, for each attribute other than "title", we randomly moved each value to the attribute "title" in the same tuple with 50% probability. This simulates a common kind of dirty data seen in the wild while keeping the modifications simple.
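
A minimal sketch of that corruption step on a pandas DataFrame of tuples is shown below. It interprets "moved" as appending the value to the "title" attribute and clearing the original cell, and it assumes "id" and "title" column names; both are assumptions about the exact convention used for the released dirty datasets.

```python
import random

import pandas as pd

def make_dirty(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """With 50% probability, move each non-title value into the "title" attribute.

    "Move" is interpreted here as appending the value to "title" and clearing the
    original cell; the released dirty datasets may use a slightly different convention.
    """
    rng = random.Random(seed)
    out = df.copy()
    for col in out.columns:
        if col in ("id", "title"):  # assumed key/title column names
            continue
        for i in out.index:
            value = out.at[i, col]
            if pd.notna(value) and rng.random() < 0.5:
                out.at[i, "title"] = f"{out.at[i, 'title']} {value}"
                out.at[i, col] = None
    return out
```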

DBLP-ACM

This dataset contains bibliographic data from DBLP and ACM and was obtained by modifying the structured DBLP-ACM dataset to simulate dirty data. The procedure for generating this dataset is the same as that for dirty iTunes-Amazon.

DBLP-Scholar

This dataset contains bibliographic data from DBLP and Google Scholar and was obtained by modifying the structured DBLP-Scholar dataset to simulate dirty data. The procedure for generating this dataset is the same as that for dirty iTunes-Amazon.

Walmart-Amazon

This dataset contains product data from Walmart and Amazon and was obtained by modifying the structured Walmart-Amazon dataset to simulate dirty data. The procedure for generating this dataset is the same as that for dirty iTunes-Amazon.