Datasets listed on this page were used for the experimental study in Deep Learning for Entity Matching, published in SIGMOD 2018. Each data instance in each dataset is a labeled tuple pair, where one tuple comes from each of the two tables being matched, say table A and table B. We assume that both tables being matched have the same schema.
The table below summarizes all the datasets. Here's a brief description of some of the columns:
- Size: Number of labeled tuple pairs in the dataset.
- # Pos.: Number of positive instances, i.e., tuple pairs marked as a match in the dataset.
- # Attr.: Number of attributes in the tables being matched (note that both tables have the same schema).
The "Browse" links point to the preprocessed versions of the datasets which were used for experiments. The "Download" links provide a compressed zip of the preprocessed data. The "Raw" links provide a compressed zip of the raw unprocessed data obtained from the data source. The "Info" links point to the dataset details.
Type | Dataset | Domain | Size | # Pos. | # Attr. | Links
---|---|---|---|---|---|---
Structured | BeerAdvo-RateBeer | beer | 450 | 68 | 4 | Browse / Download / Raw / Info
Structured | iTunes-Amazon1 | music | 539 | 132 | 8 | Browse / Download / Raw / Info
Structured | Fodors-Zagats | restaurant | 946 | 110 | 6 | Browse / Download / Raw / Info
Structured | DBLP-ACM1 | citation | 12,363 | 2,220 | 4 | Browse / Download / Raw / Info
Structured | DBLP-Scholar1 | citation | 28,707 | 5,347 | 4 | Browse / Download / Raw / Info
Structured | Amazon-Google | software | 11,460 | 1,167 | 3 | Browse / Download / Raw / Info
Structured | Walmart-Amazon1 | electronics | 10,242 | 962 | 5 | Browse / Download / Raw / Info
Textual | Abt-Buy | product | 9,575 | 1,028 | 3 | Browse / Download / Raw / Info
Textual | Company | company | 112,632 | 28,200 | 1 | Browse / Download / Raw / Info
Dirty | iTunes-Amazon2 | music | 539 | 132 | 8 | Browse / Download / Info
Dirty | DBLP-ACM2 | citation | 12,363 | 2,220 | 4 | Browse / Download / Info
Dirty | DBLP-Scholar2 | citation | 28,707 | 5,347 | 4 | Browse / Download / Info
Dirty | Walmart-Amazon2 | electronics | 10,242 | 962 | 5 | Browse / Download / Info
Notes:
- The `tableA.csv` and `tableB.csv` files in the provided experimental data may not directly correspond to the original tables being matched. You can think of `tableA.csv` as containing all the "left" tuples and `tableB.csv` as containing all the "right" tuples. This is done so as to distribute the data in a reasonably compact but readable form.
- The dirty EM datasets were generated from the preprocessed versions of the corresponding structured EM datasets. As a result, there are no raw versions of the dirty datasets.
- Download all preprocessed structured datasets
- Download all preprocessed textual datasets
- Download all preprocessed dirty datasets
This dataset contains beer data from BeerAdvocate and RateBeer and was obtained from here. It was created by students in the CS 784 data science class at UW-Madison, Fall 2015, as part of their class project. To create the dataset, students:
- Crawled HTML pages from the two websites
- Extracted tuples from the HTML pages to create two tables, one per site
- Performed blocking on these tables (to remove obviously non-matched tuple pairs), producing a set of candidate tuple pairs
- Took a random sample of pairs from the above set and labeled the pairs in the sample as "match" / "non-match".
For the purpose of performing experiments for this work, we split the set of labeled tuple pairs into three subsets (train, validation, and test) with a 3:1:1 ratio.
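A 3:1:1 split along these lines can be reproduced with a few lines of pandas. This is a minimal sketch; the file name `labeled_pairs.csv` is illustrative rather than part of the released data:

```python
import pandas as pd

# Illustrative file name; assumes one row per labeled tuple pair.
pairs = pd.read_csv("labeled_pairs.csv")

# Shuffle once, then carve out 60% / 20% / 20%, i.e., a 3:1:1 ratio.
pairs = pairs.sample(frac=1, random_state=0).reset_index(drop=True)
n = len(pairs)
train = pairs.iloc[: int(0.6 * n)]
valid = pairs.iloc[int(0.6 * n) : int(0.8 * n)]
test = pairs.iloc[int(0.8 * n) :]
```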
This dataset contains music data from iTunes and Amazon and was obtained from here. This was also created by students in the CS 784 data science class at UW-Madison. The dataset was created in the same manner as BeerAdvo-RateBeer.
This dataset contains restaurant data from Fodors and from Zagat and was obtained from here. The original dataset obtained from the source contained two tables, one each for Fodors and Zagat, and a list of golden matches indicating which tuple pairs referred to the same restaurant. To create the version of the dataset used in our experiments, which contains both matches and non-matches, we used the following procedure (a code sketch of the labeling step appears after the list):
- Given the two tables (tableA.csv & tableB.csv), perform dataset-specific blocking to obtain a candidate set C.
- For each tuple pair in set C, if the pair is present in the golden matches file (gold.csv), mark the pair as a match; otherwise, mark the pair as a non-match.
- Randomly split the labeled candidate set C into three sets (train, validation, and test) with a 3:1:1 ratio.
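The labeling step can be sketched as follows. This is only an illustration: the file name `candidates.csv` and the column names `ltable_id` and `rtable_id` are assumptions, not guaranteed to match the released files:

```python
import pandas as pd

# Candidate pairs produced by blocking; names are assumed for illustration.
cand = pd.read_csv("candidates.csv")
gold = pd.read_csv("gold.csv")

# A pair is a match iff it appears in the golden matches file.
gold_pairs = set(zip(gold["ltable_id"], gold["rtable_id"]))
cand["label"] = [
    int(pair in gold_pairs)
    for pair in zip(cand["ltable_id"], cand["rtable_id"])
]
```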
This dataset contains bibliographic data from DBLP and ACM and was obtained from here. The original dataset obtained from the source contained two tables and a list of golden matches. To create the version of the dataset used in our experiments, we used the same procedure as in the case of Fodors-Zagats.
This dataset contains bibliographic data from DBLP and Google Scholar and was obtained from here. The original dataset obtained from the source contained two tables and a list of golden matches. To create the version of the dataset used in our experiments, we used the same procedure as in the case of Fodors-Zagats.
This dataset contains product data from Amazon and Google and was obtained from here. The original dataset contained two tables and a list of golden matches. It also contained one additional attribute, "description", consisting of long blobs of text; this attribute was removed so that the dataset could be used as a structured dataset. To create the version of the dataset used in our experiments, we used the same procedure as in the case of Fodors-Zagats.
This dataset contains product data from Walmart and Amazon and was obtained from here. The original dataset contained two tables and a list of golden matches. It also contained one additional attribute, "proddescrlong", consisting of long blobs of text; this attribute was removed so that the dataset could be used as a structured dataset. To create the version of the dataset used in our experiments, we used the same procedure as in the case of Fodors-Zagats.
This dataset contains product data from Abt.com and Buy.com and was obtained from here. The original dataset contained two tables and a list of golden matches. To create the version of the dataset used in our experiments, we used the same procedure as in the case of Fodors-Zagats.
This dataset consists of pairs (a, b), where a is the text of a Wikipedia page describing a company and b is the text of a company's homepage. We created matching pairs in this dataset by crawling Wikipedia pages describing companies, then following company URLs in those pages to retrieve company homepages. To generate the non-matching pairs, for each matching pair (a, b), we fix a and form three negative pairs (a, b1), (a, b2), and (a, b3), where b1, b2, and b3 are the top-3 most similar pages other than b in the company homepage collection, calculated based on Okapi BM25 rankings.
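One way to reproduce this negative-pair construction is with an off-the-shelf BM25 implementation such as the rank_bm25 package. The choice of library and the whitespace tokenization are our assumptions; the original pipeline does not prescribe either:

```python
from rank_bm25 import BM25Okapi

def top3_negatives(wiki_text, homepages, match_idx):
    """Return indices of the 3 homepages most similar to wiki_text
    under BM25 ranking, excluding the true match at match_idx."""
    bm25 = BM25Okapi([doc.split() for doc in homepages])  # whitespace tokens
    scores = bm25.get_scores(wiki_text.split())
    ranked = sorted(range(len(homepages)), key=lambda i: scores[i], reverse=True)
    return [i for i in ranked if i != match_idx][:3]
```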
In the case of company homepages and Wikipedia pages, the title of the document often contains the name of the company. Further, the document often begins by mentioning the company name. In general, matching web documents in the wild is a complex task, and such hints are unlikely to be present. To simulate a general web-document matching scenario, we remove the first 20 tokens from the parsed HTML of each document and consider only the rest of the document for the purpose of matching. This essentially removes the document title as well as any mentions of the company name at the beginning of the first paragraph.
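The truncation itself is simple; a minimal sketch, assuming whitespace tokenization (the tokenizer is not specified in the source):

```python
def strip_leading_tokens(text, k=20):
    """Drop the first k whitespace-delimited tokens of a parsed document."""
    return " ".join(text.split()[k:])
```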
This dataset contains music data from iTunes and Amazon and was obtained by modifying the structured iTunes-Amazon dataset to simulate dirty data. Specifically, for each attribute other than "title", we randomly moved each value to the attribute "title" in the same tuple with 50% probability. This simulates a common kind of dirty data seen in the wild while keeping the modifications simple.
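The corruption step can be sketched as follows. This assumes each tuple is a dict mapping attribute names to string values, and the way a moved value is appended onto the title is our guess:

```python
import random

def make_dirty(tup, rng):
    """Move each non-title value into 'title' with 50% probability,
    leaving the source attribute empty."""
    out = dict(tup)
    for attr, value in tup.items():
        if attr == "title" or not value:
            continue
        if rng.random() < 0.5:  # 50% chance to move this value
            out["title"] = out["title"] + " " + str(value)
            out[attr] = ""
    return out

# Example: make_dirty(row, random.Random(42))
```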
This dataset contains bibliographic data from DBLP and ACM and was obtained by modifying the structured DBLP-ACM dataset to simulate dirty data. The procedure for generating this dataset is the same as that for dirty iTunes-Amazon.
This dataset contains bibliographic data from DBLP and Google Scholar and was obtained by modifying the structured DBLP-Scholar dataset to simulate dirty data. The procedure for generating this dataset is the same as that for dirty iTunes-Amazon.
This dataset contains product data from Walmart and Amazon and was obtained by modifying the structured Walmart-Amazon dataset to simulate dirty data. The procedure for generating this dataset is the same as that for dirty iTunes-Amazon.