SchemaMatching

olehmberg edited this page Apr 20, 2017 · 5 revisions

Schema Matching

Label-based schema matching

Label-based schema matching aligns the attributes of two datasets based on their names (labels).

Load two datasets from CSV files using the default model

// load data
DataSet<Record, Attribute> data1 = new HashedDataSet<>();
new CSVRecordReader(0).loadFromCSV(new File("scifi1.csv"), data1);
DataSet<Record, Attribute> data2 = new HashedDataSet<>();
new CSVRecordReader(0).loadFromCSV(new File("scifi2.csv"), data2);

Initialise the matching engine and run the label-based schema matching

// Initialize Matching Engine
MatchingEngine<Record, Attribute> engine = new MatchingEngine<>();

Processable<Correspondence<Attribute, Record>> correspondences
  = engine.runLabelBasedSchemaMatching(data1.getSchema(), data2.getSchema(), new LabelComparatorJaccard(), 0.5);

The result is an alignment (mapping), represented as correspondences between the attributes of the two schemas.

// print results
for(Correspondence<Attribute, Record> cor : correspondences.get()) {
  System.out.println(String.format("'%s' <-> '%s' (%.4f)",
    cor.getFirstRecord().getName(),
    cor.getSecondRecord().getName(),
    cor.getSimilarityScore()));
}
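
To make the role of the comparator and the threshold concrete, here is a minimal standalone sketch of Jaccard similarity on attribute labels. It does not use the framework: the class `LabelJaccardSketch`, the whitespace tokenisation, the lower-casing and the inclusive threshold check are all assumptions of this sketch, and `LabelComparatorJaccard` may behave differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class, not part of the framework.
class LabelJaccardSketch {

    // Split a label into lower-cased tokens on whitespace (an assumption of this sketch).
    static Set<String> tokens(String label) {
        Set<String> result = new HashSet<>();
        for (String token : label.trim().toLowerCase().split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    // Defined here as 0.0 when both labels have no tokens.
    static double jaccard(String label1, String label2) {
        Set<String> a = tokens(label1);
        Set<String> b = tokens(label2);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        return (double) intersection.size() / union.size();
    }

    // A pair of labels is kept only if its similarity reaches the threshold.
    static boolean matches(String label1, String label2, double threshold) {
        return jaccard(label1, label2) >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(jaccard("movie title", "title"));
        System.out.println(matches("movie title", "title", 0.5));
        System.out.println(matches("director", "release date", 0.5));
    }
}
```

In this sketch, "movie title" and "title" share one of two distinct tokens, so they score 0.5 and just pass a threshold of 0.5, while "director" and "release date" share no token and are dropped.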

Instance-based schema matching

Instance-based schema matching aligns the attributes of two datasets based on their values.

Load two datasets from CSV files using the default model

// load data
DataSet<Record, Attribute> data1 = new HashedDataSet<>();
new CSVRecordReader(-1).loadFromCSV(new File("usecase/movie/input/scifi1.csv"), data1);
DataSet<Record, Attribute> data2 = new HashedDataSet<>();
new CSVRecordReader(-1).loadFromCSV(new File("usecase/movie/input/scifi2.csv"), data2);

Initialise the matching engine

// Initialize Matching Engine
MatchingEngine<Record, Attribute> engine = new MatchingEngine<>();

Define a blocker that uses the attribute values to create pairs of potentially matching attributes.

// define a blocker that uses the attribute values to generate pairs
InstanceBasedSchemaBlocker<Record, Attribute, MatchableValue> blocker
  = new InstanceBasedSchemaBlocker<>(
    new AttributeValueGenerator(data1.getSchema()),
    new AttributeValueGenerator(data2.getSchema()));
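
The idea of value-based blocking can be sketched without the framework. The following standalone example builds an inverted index from values to attribute names and emits one candidate pair for every distinct value that a left and a right attribute share. The class `ValueBlockerSketch`, its method names and the sample columns are hypothetical, and the actual `InstanceBasedSchemaBlocker` may generate its pairs differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper class, not part of the framework.
class ValueBlockerSketch {

    // Inverted index: value -> names of the attributes in which the value occurs.
    static Map<String, Set<String>> index(Map<String, List<String>> columns) {
        Map<String, Set<String>> byValue = new HashMap<>();
        for (Map.Entry<String, List<String>> column : columns.entrySet()) {
            for (String value : column.getValue()) {
                byValue.computeIfAbsent(value, v -> new TreeSet<>()).add(column.getKey());
            }
        }
        return byValue;
    }

    // Emit one (left attribute, right attribute) pair for every distinct value
    // that occurs in both attributes.
    static List<Map.Entry<String, String>> pairs(Map<String, List<String>> left,
                                                 Map<String, List<String>> right) {
        Map<String, Set<String>> leftIndex = index(left);
        Map<String, Set<String>> rightIndex = index(right);
        List<Map.Entry<String, String>> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : leftIndex.entrySet()) {
            Set<String> rightAttributes = rightIndex.get(entry.getKey());
            if (rightAttributes == null) {
                continue; // value does not occur in the right dataset
            }
            for (String l : entry.getValue()) {
                for (String r : rightAttributes) {
                    result.add(Map.entry(l, r));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample columns, keyed by attribute name.
        Map<String, List<String>> left = Map.of(
            "title", List.of("Alien", "Blade Runner"),
            "year", List.of("1979", "1982"));
        Map<String, List<String>> right = Map.of(
            "name", List.of("Alien", "Blade Runner", "Dune"),
            "released", List.of("1982", "1984"));
        System.out.println(pairs(left, right));
    }
}
```

Attribute pairs that share no value never appear in the output, so they are never scored.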

Define an aggregator that calculates a similarity score for each pair of attributes based on all values that match between them.

// to calculate the similarity score, aggregate the pairs by counting and
// normalise by the number of records in the smaller dataset
// (= the maximum number of records that can match)
VotingAggregator<Attribute, MatchableValue> aggregator
  = new VotingAggregator<>(
    false,
    Math.min(data1.size(), data2.size()),
    0.0);
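
The counting and normalisation described in the comment can be illustrated with a small standalone sketch. It assumes that every blocked pair counts as one vote, that the score is the vote count divided by the maximum number of votes, and that pairs scoring below the threshold are dropped. The class `VoteAggregationSketch`, the `"left|right"` string keys and the inclusive threshold check are assumptions of this sketch, not the behaviour of `VotingAggregator`.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the framework.
class VoteAggregationSketch {

    // Each element of votes names one attribute pair that shares a value, e.g. "title|name".
    // The score of a pair is its vote count divided by maxVotes.
    static Map<String, Double> aggregate(List<String> votes, int maxVotes, double threshold) {
        if (maxVotes <= 0) {
            throw new IllegalArgumentException("maxVotes must be positive");
        }
        // Count the votes per attribute pair, keeping first-seen order.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String pair : votes) {
            counts.merge(pair, 1, Integer::sum);
        }
        // Normalise the counts and drop pairs below the threshold.
        Map<String, Double> scores = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            double score = (double) entry.getValue() / maxVotes;
            if (score >= threshold) {
                scores.put(entry.getKey(), score);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> votes = List.of("title|name", "title|name", "year|released");
        // Sizes of two made-up datasets; the smaller one bounds the number of matches.
        int maxVotes = Math.min(2, 3);
        System.out.println(aggregate(votes, maxVotes, 0.0));
    }
}
```

With these inputs, `title|name` receives 2 of 2 possible votes and `year|released` receives 1 of 2. A threshold of 0.0 keeps every pair that received at least one vote.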

Run the instance-based schema matching via the matching engine

// run the matching
Processable<Correspondence<Attribute, MatchableValue>> correspondences
  = engine.runInstanceBasedSchemaMatching(data1, data2, blocker, aggregator);

Finally, print the results to the console

// print results
for(Correspondence<Attribute, MatchableValue> cor : correspondences.get()) {
  System.out.println(String.format("'%s' <-> '%s' (%.4f)",
    cor.getFirstRecord().getName(),
    cor.getSecondRecord().getName(),
    cor.getSimilarityScore()));
}

Duplicate-based schema matching

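Duplicate-based schema matching aligns the attributes of two datasets by comparing their values in records that are known to be duplicates.

Load two datasets with different schemas and overlapping records.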

// load data
DataSet<Record, Attribute> data1 = new HashedDataSet<>();
new CSVRecordReader().loadFromCSV(new File("usecase/movie/input/scifi1.csv"), data1);
DataSet<Record, Attribute> data2 = new HashedDataSet<>();
new CSVRecordReader().loadFromCSV(new File("usecase/movie/input/scifi2.csv"), data2);

Create a set of duplicates. For simplicity, assume the duplicates have the same ID value in both datasets.

// create duplicates based on the record id (first column in both files)
LinearCombinationMatchingRule<Record, Attribute> duplicateRule = new LinearCombinationMatchingRule<>(1.0);
duplicateRule.addComparator((r1,r2,c) -> r1.getValue(data1.getAttribute("0")).equals(r2.getValue(data2.getAttribute("0"))) ? 1.0 : 0.0, 1.0);

Initialise the matching engine and create the duplicates

// Initialize Matching Engine
MatchingEngine<Record, Attribute> engine = new MatchingEngine<>();

// create the duplicates
Result<Correspondence<Record, Attribute>> duplicates = engine.runIdentityResolution(data1, data2, null, duplicateRule, new NoBlocker<>());
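
What this step computes can be sketched in plain Java. Without blocking, every record of the first dataset is compared with every record of the second, and a pair is kept when the ID values are equal. The class `IdDuplicateSketch`, the `String[]` record representation and the sample rows are assumptions of this sketch and are unrelated to the framework's data model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the framework.
class IdDuplicateSketch {

    // Compare every record of data1 with every record of data2 (no blocking) and
    // return the index pairs {i, j} whose ID column holds the same non-null value.
    static List<int[]> duplicates(List<String[]> data1, List<String[]> data2, int idColumn) {
        List<int[]> result = new ArrayList<>();
        for (int i = 0; i < data1.size(); i++) {
            String id1 = data1.get(i)[idColumn];
            for (int j = 0; j < data2.size(); j++) {
                String id2 = data2.get(j)[idColumn];
                if (id1 != null && id1.equals(id2)) {
                    result.add(new int[] {i, j});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample rows; the first column is the record ID.
        List<String[]> data1 = List.of(
            new String[] {"1", "Alien"},
            new String[] {"2", "Dune"});
        List<String[]> data2 = List.of(
            new String[] {"2", "Dune (1984)"},
            new String[] {"3", "Solaris"});
        for (int[] pair : duplicates(data1, data2, 0)) {
            System.out.println(pair[0] + " <-> " + pair[1]);
        }
    }
}
```

Comparing all pairs is quadratic in the number of records, which is acceptable for small inputs only.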

Define the rule for duplicate-based schema matching. Here, the similarity function only accepts exact matches.

// define the schema matching rule
SchemaMatchingRuleWithVoting<Record, Attribute, Attribute> schemaRule = new DuplicateBasedSchemaMatchingRule<>(
	(a1,a2,c) -> {
		String value1 = c.getFirstRecord().getValue(a1);
		String value2 = c.getSecondRecord().getValue(a2);

		if(value1!=null && value2!=null && value1.equals(value2)) {
			return 1.0;
		} else {
			return 0.0;
		}
	},
	1.0);

Run the matching with the rule defined above.

// Execute the matching
Result<Correspondence<Attribute, Record>> correspondences = engine.runDuplicateBasedSchemaMatching(data1.getSchema(), data2.getSchema(), duplicates, schemaRule);

Finally, print the results to the console

// print results
for(Correspondence<Attribute, Record> cor : correspondences.get()) {
	System.out.println(
		String.format("'%s' <-> '%s' (%.4f)",
			cor.getFirstRecord().getName(),
			cor.getSecondRecord().getName(),
			cor.getSimilarityScore()));
}
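
The voting behind duplicate-based schema matching can be sketched as follows. For each pair of duplicate records, every combination of attributes is compared with the exact-match similarity used above, and the votes are averaged over all duplicates. The class `DuplicateVotingSketch`, the array-based record representation and the normalisation by the number of duplicate pairs are assumptions of this sketch. The framework may aggregate the votes differently.

```java
import java.util.List;

// Hypothetical helper class, not part of the framework.
class DuplicateVotingSketch {

    // Exact-match similarity: 1.0 only for equal, non-null values.
    static double similarity(String value1, String value2) {
        return (value1 != null && value2 != null && value1.equals(value2)) ? 1.0 : 0.0;
    }

    // duplicates: each element is {record from dataset 1, record from dataset 2}.
    // Returns scores[i][j], the fraction of duplicates in which column i of the
    // first record equals column j of the second record.
    static double[][] match(List<String[][]> duplicates, int columns1, int columns2) {
        double[][] scores = new double[columns1][columns2];
        if (duplicates.isEmpty()) {
            return scores; // no evidence, all scores stay 0.0
        }
        for (String[][] pair : duplicates) {
            for (int i = 0; i < columns1; i++) {
                for (int j = 0; j < columns2; j++) {
                    scores[i][j] += similarity(pair[0][i], pair[1][j]);
                }
            }
        }
        for (int i = 0; i < columns1; i++) {
            for (int j = 0; j < columns2; j++) {
                scores[i][j] /= duplicates.size();
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        // Made-up duplicates: dataset 1 is (id, title, year), dataset 2 is (id, year, title).
        List<String[][]> duplicates = List.of(
            new String[][] {{"1", "Alien", "1979"}, {"1", "1979", "Alien"}},
            new String[][] {{"2", "Dune", "1984"}, {"2", "1984", "Dune"}});
        double[][] scores = match(duplicates, 3, 3);
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                System.out.println(String.format("%d <-> %d (%.4f)", i, j, scores[i][j]));
            }
        }
    }
}
```

In the sample data, the title and year columns are swapped between the two datasets, so the highest scores fall on the crossed column pairs rather than on equal positions.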