Train using manually labelled pairs #58

lmores · 2024-09-06T09:08:16Z

lmores
Sep 6, 2024

I am playing with mismo to deduplicate postal addresses and I would like to try training the Fellegi-Sunter model using manually labelled pairs of addresses.
Looking at the documentation of mismo.fs._train.train_ms_from_labels, the input table holding my records must have a column named label_true. To extract true matches, the code joins the input table with itself on the label_true column. However, this approach does not fit the procedure that generated my manually labelled pairs.

To manually label my pairs, I show human annotators a pair of addresses and they can mark it as a "match" or a "non-match" (or skip it). This means that the following situation may happen:

address A and address B are marked as a match
address A and address C are marked as a match
address B and address C are marked as a non-match

i.e., being a match is not a transitive relation when data is acquired in this way!
Although this may be considered an issue with the procedure used to acquire match/non-match information, it is probably the simplest way to acquire this kind of information (I implemented it back when I was experimenting with the dedupe library).
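To make the A/B/C situation above concrete, here is a minimal sketch that scans a set of labelled pairs for non-transitive triples. The `labels` dictionary and both helper functions are hypothetical names used only for illustration; they are not part of mismo or dedupe.

```python
from itertools import combinations

# Hypothetical labelled pairs: (record_a, record_b) -> True for "match",
# False for "non-match". Skipped pairs are simply absent.
labels = {
    ("A", "B"): True,
    ("A", "C"): True,
    ("B", "C"): False,
}


def label_of(labels, x, y):
    """Return the label for an unordered pair, or None if it was never labelled."""
    if (x, y) in labels:
        return labels[(x, y)]
    return labels.get((y, x))


def inconsistent_triples(labels):
    """Find triples where two pairs are matches but the third is a non-match."""
    records = sorted({r for pair in labels for r in pair})
    found = []
    for x, y, z in combinations(records, 3):
        trio = [label_of(labels, x, y), label_of(labels, x, z), label_of(labels, y, z)]
        if trio.count(True) == 2 and trio.count(False) == 1:
            found.append((x, y, z))
    return found


print(inconsistent_triples(labels))  # [('A', 'B', 'C')]
```

The brute-force loop over all triples is only practical for small labelled sets, which is usually the case for manually labelled data.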

However, storing labelled pairs in this way does not fit well with the join performed in mismo, which implies that the "match" relationship is transitive.
The only way I can think of to re-use my labelled pairs in mismo is to first partition my addresses into clusters of equivalent addresses, assign a unique key to each cluster, and use that key in the label_true column... which is non-trivial.
E.g. supposing that my dataset is made only of addresses A, B and C, I should decide to partition them in one of the following ways:

{A, B}, {C}
{A, C}, {B}
{A, B, C}

Questions:

Am I right? Is my set of labelled pairs not suited to be used by mismo?
Can mismo take advantage of the information that a pair is a non-match, or is it only interested in the matching pairs?

lmores · 2024-09-27T10:35:16Z

lmores
Sep 27, 2024
Author

@NickCrews, sorry to bother, could you just confirm whether having pairs of records labelled as "match" or "non-match" can be useful and used for training?


NickCrews · 2024-09-27T17:17:42Z

NickCrews
Sep 27, 2024
Maintainer

Sorry for the slow response. At the end of the day, the Fellegi-Sunter model operates on pairs; at no point does it deal with clusters. So what happens when you supply labeled records is:

1. These are converted to match/no match pairs.
2. These pairs are fed through the matcher to find the match level.
3. Now that we have both the match level and the true label, we can calculate the m and u parameters.

So, you can bypass step 1 and do steps 2 and 3 yourself. If you look at the code it shouldn't be too complex??? Once you have m and u, I think you should be good to go (if not, the API needs to be tweaked).
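As a rough illustration of step 3, here is a minimal sketch that estimates m and u by counting, using the standard Fellegi-Sunter definitions m = P(level | match) and u = P(level | non-match). It uses plain Python rather than mismo's API; the `labelled` list, the level names, and `estimate_m_u` are all hypothetical, and it assumes step 2 has already assigned a match level to every labelled pair.

```python
from collections import Counter

# Hypothetical input: one (match_level, is_match) tuple per labelled pair,
# where match_level is whatever the comparison step assigned to that pair.
labelled = [
    ("exact", True),
    ("exact", True),
    ("close", True),
    ("else", True),
    ("exact", False),
    ("close", False),
    ("else", False),
    ("else", False),
]


def estimate_m_u(labelled, levels):
    """Estimate m = P(level | match) and u = P(level | non-match) by counting."""
    match_counts = Counter(level for level, is_match in labelled if is_match)
    non_match_counts = Counter(level for level, is_match in labelled if not is_match)
    n_match = sum(match_counts.values())
    n_non_match = sum(non_match_counts.values())
    if n_match == 0 or n_non_match == 0:
        raise ValueError("need at least one match and one non-match")
    m = {level: match_counts[level] / n_match for level in levels}
    u = {level: non_match_counts[level] / n_non_match for level in levels}
    return m, u


m, u = estimate_m_u(labelled, ["exact", "close", "else"])
print(m)  # {'exact': 0.5, 'close': 0.25, 'else': 0.25}
print(u)  # {'exact': 0.25, 'close': 0.25, 'else': 0.5}
```

With few labelled pairs, some levels may get a probability of exactly zero, so a real implementation would probably want some form of smoothing before turning these into weights.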

That is just a workaround though. Going from labeled pairs to a model should be built in, something like train_from_labeled_pairs. Open questions there:

Do we track non-matches explicitly, or do we only pass matches and assume anything not passed is a non-match? The number of non-matches in general is O(n**2), so we can't pass all of them. Do we only want the user to pass non-matches from post-blocking? Then the number of non-matches would be tenable, and it would also be more representative of the pairs that the comparison function would see.
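To put numbers on the O(n**2) concern, the sketch below counts unordered pairs with and without blocking. The blocking keys are made-up postal codes and `blocked_pairs` is a hypothetical helper; it assumes the simplest kind of blocking, where only records sharing a key are compared.

```python
from collections import Counter
from math import comb


def all_pairs(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return comb(n, 2)


def blocked_pairs(block_keys):
    """Number of unordered pairs that share a blocking key."""
    return sum(comb(size, 2) for size in Counter(block_keys).values())


print(all_pairs(1_000_000))  # 499999500000

# Hypothetical blocking keys (e.g. a postal code), one per record.
keys = ["10115", "10115", "10115", "20095", "20095", "80331"]
print(all_pairs(len(keys)))  # 15
print(blocked_pairs(keys))   # 4
```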

More philosophically, you are right that translating between cluster labels and pair labels is semantically tricky. The same goes for the other direction, e.g. using connected components is only one possible algorithm out of literally infinitely many you could use. There are far fewer reasonable algorithms for going from clusters to pairs, e.g. I think the one we use here is the only reasonable one? Even so, the transitive link issue remains a problem. This ambiguity is why we should support training from pair labels, not just cluster labels.

1 reply

lmores Oct 11, 2024
Author

If you look at the code it shouldn't be too complex???

I'll try and let you know (sounds easy enough).

About the open question, I think we should track non-matches explicitly, for the following reason: suppose you have a medium/big/huge dataset of records, you initially train mismo with EM, but then you want to refine/improve the weights. You could ask the user to manually label some of the most dubious pairs (which cannot be the whole set of all possible pairs). In this case we need to record both matching and non-matching pairs (and all the other pairs will be unknown). I am thinking in this direction because I am a past user of Dedupe and this mechanism is at its core. Does this also make sense for mismo?

Going from clusters to pairs by considering each pair within the same cluster as a match and all the others as non-matches seems very reasonable! (Off the top of my head, I can't say if this is what is currently implemented in mismo, but it seems the most obvious thing.)
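The cluster-to-pairs rule described above can be sketched in a few lines. This is only an illustration of the rule, not mismo's implementation; the `label_true` dictionary (record id to cluster label) and `pairs_from_clusters` are hypothetical.

```python
from itertools import combinations

# Hypothetical records: record id -> cluster label (the role label_true plays).
label_true = {"A": 1, "B": 1, "C": 2, "D": 2, "E": 3}


def pairs_from_clusters(label_true):
    """Label every unordered pair: True if both records share a cluster label."""
    return {
        (a, b): label_true[a] == label_true[b]
        for a, b in combinations(sorted(label_true), 2)
    }


pairs = pairs_from_clusters(label_true)
matches = [pair for pair, is_match in pairs.items() if is_match]
print(matches)                    # [('A', 'B'), ('C', 'D')]
print(len(pairs) - len(matches))  # 8
```

Note that this enumerates every pair, so on a real dataset the non-matches would have to be restricted (for example to post-blocking pairs) rather than materialised in full.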

As for the reverse direction (going from pairs to clusters), I agree that connected components is just one of many possible approaches, and its biggest drawback is that it can create very large components containing many non-transitive pairs. However, it should be just fine for the time being.
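For reference, here is a minimal sketch of the connected-components approach using union-find, together with a check for "non-match" labels that end up inside a single component. The function name and the sample pairs are hypothetical; this is one possible implementation, not what mismo does.

```python
def connected_components(match_pairs):
    """Group records into clusters by following "match" links (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in match_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for record in parent:
        clusters.setdefault(find(record), set()).add(record)
    return list(clusters.values())


matches = [("A", "B"), ("A", "C"), ("D", "E")]
non_matches = [("B", "C")]

clusters = connected_components(matches)
print(sorted(sorted(c) for c in clusters))  # [['A', 'B', 'C'], ['D', 'E']]

# Non-match labels that the clustering silently overrides.
conflicts = [
    (a, b) for a, b in non_matches
    if any(a in c and b in c for c in clusters)
]
print(conflicts)  # [('B', 'C')]
```

Counting such conflicts per component gives a quick sense of how much labelled information is lost when pair labels are collapsed into cluster labels.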

lmores · 2024-10-18T10:34:24Z

lmores
Oct 18, 2024
Author

Quick report about the differences between training with EM and training using manually labelled pairs (see PR #73)

Train using EM:

Train using manually labelled pairs:

