-
@NickCrews, sorry to bother you, but could you confirm whether pairs of records labelled as "match" or "non-match" can be used for training?
-
Sorry for the slow response. At the end of the day, the Fellegi-Sunter model operates on pairs; at no point does it deal with clusters. So what happens when you supply labeled records is
So, you can bypass step 1 and do steps 2 and 3 yourself. If you look at the code, it shouldn't be too complex. Once you have m and u, I think you should be good to go (if not, the API needs to be tweaked). That is just a workaround, though. Going from labeled pairs to a model should be built in, with something like train_from_labeled_pairs. Open questions there:
More philosophically, you are right that translating between cluster labels and pair labels is semantically tricky. The same goes for the other direction: using connected components is only one of infinitely many algorithms you could use to go from pairs to clusters. There are far fewer reasonable algorithms for going from clusters to pairs; I think the one we use here is the only reasonable one. Still, the transitive link issue remains a problem. This ambiguity is why we should support training from pair labels, not just cluster labels.
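As a rough illustration of the "once you have m and u" remark, here is a minimal sketch of estimating m and u directly from labelled pairs by counting. It is plain Python and does not use mismo's API: the function name `estimate_m_u`, the level names and the `smoothing` parameter are all hypothetical, and it assumes each pair has already been reduced to one comparison level.

```python
from collections import Counter


def estimate_m_u(labelled_pairs, smoothing=1.0):
    """Estimate m and u for one comparison from labelled pairs.

    labelled_pairs: iterable of (level, is_match) tuples, where `level` is the
    comparison level the pair fell into (hypothetical names such as "exact").
    Returns two dicts: m[level] = P(level | match), u[level] = P(level | non-match).
    """
    match_counts = Counter()
    nonmatch_counts = Counter()
    for level, is_match in labelled_pairs:
        if is_match:
            match_counts[level] += 1
        else:
            nonmatch_counts[level] += 1

    levels = sorted(set(match_counts) | set(nonmatch_counts))

    def normalise(counts):
        # Additive smoothing keeps unseen levels away from probability 0.
        total = sum(counts.values()) + smoothing * len(levels)
        return {lvl: (counts[lvl] + smoothing) / total for lvl in levels}

    return normalise(match_counts), normalise(nonmatch_counts)


# Hypothetical labelled pairs for a single address comparison.
pairs = [
    ("exact", True), ("exact", True), ("close", True), ("else", True),
    ("exact", False), ("else", False), ("else", False), ("else", False),
]
m, u = estimate_m_u(pairs)
```

Nothing here needs cluster labels: each labelled pair contributes one count, so a non-transitive set of labels is not a problem for this step.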
-
Quick report about the differences between training with EM and training using manually labelled pairs (see PR #73).
-
I am playing with mismo to deduplicate postal addresses and I would like to try training the Fellegi-Sunter model using manually labelled pairs of addresses.
Looking at the documentation of mismo.fs._train.train_ms_from_labels, the input table holding my records must have a column named label_true. To extract true matches, the code joins the input table with itself on the label_true column. However, this approach does not fit the procedure that generated my manually labelled pairs.
To label my pairs manually, I show a human a pair of addresses, and they can mark the pair as a "match" or a "non-match" (or skip it). This means that the following situation may happen:
i.e., being a match is not a transitive relation when data is acquired in this way!
Although this may be considered an issue with the procedure used to acquire match/non-match information, it is probably the simplest way to acquire this kind of information (I implemented it back when I was experimenting with the dedupe library).
However, storing labelled pairs in this way does not fit well with the join performed in mismo, which implies that the "match" relationship is transitive.
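To make the transitivity point concrete, here is a small sketch of what a self-join on a cluster key amounts to. It is a plain-Python stand-in, not mismo's actual code: the function name `pairs_from_cluster_labels` and the record ids are hypothetical, and the dict values play the role of label_true.

```python
from collections import defaultdict
from itertools import combinations


def pairs_from_cluster_labels(labels):
    """Expand cluster labels into the set of matching pairs.

    labels: dict mapping record id -> cluster key (the role of label_true).
    Every two records sharing a key become a match, as a self-join would do.
    """
    clusters = defaultdict(list)
    for record_id, key in labels.items():
        clusters[key].append(record_id)

    matches = set()
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            matches.add((a, b))
    return matches


# Putting A, B and C under one key forces all three pairs to be matches.
all_linked = pairs_from_cluster_labels({"A": 1, "B": 1, "C": 1})
# Splitting C off drops both pairs involving C at once.
c_apart = pairs_from_cluster_labels({"A": 1, "B": 1, "C": 2})
```

No assignment of keys produces the pairs (A, B) and (B, C) without also producing (A, C), which is why pair labels collected one at a time cannot always be encoded this way.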
The only way I can think of to re-use my labelled pairs in mismo is to first partition my addresses into clusters of equivalent addresses, assign a unique key to each cluster, and use that key in the label_true column... which is non-trivial.
E.g., supposing that my dataset consists only of addresses A, B and C, I would have to decide to partition them in one of the following ways:
{A, B}, {C}
{A, C}, {B}
{A, B, C}
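One possible way to make that choice mechanically is to take connected components over the "match" pairs and then report which "non-match" labels the result contradicts. The sketch below does this with a small union-find. It is only one of several reasonable strategies, and the function name `cluster_labelled_pairs` and the example labels are hypothetical.

```python
def cluster_labelled_pairs(pair_labels):
    """Cluster records by connected components over the match pairs.

    pair_labels: dict mapping (a, b) -> True for "match", False for "non-match".
    Returns (clusters, conflicts), where conflicts lists the non-match pairs
    whose two records still ended up in the same cluster.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), is_match in pair_labels.items():
        root_a, root_b = find(a), find(b)  # also registers both records
        if is_match:
            parent[root_a] = root_b

    clusters = {}
    for record in parent:
        clusters.setdefault(find(record), set()).add(record)

    conflicts = [
        pair
        for pair, is_match in pair_labels.items()
        if not is_match and find(pair[0]) == find(pair[1])
    ]
    return list(clusters.values()), conflicts


# Hypothetical non-transitive labels: A~B and B~C, but A and C marked non-match.
labels = {("A", "B"): True, ("B", "C"): True, ("A", "C"): False}
clusters, conflicts = cluster_labelled_pairs(labels)
```

With these labels the components collapse into the single cluster {A, B, C}, and the (A, C) non-match is reported as a conflict, so the human judgement on that pair is lost unless it is handled separately.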
Questions: