Replies: 3 comments 2 replies
-
Thanks @jstammers, this is definitely a step that everyone using the FS model is going to have to go through. I don't have a lot of experience with it, since in my app that uses mismo I switched from FS to simple hardcoded rules for better explainability. But I want to support this experience better. I assume you're using FS, and using one of the above methods? What sort of labeled data do you have?

One of the big benefits of the FS model (at least in my eyes; there are other great properties as well) is that it can be trained in an unsupervised way using expectation maximization, relying on the assumption that "most record pairs should either strongly agree or strongly disagree, but not many will have medium agreement". This is what splink does. I would love to make it so that this threshold-choosing step could also be unsupervised, so you never have to use labeled data at any step. Do you know of any methods for choosing a threshold in an unsupervised manner? Perhaps using a similar EM technique to let the machine determine which pairs are true and false matches?

The other thing to watch out for here is that the characteristics of your domain really have an impact on your precision and recall: so if you are working in
So I want to do more research, but I wonder if it would be better/possible to use CLUSTER metrics instead of PAIR measurements to determine the threshold, since (I think this is a good assumption?) the next step after everyone filters their pairs is to run some sort of clustering algorithm, e.g. connected components.
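Just to make the EM idea above concrete, here's a minimal sketch of what an unsupervised threshold choice could look like: fit a two-component Gaussian mixture to the pair scores (EM under the hood) and put the threshold where the "match" component overtakes the "non-match" one. The `log_odds` array and the helper are assumptions for illustration, not anything mismo currently has:

```python
# Hypothetical sketch: choose a threshold without labels by fitting a
# 2-component Gaussian mixture to the pair scores (fit via EM).
import numpy as np
from sklearn.mixture import GaussianMixture

def unsupervised_threshold(log_odds: np.ndarray) -> float:
    """Return the score where the 'match' component becomes more likely."""
    X = log_odds.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    match = int(np.argmax(gmm.means_))  # component with the higher mean = matches
    # Scan candidate thresholds and take the first score where the
    # posterior probability of the match component exceeds 0.5.
    grid = np.linspace(log_odds.min(), log_odds.max(), 1000).reshape(-1, 1)
    post = gmm.predict_proba(grid)[:, match]
    return float(grid[np.argmax(post > 0.5)])
```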
-
To explain my app a little further, I'm looking to de-duplicate records of stores that have standard fields (Name, Address components, Phone number, website, lat/long, etc.). I have access to some labelled duplicates that have either been manually inspected or produced via an automated process that I've yet to fully understand. Later on, I'm looking to implement a model to link from one (messy) dataset to another (clean) dataset, but we have automated rules for that so it's less of a priority. In both cases, we expect dupes to be fairly unlikely and matches to be (close to) one-to-one, so the data are definitely in the A category you've described, where high precision is likely but recall is the main objective.

I agree that having an unsupervised method to determine an optimum threshold would be very useful. I'm less familiar with cluster metrics, and given my relatively large amount of labelled data, I decided to go for the quickest approach of using a set of known labels as follows:

```python
weights = train_using_labels(comparers, table, table, ...)
blocked = blocker(table)
compared = compare(blocked)
scored = weights.score_compared(compared)
optimum_threshold = find_optimum_threshold(scored, table, method="pr")
links = scored.filter(_.odds > optimum_threshold).select("record_id_l", "record_id_r")
clusters = connected_components(links=links.cache(), records=table.cache())
labels_pred = clusters.rename({"label": "component"})
labels_true = table.rename({"label": "label_true"})
evaluate(labels_true=labels_true, labels_pred=labels_pred)
```

I'm not too familiar with many unsupervised algorithms to determine an optimum, but if we had a measure of the distance between records then something like the silhouette score could work. For text fields, we could use something like … It's not clear to me if we would need to compute the distances between every pair of records, or if we can just use the pairs that have been blocked together.
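For what it's worth, here's a rough sketch of the silhouette idea under the assumption that we only have distances for blocked pairs: treat pairs that were never blocked together as maximally distant, build a dense distance matrix (so this only scales to a sample of records), and score the predicted clusters. The inputs and helper name are made up for illustration:

```python
# Hypothetical sketch: silhouette score over blocked-pair distances,
# treating pairs that were never blocked together as maximally distant.
import numpy as np
from sklearn.metrics import silhouette_score

def blocked_silhouette(pair_dists: dict[tuple[int, int], float],
                       cluster_of: dict[int, int],
                       max_dist: float = 1.0) -> float:
    """pair_dists maps (record_i, record_j) -> distance for blocked pairs only."""
    ids = sorted(cluster_of)
    idx = {r: k for k, r in enumerate(ids)}
    n = len(ids)
    D = np.full((n, n), max_dist)  # unblocked pairs default to the max distance
    np.fill_diagonal(D, 0.0)
    for (i, j), d in pair_dists.items():
        D[idx[i], idx[j]] = D[idx[j], idx[i]] = d
    labels = np.array([cluster_of[r] for r in ids])
    return silhouette_score(D, labels, metric="precomputed")
```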
-
Perhaps another way to think about this is to take inspiration from this exercise, which describes an algorithm to label each tuple of comparison levels as a match, possible match or non-match. Practically speaking, it seems better to be able to define a region of uncertainty which would require some manual review. @NickCrews I'd be interested to hear your thoughts on whether it's worth implementing this in mismo:

```python
def classify_compared(self, table: ir.Table, fpr: float, fnr: float):
    """Classify compared pairs into `match`, `possible match` and `unmatch` according to error rate tolerances"""
```
-
After training an FS model and scoring candidate pairs, there remains a manual task of determining a threshold above which pairs of records are assigned to the same cluster. It could be useful to have a method of learning an optimal threshold from labelled examples.
One way to do this is ROC curve analysis. Different methods exist for determining the optimal threshold, but perhaps the simplest is to find the threshold that maximises the difference between the true positive rate and the false positive rate (Youden's J statistic), as in the sketch below.
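A minimal sketch of that approach using scikit-learn, assuming `scores` are the pairwise match scores and `y_true` the binary labels (the names are illustrative):

```python
# Sketch: pick the threshold maximising TPR - FPR (Youden's J).
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true: np.ndarray, scores: np.ndarray) -> float:
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return float(thresholds[np.argmax(tpr - fpr)])
```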
In the case of imbalanced data, which is likely when blocking still produces far more non-matching candidate pairs than matching ones, the precision-recall curve may be more informative. The optimal threshold could then be chosen to maximise the F1-score, the harmonic mean of precision and recall, as in the sketch below.
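A comparable sketch for the precision-recall approach, with the same assumed inputs:

```python
# Sketch: pick the threshold maximising F1 along the precision-recall curve.
import numpy as np
from sklearn.metrics import precision_recall_curve

def f1_threshold(y_true: np.ndarray, scores: np.ndarray) -> float:
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have one more entry than thresholds; drop the last point.
    f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
    return float(thresholds[np.argmax(f1)])
```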