Replies: 3 comments 2 replies
-
Thanks @jstammers, this is definitely a step that everyone using the FS model is going to have to go through. I don't have a lot of experience with it, since in my app that uses mismo I switched from FS to simple hardcoded rules for better explainability. But I want to support this experience better. I assume you're using FS, and using one of the above methods? What sort of labeled data do you have?

One of the big benefits of the FS model (at least in my eyes; there are other great properties as well) is that it can be trained in an unsupervised way using expectation maximization, relying on the assumption that "most record pairs should either strongly agree or strongly disagree, but not many will have medium agreement". This is what splink does. I would love to make it so that this threshold-choosing step could also be unsupervised, so you never have to use labeled data at any step. Do you know of any methods for choosing a threshold in an unsupervised manner? Perhaps using a similar EM technique to let the machine determine which pairs are true and false matches?

The other thing to watch out for here is that the characteristics of your domain really have an impact on your precision and recall: so if you are working in
So I want to do more research, but I wonder if it would be better/possible to use CLUSTER metrics instead of PAIR measurements to determine the threshold, since (I think this is a good assumption?) the next step after everyone filters their pairs is to run some sort of clustering algorithm, e.g. connected components.
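Just to make the EM idea above concrete, here's a minimal sketch of what an unsupervised threshold choice could look like: fit a two-component Gaussian mixture to the pair scores (EM under the hood) and put the threshold where the "match" component overtakes the "non-match" one. The `log_odds` array and the helper are assumptions for illustration, not anything mismo currently has:

```python
# Hypothetical sketch: choose a threshold without labels by fitting a
# 2-component Gaussian mixture to the pair scores (fit via EM).
import numpy as np
from sklearn.mixture import GaussianMixture

def unsupervised_threshold(log_odds: np.ndarray) -> float:
    """Return the score where the 'match' component becomes more likely."""
    X = log_odds.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    match = int(np.argmax(gmm.means_))  # component with the higher mean = matches
    # Scan candidate thresholds and take the first score where the
    # posterior probability of the match component exceeds 0.5.
    grid = np.linspace(log_odds.min(), log_odds.max(), 1000).reshape(-1, 1)
    post = gmm.predict_proba(grid)[:, match]
    return float(grid[np.argmax(post > 0.5)])
```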
-
To explain my app a little further, I'm looking to de-duplicate records of stores that have standard fields (Name, Address components, Phone number, website, lat/long, etc.). I have access to some labelled duplicates that have either been manually inspected or produced via an automated process that I've yet to fully understand. Later on, I'm looking to implement a model to link from one (messy) dataset to another (clean) dataset, but we have automated rules for that so it's less of a priority. In both cases, we expect dupes to be fairly unlikely and matches to be (close to) one-to-one, so the data are definitely in the A category you've described, where high precision is likely but recall is the main objective.

I agree that having an unsupervised method to determine an optimum threshold would be very useful. I'm less familiar with cluster metrics, and given my relatively large amount of labelled data, I decided to go for the quickest approach of using a set of known labels as follows:

```python
weights = train_using_labels(comparers, table, table, ...)
blocked = blocker(table)
compared = compare(blocked)
scored = weights.score_compared(compared)
optimum_threshold = find_optimum_threshold(scored, table, method="pr")
links = scored.filter(_.odds > optimum_threshold).select("record_id_l", "record_id_r")
clusters = connected_components(links=links.cache(), records=table.cache())
labels_pred = clusters.rename({"label": "component"})
labels_true = table.rename({"label": "label_true"})
evaluate(labels_true=labels_true, labels_pred=labels_pred)
```

I'm not too familiar with many unsupervised algorithms to determine an optimum, but if we had a measure of the distance between records then something like the silhouette score could work. For text fields, we could use something like … It's not clear to me if we would need to compute the distances between every pair of records, or if we can just use the pairs that have been blocked together.
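For what it's worth, here's a rough sketch of the silhouette idea under the assumption that we only have distances for blocked pairs: treat pairs that were never blocked together as maximally distant, build a dense distance matrix (so this only scales to a sample of records), and score the predicted clusters. The inputs and helper name are made up for illustration:

```python
# Hypothetical sketch: silhouette score over blocked-pair distances,
# treating pairs that were never blocked together as maximally distant.
import numpy as np
from sklearn.metrics import silhouette_score

def blocked_silhouette(pair_dists: dict[tuple[int, int], float],
                       cluster_of: dict[int, int],
                       max_dist: float = 1.0) -> float:
    """pair_dists maps (record_i, record_j) -> distance for blocked pairs only."""
    ids = sorted(cluster_of)
    idx = {r: k for k, r in enumerate(ids)}
    n = len(ids)
    D = np.full((n, n), max_dist)  # unblocked pairs default to the max distance
    np.fill_diagonal(D, 0.0)
    for (i, j), d in pair_dists.items():
        D[idx[i], idx[j]] = D[idx[j], idx[i]] = d
    labels = np.array([cluster_of[r] for r in ids])
    return silhouette_score(D, labels, metric="precomputed")
```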
-
Perhaps another way to think about this is to take inspiration from this exercise, which describes an algorithm to label each tuple of comparison levels as a match, possible match or non-match. Practically speaking, it seems better to be able to define a region of uncertainty which would require some manual review. @NickCrews I'd be interested to hear your thoughts on whether it's worth implementing this in mismo:

```python
def classify_compared(self, table: ir.Table, fpr: float, fnr: float):
    """Classify compared pairs into `match`, `possible match` and `unmatch` according to error rate tolerances"""
```
-
After training an FS model and scoring candidate pairs, there remains a manual task of determining a threshold above which pairs of records are assigned to the same cluster. It could be useful to have a method of learning an optimal threshold from labelled examples.
One way to do this is ROC curve analysis. Different methods exist for determining the optimal threshold, but perhaps the simplest is to find the threshold that maximises the difference between the true positive rate and the false positive rate (Youden's J statistic), as in the sketch below.
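A minimal sketch of that approach using scikit-learn, assuming `scores` are the pairwise match scores and `y_true` the binary labels (the names are illustrative):

```python
# Sketch: pick the threshold maximising TPR - FPR (Youden's J).
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true: np.ndarray, scores: np.ndarray) -> float:
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    return float(thresholds[np.argmax(tpr - fpr)])
```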
In the case of imbalanced data, which is likely when blocking still produces far more non-matching candidate pairs than matching ones, the precision-recall curve may be more informative. The optimal threshold could then be chosen to maximise the F1-score, the harmonic mean of precision and recall, as in the sketch below.
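A comparable sketch for the precision-recall approach, with the same assumed inputs:

```python
# Sketch: pick the threshold maximising F1 along the precision-recall curve.
import numpy as np
from sklearn.metrics import precision_recall_curve

def f1_threshold(y_true: np.ndarray, scores: np.ndarray) -> float:
    precision, recall, thresholds = precision_recall_curve(y_true, scores)
    # precision/recall have one more entry than thresholds; drop the last point.
    f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)
    return float(thresholds[np.argmax(f1)])
```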