Replies: 3 comments
-
First of all, hi @lmores! Would love your thoughts and contributions here. Thanks for the review of mismo so far, it is great to know where people are coming from and what they need. I went on a very similar journey myself of dedupe -> splink -> here.

Re incremental clustering: first, have you considered whether this is actually the correct thing to do? Say you already have addresses A: "123 main street" and B: "132 main ave", and a third new record comes along, C: "123 main ave". This looks to me like all three records are actually instances of the true "123 main ave". So to be correct, you shouldn't just merge C with A or B; you actually need to reconsider the entirety of all the records you've deduped so far. A bummer in terms of performance, but it might be required for correctness, depending on how strict you want to be. Not saying you NEED to, but it's worth considering whether incremental is even the right approach.
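A tiny illustration of why this matters (not mismo code; token-set Jaccard similarity stands in for whatever comparison model you actually use):

```python
# Why a new record can force re-clustering: C matches both A and B,
# even though A and B were correctly judged distinct on their own.

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

A = "123 main street"
B = "132 main ave"
C = "123 main ave"

print(jaccard(A, B))  # 0.2 -> below a 0.4 cutoff: A and B are NOT duplicates
print(jaccard(A, C))  # 0.5 -> C matches A
print(jaccard(B, C))  # 0.5 -> C matches B
# Under transitive closure, C links A and B into one cluster, so the
# earlier "A and B are distinct" decision has to be revisited.
```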
OK, there isn't really ANY algorithm that can do that in O(1) time that I'm aware of 😉. I think you are just trying to avoid O(N) time. Here is how I would do it (using "needle" to mean the new record and "haystack" to mean the old records):
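The core of the needle/haystack idea can be sketched in a few lines (this is a hypothetical stand-alone sketch, not mismo's API; the blocking key here is a toy):

```python
# Index the haystack once by a blocking key; each new needle is then
# compared only against its (hopefully small) candidate block, not
# against all N old records.
from collections import defaultdict

def blocking_key(address: str) -> str:
    # toy key: the first token (e.g. the street number)
    return address.split()[0]

haystack = ["123 main street", "132 main ave", "99 oak road"]

index = defaultdict(list)              # key -> record ids, built once
for i, rec in enumerate(haystack):
    index[blocking_key(rec)].append(i)

needle = "123 main ave"
candidates = index[blocking_key(needle)]   # only records sharing the key
print(candidates)  # [0] -> run the expensive comparison on record 0 only
```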
This is all fairly well supported in mismo and ibis already, though it's not documented in a how-to guide. It probably should be. Most of the parts are there, you just need to compose them together.
This isn't part of mismo yet. This is tricky because the details of how you want to merge records are so domain-specific. I could see mismo offering some basic strategies, like "pick the most common value", and maybe some plumbing to make them nicer to apply, but for the actual nitty-gritty logic I think you will always be on your own. Let me know how that helps!
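For concreteness, a "pick the most common value" merge might look something like this (a hypothetical sketch, not a mismo function):

```python
# Merge a cluster of records into one golden record by keeping, per
# field, the most common value. Ties fall back to first occurrence,
# which is how Counter.most_common orders equal counts.
from collections import Counter

def merge_records(records: list[dict]) -> dict:
    fields = records[0].keys()
    return {
        f: Counter(r[f] for r in records).most_common(1)[0][0]
        for f in fields
    }

cluster = [
    {"street": "123 main ave", "city": "Springfield"},
    {"street": "123 main ave", "city": "springfield"},
    {"street": "123 main street", "city": "Springfield"},
]
print(merge_records(cluster))
# {'street': '123 main ave', 'city': 'Springfield'}
```

The domain-specific part is everything this glosses over: normalization before counting, per-field tie-breaking rules, trusting some sources over others, and so on.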
-
Apologies for the delay, I usually work on the address-deduplication project only one day a week.
You are totally right, and achieving this behaviour is my final goal (if I ever get there). Let me sum up this scenario and extend it a bit.
Note that (in my experience) point 2 actually contains (at least) two distinct actions:
i. Recognizing that records
For performance reasons one may consider splitting
Deciding to merge existing clusters when a new record comes in may be the right choice in some cases (as the one-shot deduplication algorithm may establish the same result if it were fed
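The "a new record merges existing clusters" case can be illustrated with a minimal union-find (disjoint-set) sketch, not tied to any particular library:

```python
# Two clusters already exist. A new record that matches a member of
# each one transitively merges them into a single cluster.

parent: dict[str, str] = {}

def find(x: str) -> str:
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a: str, b: str) -> None:
    parent[find(a)] = find(b)

A, B, C = "123 main street", "132 main ave", "123 main ave"
find(A)  # cluster {A}
find(B)  # cluster {B}

# The new record C matches both A and B...
union(C, A)
union(C, B)
# ...so the two pre-existing clusters are now one.
print(find(A) == find(B))  # True
```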
You are right, forgive my lack of precision.
Agree! Tomorrow I will try to describe in another comment what I have been experimenting with using dedupe and what I have achieved so far (although it is very similar to the solution you outlined in your comment).
-
Here is what I have been experimenting with to achieve some kind of incremental clustering using dedupe.
This is a very rough approach with probably endless downsides, one of them being that, although clusters may evolve over time, the sklearn classifier used internally by dedupe is never updated during phase 2.
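To make the two-phase shape concrete, here is a generic sketch (hypothetical names, not dedupe's actual API): phase 1 would fit a pairwise match scorer on the initial batch; in phase 2 the *frozen* scorer assigns each new record to the best existing cluster, or starts a new one.

```python
def score(a: str, b: str) -> float:
    # stand-in for the trained pairwise classifier (frozen after phase 1)
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

clusters: list[list[str]] = [["123 main street"], ["99 oak road"]]
THRESHOLD = 0.4

def add_record(rec: str) -> None:
    """Phase 2: attach rec to its best-matching cluster, if any."""
    best_i, best_s = None, 0.0
    for i, members in enumerate(clusters):
        s = max(score(rec, m) for m in members)
        if s > best_s:
            best_i, best_s = i, s
    if best_i is not None and best_s >= THRESHOLD:
        clusters[best_i].append(rec)   # join an existing cluster
    else:
        clusters.append([rec])         # start a new singleton cluster
    # note: score() itself is never retrained here -- exactly the
    # downside mentioned above.

add_record("123 main ave")   # joins the "123 main street" cluster
add_record("7 elm close")    # matches nothing -> new cluster
print(clusters)
```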
-
From @lmores in #36