Replies: 3 comments
-
First of all, hi @lmores! Would love your thoughts and contributions here. Thanks for the review of mismo so far, it is great to know where people are coming from and what they need. I went on a very similar journey myself of dedupe -> splink -> here.

Re incremental clustering: first, have you considered whether this is actually the correct thing to do? Say you already have addresses A: "123 main street" and B: "132 main ave", and a third new record comes along, C: "123 main ave". This looks to me like all three records are actually instances of the true "123 main ave". So to be correct, you shouldn't just merge C with A or B; you actually need to reconsider the entirety of all the records you've deduped so far. A bummer in terms of performance, but it might be required for correctness, depending on how strict you want to be. Not saying you NEED to, but it's worth considering whether incremental is even the right approach.
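A tiny illustration of why this matters (not mismo code; token-set Jaccard similarity stands in for whatever comparison model you actually use):

```python
# Why a new record can force re-clustering: C matches both A and B,
# even though A and B were correctly judged distinct on their own.

def jaccard(a: str, b: str) -> float:
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

A = "123 main street"
B = "132 main ave"
C = "123 main ave"

print(jaccard(A, B))  # 0.2 -> below a 0.4 cutoff: A and B are NOT duplicates
print(jaccard(A, C))  # 0.5 -> C matches A
print(jaccard(B, C))  # 0.5 -> C matches B
# Under transitive closure, C links A and B into one cluster, so the
# earlier "A and B are distinct" decision has to be revisited.
```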
OK, there isn't really ANY algorithm that can do that in O(1) time that I'm aware of 😉. I think you are just trying to avoid O(N) time. Here is how I would do it (using "needle" to mean the new record and "haystack" to mean the old records):
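The core of the needle/haystack idea can be sketched in a few lines (this is a hypothetical stand-alone sketch, not mismo's API; the blocking key here is a toy):

```python
# Index the haystack once by a blocking key; each new needle is then
# compared only against its (hopefully small) candidate block, not
# against all N old records.
from collections import defaultdict

def blocking_key(address: str) -> str:
    # toy key: the first token (e.g. the street number)
    return address.split()[0]

haystack = ["123 main street", "132 main ave", "99 oak road"]

index = defaultdict(list)              # key -> record ids, built once
for i, rec in enumerate(haystack):
    index[blocking_key(rec)].append(i)

needle = "123 main ave"
candidates = index[blocking_key(needle)]   # only records sharing the key
print(candidates)  # [0] -> run the expensive comparison on record 0 only
```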
This is all fairly well supported in mismo and ibis already, though it's not documented in a how-to guide. It probably should be. Most of the parts are there, you just need to compose them together.
This isn't part of mismo yet. This is tricky because the details of how you want to merge records are so domain-specific. I could see mismo offering some basic strategies, like "pick the most common value", and maybe some plumbing to make them nicer to apply, but for the actual nitty-gritty logic I think you will always be on your own. Let me know how that helps!
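For concreteness, a "pick the most common value" merge might look something like this (a hypothetical sketch, not a mismo function):

```python
# Merge a cluster of records into one golden record by keeping, per
# field, the most common value. Ties fall back to first occurrence,
# which is how Counter.most_common orders equal counts.
from collections import Counter

def merge_records(records: list[dict]) -> dict:
    fields = records[0].keys()
    return {
        f: Counter(r[f] for r in records).most_common(1)[0][0]
        for f in fields
    }

cluster = [
    {"street": "123 main ave", "city": "Springfield"},
    {"street": "123 main ave", "city": "springfield"},
    {"street": "123 main street", "city": "Springfield"},
]
print(merge_records(cluster))
# {'street': '123 main ave', 'city': 'Springfield'}
```

The domain-specific part is everything this glosses over: normalization before counting, per-field tie-breaking rules, trusting some sources over others, and so on.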
-
Apologies for the delay, I usually work on the address-deduplication project only one day a week.
You are totally right, and achieving this behaviour is my final goal (if I ever get there). Let me sum up this scenario and extend it a bit.
Note that (in my experience) point 2 actually contains (at least) two distinct actions:
i. Recognizing that records
For performance reasons one may consider splitting
Deciding to merge existing clusters when a new record comes in may be the right choice in some cases (as the one-shot deduplication algorithm may establish the same result if it were fed
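The "a new record merges existing clusters" case can be illustrated with a minimal union-find (disjoint-set) sketch, not tied to any particular library:

```python
# Two clusters already exist. A new record that matches a member of
# each one transitively merges them into a single cluster.

parent: dict[str, str] = {}

def find(x: str) -> str:
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def union(a: str, b: str) -> None:
    parent[find(a)] = find(b)

A, B, C = "123 main street", "132 main ave", "123 main ave"
find(A)  # cluster {A}
find(B)  # cluster {B}

# The new record C matches both A and B...
union(C, A)
union(C, B)
# ...so the two pre-existing clusters are now one.
print(find(A) == find(B))  # True
```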
You are right, forgive my lack of precision.
Agree! Tomorrow I will try to describe in another comment what I have been experimenting with using dedupe and what I have achieved so far (although it is very similar to the solution you outlined in your comment).
-
Here is what I have been experimenting with to achieve some kind of incremental clustering using dedupe.
This is a very rough approach with probably endless downsides, one of them being that, although clusters may evolve over time, the sklearn classifier used internally by dedupe is never updated during phase 2.
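To make the two-phase shape concrete, here is a generic sketch (hypothetical names, not dedupe's actual API): phase 1 would fit a pairwise match scorer on the initial batch; in phase 2 the *frozen* scorer assigns each new record to the best existing cluster, or starts a new one.

```python
def score(a: str, b: str) -> float:
    # stand-in for the trained pairwise classifier (frozen after phase 1)
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

clusters: list[list[str]] = [["123 main street"], ["99 oak road"]]
THRESHOLD = 0.4

def add_record(rec: str) -> None:
    """Phase 2: attach rec to its best-matching cluster, if any."""
    best_i, best_s = None, 0.0
    for i, members in enumerate(clusters):
        s = max(score(rec, m) for m in members)
        if s > best_s:
            best_i, best_s = i, s
    if best_i is not None and best_s >= THRESHOLD:
        clusters[best_i].append(rec)   # join an existing cluster
    else:
        clusters.append([rec])         # start a new singleton cluster
    # note: score() itself is never retrained here -- exactly the
    # downside mentioned above.

add_record("123 main ave")   # joins the "123 main street" cluster
add_record("7 elm close")    # matches nothing -> new cluster
print(clusters)
```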
-
From @lmores in #36