Split merged ids if they lose orig ids #46
Conversation
No need for rebasing 👍
Force-pushed from 2918847 to b70543c
Also update tests and add some uncommitted parts of the obsolete link breaking method
Force-pushed from b70543c to 7635b22
Brian, I talked with James about this; he approves the method but doesn't have time to do a detailed code review. I'm sorry to ask you to look at this unfamiliar code, but if you could review it early next week so we can get this merged and the pipeline running again, that would be very helpful. Let me know if you'd rather go through it live.
👍
Closes #45
Closes #47
This PR does two things: it splits merged ids when they lose their orig ids, and it removes simhash from our linkage method.
The latter was the result of my descent into investigating what was going on with your spurious matches. The matches you found are not present in the simhash results, which was my original fear. As I looked into simhash again though, I realized that there are extremely few results found by simhash but not the rest of our method (374K links from 53K orig ids). I think as our vendor composition, vendor data, and linkage method (e.g. incorporating cross-dataset links from our vendors, updating the criteria we use to make a match) have evolved, simhash has become less relevant. I think it's just not worth it at this point between the obscure false positives it can introduce and the complexity it requires to incorporate.