Split merged ids if they lose orig ids #46

Merged
4 commits merged into master on Oct 3, 2024

Conversation

@jmelot (Member) commented Aug 14, 2024

Closes #45
Closes #47

This PR does two things:

  • Discards merged ids that lose an article, forcing new merged ids to be created (a rough sketch of this rule follows below). This resolves all the cases you originally identified. It also removes about 4.2% of current merged ids, which is higher than I expected. Note that this method can be a little overzealous: it doesn't distinguish between ids that are dropped because a vendor removed them and ids that still exist in the corpus but are now part of another match set. I think this is the right call because it will force a rerun of, e.g., predictions over such articles, whose metadata may change if they lose a merged id, regardless of downstream implementation details. But I want your take.
  • Removes simhash

The latter came out of my investigation into your spurious matches. My original fear was that the matches you found came from simhash, but they are not present in the simhash results. As I looked into simhash again, though, I realized that there are extremely few results found by simhash but not by the rest of our method (374K links from 53K orig ids). I think that as our vendor composition, vendor data, and linkage method have evolved (e.g. incorporating cross-dataset links from our vendors, updating the criteria we use to make a match), simhash has become less relevant. Between the obscure false positives it can introduce and the complexity required to incorporate it, I don't think it's worth keeping at this point.
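
Returning to the first change: here is a rough, hypothetical sketch of the discard rule (the function name, argument names, and `merged_{n}` id format are illustrative, not the actual create_merge_ids.py code). A previous merged id is reused only if none of its orig ids were lost; otherwise it is dropped and a new merged id is minted for the match set.

```python
# Hypothetical sketch only -- not the actual utils/create_merge_ids.py code.
# Rule: reuse an existing merged id only if it has not lost any of its orig
# ids; otherwise discard it so a fresh merged id is minted for the match set.

def assign_merged_ids(prev_assignment, current_match_sets, next_id=0):
    """
    prev_assignment: dict of merged_id -> set of orig ids from the previous run
    current_match_sets: iterable of sets of orig ids from the current run
    next_id: first unused counter for newly minted ids (format is illustrative)
    """
    new_assignment = {}
    used_old_ids = set()
    for match_set in current_match_sets:
        reused = None
        for merged_id, old_orig_ids in prev_assignment.items():
            # Reuse only if the old id's full membership survives intact in
            # this match set (i.e. it gained articles or stayed the same, but
            # lost nothing). First qualifying old id wins in this sketch.
            if merged_id not in used_old_ids and old_orig_ids <= match_set:
                reused = merged_id
                break
        if reused is not None:
            used_old_ids.add(reused)
            new_assignment[reused] = match_set
        else:
            # The previous merged id (if any) lost an article, so mint a new
            # id; downstream outputs (e.g. predictions) get recomputed.
            new_assignment[f"merged_{next_id:010d}"] = match_set
            next_id += 1
    return new_assignment
```

For example, if a previous merged id covered orig ids {x, y, z} and the current run splits them into {x, y} and {z}, that merged id is discarded and both new sets receive fresh merged ids.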

github-actions bot commented Aug 14, 2024

No need for rebasing 👍
behind_count is 0
ahead_count is 4

github-actions bot commented Aug 14, 2024

☂️ Python Coverage

current status: ✅

Overall Coverage

Lines  Covered  Coverage  Threshold  Status
299    247      83%       0%         🟢

New Files

No new covered files...

Modified Files

File                            Coverage  Status
tests/test_create_merge_ids.py  100%      🟢
utils/create_merge_ids.py       82%       🟢
TOTAL                           91%       🟢

updated for commit: c132c82 by action🐍

Also update tests and add some uncommitted parts of the obsolete link breaking method
@jmelot marked this pull request as ready for review September 25, 2024 14:46
@jmelot requested review from jamesdunham and brianlove and removed request for jamesdunham September 25, 2024 14:47
@jmelot (Member, Author) commented Sep 27, 2024

Brian, I talked with James about this and he approves the method but doesn't have time to do a detailed code review. I'm sorry to ask you to review unfamiliar code, but if you could take a look early next week so we can get this merged and the pipeline running again, that would be very helpful. Let me know if you'd rather go through it live.

linkage_dag.py: review thread (resolved)
@brianlove (Contributor) left a comment

👍

@brianlove merged commit fe795c0 into master Oct 3, 2024
10 checks passed
Successfully merging this pull request may close these issues:

  • Speed up id generation stage
  • FP article links for review