Clean up duplicates from a recent DupChecker bug #4376
Comments
Wah wah. Why don't we have a unique constraint on that hash? Seems like the database could have prevented this...
Related to preventing further duplicates as seen on freelawproject#4376, due to changes introduced in freelawproject#4303 - Refactor tests for DupChecker.press_on method: replaces fixtures, loops and if clauses by explicit test objects and explicit press_on calls for each scenario
So, it is indeed not unique (it should have "UNIQUE CONSTRAINT").
Yeah, makes sense. Let's begin with fixing the dupes and then return here. When we add the unique constraint, I think we'll want a migration that adds a new index and then removes the old one. That way, if we have look-ups that are coming in during the migration, there will always be an index available.
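To illustrate the ordering suggested above (create the new unique index first, drop the old one only afterwards), here is a minimal sketch using Python's built-in `sqlite3` module. The table name `opinion`, the column `sha1`, and both index names are hypothetical stand-ins, not the project's actual schema, and the real migration would run against the production database engine, where a non-blocking build such as PostgreSQL's `CREATE UNIQUE INDEX CONCURRENTLY` would likely be preferable.

```python
import sqlite3


def swap_in_unique_index(conn, table="opinion", column="sha1",
                         old_index="opinion_sha1_idx",
                         new_index="opinion_sha1_uniq"):
    """Add the unique index before dropping the old one.

    All names are hypothetical. Because the new index exists before the
    old one is removed, look-ups on the column always have an index.
    """
    conn.execute(f'CREATE UNIQUE INDEX "{new_index}" ON "{table}" ("{column}")')
    conn.execute(f'DROP INDEX "{old_index}"')
    conn.commit()


def index_names(conn, table="opinion"):
    """Return the names of all indexes currently defined on the table."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'index' AND tbl_name = ?",
        (table,),
    ).fetchall()
    return {row[0] for row in rows}


# Toy database standing in for the real one (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE opinion (id INTEGER PRIMARY KEY, sha1 TEXT NOT NULL)")
conn.execute("CREATE INDEX opinion_sha1_idx ON opinion (sha1)")
conn.executemany("INSERT INTO opinion (sha1) VALUES (?)", [("aaa",), ("bbb",)])

swap_in_unique_index(conn)

# With the unique index in place, the database rejects a repeated hash.
try:
    conn.execute("INSERT INTO opinion (sha1) VALUES ('aaa')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

Note that the unique index can only be created once the existing duplicates are gone, which is why the cleanup has to come first.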
Here's an instance of this I just ran into for a California Supreme Court case published August 22, 2024. It's perhaps worth a closer look because it also indicates inconsistent citation parsing. Query: https://www.courtlistener.com/?q=Rattagan&type=o&order_by=score%20desc&stat_Published=on&court=cal 3 copies of the opinion:
Opinions 10049082 and 10050073 show 16 authorities, but opinion 10072000 shows only 7. Why would that be? Tagging in @flooie for the citation part.
@anseljh I had a chance to look into this, but I don’t have a definitive explanation for why one copy of the opinion shows more authorities than the others. It does seem odd at first glance. What I’ve noticed is that many of the supra citations aren’t showing up, which might explain some of the difference. It doesn’t seem to be a parsing issue, though; it looks more like a search-related problem. Nearly all (though not all) of the missing authorities are marked as citation no-link in the source code, which suggests the issue lies with search and citation discovery, not parsing. That said, there are still quite a lot of missing no-link citations. I’m wondering if we should consider highlighting found citations or making them visually distinct in some way, perhaps with an underline or another marker?
The PR that introduced the bug was merged on August 21, 2024, so scraped opinions have been duplicated with the same hash since that date, and will continue to be until a fix is merged (WIP here).
For example, all of these opinions have the same hash: 1, 2, 3
We will have to delete the duplicated opinions, their clusters, and related objects like citations.
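As a rough sketch of the first step of that cleanup, the snippet below groups opinions by hash and lists the ids that would be candidates for deletion. It assumes that the copy with the lowest id is the one to keep, which is only an illustrative rule; the function name, the `(id, hash)` tuples, and the sample data are all hypothetical. Deleting the associated clusters and citations is not shown.

```python
from collections import defaultdict


def find_duplicate_ids(opinions):
    """Map each duplicated hash to the opinion ids that could be deleted.

    `opinions` is an iterable of (opinion_id, sha1) pairs. For every hash
    that appears more than once, the lowest id is kept (an assumption made
    for this sketch) and the remaining ids are returned in sorted order.
    """
    ids_by_hash = defaultdict(list)
    for opinion_id, sha1 in opinions:
        ids_by_hash[sha1].append(opinion_id)

    to_delete = {}
    for sha1, ids in ids_by_hash.items():
        if len(ids) > 1:
            # Keep the first (lowest) id, flag the rest.
            to_delete[sha1] = sorted(ids)[1:]
    return to_delete


# Made-up sample rows: three copies share "hash-a", one opinion is unique.
sample = [(3, "hash-a"), (1, "hash-a"), (2, "hash-a"), (4, "hash-b")]
duplicates = find_duplicate_ids(sample)
```

In practice the same grouping would more likely be done in the database with a `GROUP BY` on the hash column, restricted to opinions created on or after the date the bug was merged.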