Clean up duplicates from a recent DupChecker bug #4376

Open
grossir opened this issue Aug 29, 2024 · 5 comments

@grossir
Contributor

grossir commented Aug 29, 2024

The PR that introduced the bug was merged on August 21, 2024, so since that date we have been scraping duplicate opinions with the same hash, and will keep doing so until a fix is merged (WIP here).

For example, all of these opinions have the same hash: 1, 2, 3

We will have to delete the duplicated opinions, their clusters, and related objects like citations.
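
As a rough sketch of how the cleanup could be planned, the snippet below groups opinions by hash and splits each group into one row to keep and the rest to delete. It works on plain `(id, sha1)` tuples rather than the real models, the function name `plan_duplicate_cleanup` and the sample rows are made up for illustration, and keeping the lowest id in each group is an assumption, not a decision taken in this issue. Deleting the clusters and citations attached to the discarded opinions is not covered here.

```python
from collections import defaultdict


def plan_duplicate_cleanup(opinions):
    """Split (opinion_id, sha1) pairs into ids to keep and ids to delete.

    Assumption: within each group of equal hashes, the lowest id is the
    original scrape and is the one to keep.
    """
    ids_by_hash = defaultdict(list)
    for opinion_id, sha1 in opinions:
        ids_by_hash[sha1].append(opinion_id)

    keep, delete = [], []
    for ids in ids_by_hash.values():
        ids.sort()
        keep.append(ids[0])      # first (lowest) id survives
        delete.extend(ids[1:])   # every later id is a duplicate
    return sorted(keep), sorted(delete)


# Made-up rows, for illustration only
rows = [(1, "aaa"), (2, "bbb"), (3, "aaa"), (4, "aaa"), (5, "ccc")]
keep, delete = plan_duplicate_cleanup(rows)
print("keep:", keep)
print("delete:", delete)
```

Returning the two lists instead of deleting right away leaves room to review the plan before anything is removed.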

@mlissner
Member

Wah wah.

Why don't we have a unique constraint on that hash? Seems like the database could have prevented this...

grossir added a commit to grossir/courtlistener that referenced this issue Sep 6, 2024
Related to preventing further duplicates as seen on freelawproject#4376, due to changes introduced in freelawproject#4303

- Refactor tests for DupChecker.press_on method: replaces fixtures, loops and if clauses by explicit test objects and explicit press_on calls for each scenario
@grossir
Contributor Author

grossir commented Sep 6, 2024

Running `\d+ search_opinion` on the docker compose DB returns:

Indexes:
    "search_opinion_pkey" PRIMARY KEY, btree (id)
    "search_opinion_author_id_69e3caa8" btree (author_id)
    "search_opinion_cluster_id_09bd537a" btree (cluster_id)
    "search_opinion_date_created_76a4ddf9" btree (date_created)
    "search_opinion_date_modified_524fb7ff" btree (date_modified)
    "search_opinion_download_url_8428ad91" btree (download_url)
    "search_opinion_download_url_8428ad91_like" btree (download_url varchar_pattern_ops)
    "search_opinion_extracted_by_ocr_122ced11" btree (extracted_by_ocr)
    "search_opinion_local_path_8c124953" btree (local_path)
    "search_opinion_local_path_8c124953_like" btree (local_path varchar_pattern_ops)
    "search_opinion_sha1_62196033" btree (sha1)
    "search_opinion_sha1_62196033_like" btree (sha1 varchar_pattern_ops)
    "unique_opinion_ordering_key" UNIQUE CONSTRAINT, btree (cluster_id, ordering_key)

    "search_opinion_sha1_62196033" btree (sha1)

So it is indeed not unique (otherwise it would be listed as "UNIQUE CONSTRAINT").
The index must be dropped and then built again, since there is no way to add the UNIQUE constraint via ALTER INDEX.
The duplicates must be corrected before re-creating it, though.
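
To illustrate why the cleanup has to come first, here is a minimal sketch using an in-memory SQLite database as a stand-in for the real Postgres table. The table is reduced to two columns, the rows are made up, and the index name `search_opinion_sha1_uniq` is hypothetical. Building the unique index fails while duplicate hashes exist and succeeds once they are gone.

```python
import sqlite3

# In-memory stand-in for the real table; only the columns needed here
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_opinion (id INTEGER PRIMARY KEY, sha1 TEXT)")
conn.executemany(
    "INSERT INTO search_opinion (id, sha1) VALUES (?, ?)",
    [(1, "aaa"), (2, "bbb"), (3, "aaa")],  # ids 1 and 3 share a hash
)


def try_create_unique_index(connection):
    """Return True if the unique index could be built, False otherwise."""
    try:
        connection.execute(
            "CREATE UNIQUE INDEX search_opinion_sha1_uniq "
            "ON search_opinion (sha1)"
        )
        return True
    except sqlite3.DatabaseError:
        # Constraint violations surface as IntegrityError, a subclass of this
        return False


created_with_dupes = try_create_unique_index(conn)

# Keep only the lowest id per hash (same assumption as any manual cleanup)
conn.execute(
    "DELETE FROM search_opinion "
    "WHERE id NOT IN (SELECT MIN(id) FROM search_opinion GROUP BY sha1)"
)
created_after_cleanup = try_create_unique_index(conn)
remaining = conn.execute("SELECT COUNT(*) FROM search_opinion").fetchone()[0]

print(created_with_dupes, created_after_cleanup, remaining)
```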

@mlissner
Member

mlissner commented Sep 6, 2024

Yeah, makes sense. Let's begin with fixing the dupes and then return here.

When we add the unique constraint, I think we'll want a migration that adds a new index and then removes the old one. That way, if we have look-ups that are coming in during the migration, there will always be an index available.
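
A sketch of that ordering, written as a small helper that only builds the SQL statements rather than running them. The helper name and the new index name `search_opinion_sha1_uniq` are hypothetical; the old index name is the one from the `\d+` output above. Using `CONCURRENTLY` is my assumption, not something decided in this thread. Postgres does not allow it inside a transaction block, so a Django migration running these statements would need `atomic = False`.

```python
def unique_index_swap_statements(table, column, old_index, new_index):
    """Return the SQL statements in the order they should run.

    The new unique index is created first so that an index on the column
    exists at every point; only then is the old one dropped.
    """
    return [
        f'CREATE UNIQUE INDEX CONCURRENTLY "{new_index}" ON "{table}" ("{column}")',
        f'DROP INDEX CONCURRENTLY "{old_index}"',
    ]


statements = unique_index_swap_statements(
    table="search_opinion",
    column="sha1",
    old_index="search_opinion_sha1_62196033",
    new_index="search_opinion_sha1_uniq",  # hypothetical name
)
for statement in statements:
    print(statement + ";")
```

The `_like` index (`varchar_pattern_ops`) is left alone in this sketch, since it serves pattern look-ups rather than equality checks.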

@anseljh
Member

anseljh commented Oct 10, 2024

Here's an instance of this I just ran into for a California Supreme Court case published August 22, 2024. It's perhaps worth a closer look because it also indicates inconsistent citation parsing.

Query: https://www.courtlistener.com/?q=Rattagan&type=o&order_by=score%20desc&stat_Published=on&court=cal

3 copies of the opinion:

Opinions 10049082 and 10050073 show 16 authorities, but opinion 10072000 shows only 7. Why would that be?

Tagging in @flooie for the citation part.

@flooie
Contributor

flooie commented Oct 10, 2024

@anseljh I had a chance to look into this, but I don’t have a definitive explanation for why one copy of the opinion shows more authorities than another. It does seem odd at first glance. What I’ve noticed is that many of the supra citations aren’t showing up, which might explain some of the differences.

It doesn’t seem to be a parsing issue, though; it looks more like a search-related problem. Nearly all (though not all) of the missing authorities are actually marked as citation no-link in the source code. This suggests the issue lies with search and citation discovery, not parsing. That said, there are still quite a lot of missing no-link citations.

I’m wondering if we should consider highlighting found citations or making them visually distinct in some way—perhaps by using an underline or another marker?
