Clean up duplicates from a recent DupChecker bug #4376

grossir · 2024-08-29T16:20:59Z

The PR that introduced the bug was merged on August 21, 2024, so we have been scraping duplicate opinions with the same hash since that date, and will keep doing so until a fix is merged (WIP here).

For example, all of these opinions have the same hash 1, 2, 3

We will have to delete the duplicated opinions, their clusters, and related objects such as citations.
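A minimal sketch of how the duplicated rows could be identified before deleting anything. It assumes each opinion exposes an `id`, a `sha1`, and a `date_created`, and it assumes the policy is to keep the oldest row per hash; the issue does not say which copy should survive. `OpinionRow` and `find_duplicate_ids` are hypothetical names used for illustration, not CourtListener code.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class OpinionRow:
    """Hypothetical stand-in for a row of search_opinion."""
    id: int
    sha1: str
    date_created: datetime


def find_duplicate_ids(rows, since):
    """Return ids of rows that repeat the sha1 of an older row.

    Assumption: the oldest row per hash is kept, and only rows created
    on or after `since` (the merge date of the buggy PR) are flagged.
    """
    by_hash = defaultdict(list)
    for row in rows:
        by_hash[row.sha1].append(row)

    duplicates = []
    for group in by_hash.values():
        # Oldest first; ties are broken by id so the result is deterministic.
        group.sort(key=lambda r: (r.date_created, r.id))
        for row in group[1:]:
            if row.date_created >= since:
                duplicates.append(row.id)
    return sorted(duplicates)


rows = [
    OpinionRow(1, "aaa", datetime(2024, 8, 20)),
    OpinionRow(2, "aaa", datetime(2024, 8, 22)),
    OpinionRow(3, "aaa", datetime(2024, 8, 23)),
    OpinionRow(4, "bbb", datetime(2024, 8, 25)),
]
print(find_duplicate_ids(rows, since=datetime(2024, 8, 21)))
```

The flagged ids are only candidates: the cluster and citation rows attached to each one still have to be removed along with the opinion itself.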

mlissner · 2024-08-29T16:37:10Z

Wah wah.

Why don't we have a unique constraint on that hash? Seems like the database could have prevented this...

Related to preventing further duplicates as seen on freelawproject#4376, due to changes introduced in freelawproject#4303 - Refactor tests for DupChecker.press_on method: replaces fixtures, loops and if clauses by explicit test objects and explicit press_on calls for each scenario

grossir · 2024-09-06T00:31:39Z

\d+ search_opinion on the docker compose DB returns:

Indexes:
    "search_opinion_pkey" PRIMARY KEY, btree (id)
    "search_opinion_author_id_69e3caa8" btree (author_id)
    "search_opinion_cluster_id_09bd537a" btree (cluster_id)
    "search_opinion_date_created_76a4ddf9" btree (date_created)
    "search_opinion_date_modified_524fb7ff" btree (date_modified)
    "search_opinion_download_url_8428ad91" btree (download_url)
    "search_opinion_download_url_8428ad91_like" btree (download_url varchar_pattern_ops)
    "search_opinion_extracted_by_ocr_122ced11" btree (extracted_by_ocr)
    "search_opinion_local_path_8c124953" btree (local_path)
    "search_opinion_local_path_8c124953_like" btree (local_path varchar_pattern_ops)
    "search_opinion_sha1_62196033" btree (sha1)
    "search_opinion_sha1_62196033_like" btree (sha1 varchar_pattern_ops)
    "unique_opinion_ordering_key" UNIQUE CONSTRAINT, btree (cluster_id, ordering_key)

"search_opinion_sha1_62196033" btree (sha1)

So it is indeed not unique (otherwise it would show "UNIQUE CONSTRAINT").
The index must be dropped and then built again, since there is no way to add the UNIQUE constraint via ALTER INDEX.
But the duplicates must be corrected before re-creating it.
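To illustrate why the duplicates have to go first, here is a small sketch using Python's built-in `sqlite3` as a stand-in for the real database. It is only an analogy: the table is reduced to two columns, the index name `search_opinion_sha1_uniq` is made up, and the "keep the lowest id" cleanup is an assumption. PostgreSQL likewise refuses to build a unique index while duplicate keys exist.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_opinion (id INTEGER PRIMARY KEY, sha1 TEXT)")
conn.executemany(
    "INSERT INTO search_opinion (id, sha1) VALUES (?, ?)",
    [(1, "aaa"), (2, "aaa"), (3, "bbb")],
)


def try_unique_index(connection):
    """Attempt to create the unique index; return True on success."""
    try:
        connection.execute(
            "CREATE UNIQUE INDEX search_opinion_sha1_uniq "
            "ON search_opinion (sha1)"
        )
        return True
    except sqlite3.DatabaseError:
        # Raised because two rows share the same sha1.
        return False


# First attempt fails while ids 1 and 2 share a hash.
blocked = not try_unique_index(conn)

# Assumed cleanup policy: keep the lowest id for each hash.
conn.execute(
    "DELETE FROM search_opinion WHERE id NOT IN "
    "(SELECT MIN(id) FROM search_opinion GROUP BY sha1)"
)

# Second attempt succeeds once every sha1 is unique.
created = try_unique_index(conn)
remaining = conn.execute("SELECT COUNT(*) FROM search_opinion").fetchone()[0]
print(blocked, created, remaining)
```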

mlissner · 2024-09-06T18:41:29Z

Yeah, makes sense. Let's begin with fixing the dupes and then return here.

When we add the unique constraint, I think we'll want a migration that adds a new index and then removes the old one. That way, if we have look-ups that are coming in during the migration, there will always be an index available.
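The add-then-remove ordering could look roughly like the sketch below, which only builds the two PostgreSQL statements in the order they would run. The helper `swap_index_statements` and the new index name are hypothetical, and using `CONCURRENTLY` is an added assumption to avoid blocking writes; the comment above only asks that a usable index exist at every point.

```python
def swap_index_statements(table, column, old_index, new_index):
    """Return SQL that builds the unique index before dropping the old one.

    Because the new index is created first, look-ups on `column` can use
    an index at every step of the migration.
    """
    return [
        f'CREATE UNIQUE INDEX CONCURRENTLY "{new_index}" ON "{table}" ("{column}");',
        f'DROP INDEX CONCURRENTLY "{old_index}";',
    ]


statements = swap_index_statements(
    table="search_opinion",
    column="sha1",
    old_index="search_opinion_sha1_62196033",
    new_index="search_opinion_sha1_uniq",  # hypothetical name
)
for statement in statements:
    print(statement)
```

Note that PostgreSQL does not allow `CREATE INDEX CONCURRENTLY` inside a transaction block, so a Django migration running these statements would need `atomic = False`.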

anseljh · 2024-10-10T03:11:11Z

Here's an instance of this I just ran into for a California Supreme Court case published August 22, 2024. It's perhaps worth a closer look because it also indicates inconsistent citation parsing.

Query: https://www.courtlistener.com/?q=Rattagan&type=o&order_by=score%20desc&stat_Published=on&court=cal

3 copies of the opinion:

Opinions 10049082 and 10050073 show 16 authorities, but opinion 10072000 shows only 7. Why would that be?

Tagging in @flooie for the citation part.

flooie · 2024-10-10T16:38:58Z

@anseljh I had a chance to look into this, but I don’t have a definitive explanation for why one citation has more references than the other. It does seem odd at first glance. What I’ve noticed is that many of the missing supras aren’t showing up, which might explain some of the differences.

It doesn’t seem to be a parsing issue, though; it looks more like a search-related problem. Nearly all (though not all) of the missing authorities are actually marked as citation no-link in the source code. This suggests the issue lies with search and citation discovery, not parsing. That said, there are still quite a lot of missing no-link citations.

I’m wondering if we should consider highlighting found citations or making them visually distinct in some way—perhaps by using an underline or another marker?

grossir mentioned this issue Sep 6, 2024

tests(scrapers): refactor tests for DupChecker.press_on #4425

Merged
