
feat(cl_back_scrape_citations): command to scrape citations #4303

Merged
merged 12 commits into from
Aug 22, 2024

Conversation

@grossir grossir commented Aug 13, 2024

  • Adds a new command to back scrape citations from courts that publish citations on the same website where we scrape their opinions
  • Adds tests for the new command
  • Refactors get_binary_content to be re-used by the new command and to reduce copy-pasted code
  • Creates a cl.scrapers.exceptions file to hold exceptions raised when ingesting a single case
  • Uses those exceptions to bubble errors up to the main loop, avoiding returned break/continue flags
  • Refactors DupChecker to raise errors
  • Refactors get_binary_content to raise errors
  • Refactors cl_scrape_oral_arguments to account for these changes
  • Adapts the DupChecker and ContentType tests to the changes
  • Refactors logger calls to use lazy formatting

Applies to a specific family of court sites described here:
freelawproject/juriscraper#858 (comment)
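The exception-based control flow described in the bullets above can be sketched roughly as follows. This is an illustrative sketch only: the class and function names (SingleCaseError, BadContentError, DuplicateError, ingest_case, scrape_site) are hypothetical placeholders, not CourtListener's actual identifiers.

```python
# Hypothetical sketch of bubbling per-case errors to the main scraper loop,
# instead of returning break/continue flags. All names here are illustrative.

class SingleCaseError(Exception):
    """Base class for errors raised while ingesting a single case."""

class BadContentError(SingleCaseError):
    """The downloaded binary content is missing or invalid."""

class DuplicateError(SingleCaseError):
    """The case is already in the database (content hash matched)."""

def ingest_case(case: dict, seen_hashes: set) -> str:
    """Ingest one case; raise instead of returning a status flag."""
    if not case.get("content"):
        raise BadContentError(f"empty content for {case['name']}")
    if case["hash"] in seen_hashes:
        raise DuplicateError(f"already ingested {case['name']}")
    seen_hashes.add(case["hash"])
    return case["name"]

def scrape_site(cases: list, dup_threshold: int = 2) -> list:
    """Main loop: exceptions bubble up here and map onto continue/break."""
    ingested, seen, consecutive_dups = [], set(), 0
    for case in cases:
        try:
            ingested.append(ingest_case(case, seen))
            consecutive_dups = 0
        except DuplicateError:
            # DupChecker-style behavior: too many duplicates in a row
            # means we have caught up, so abort the whole site scrape.
            consecutive_dups += 1
            if consecutive_dups >= dup_threshold:
                break
        except SingleCaseError:
            continue  # skip just this case, keep scraping the site
    return ingested
```

The advantage over flag returns is that intermediate helpers like get_binary_content and DupChecker no longer need to thread a status value back through every caller; the main loop alone decides whether an error means "skip this case" or "stop this site".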

grossir and others added 3 commits August 12, 2024 22:36
- Remove "method" argument
- Log errors instead of returning the error message
- Return the cleaned up content using site.cleanup_content
- Update tests
- Update opinions and oral arguments scraper caller to reflect changes
grossir commented Aug 15, 2024

We can test this by cloning into the local db a few md cases from 2022 for which we do not have citations, and then running the citation backscraper. We expect:

  1. that the cases that do exist get their citations ingested without duplicating the existing clusters
  2. that the cases that do not exist get backscraped into clusters together with their citations

Citations to pre-populate

Commands to run the test

manage.py clone_from_cl --type search.OpinionCluster --id 9355831 9303314

manage.py cl_back_scrape_citations --courts juriscraper.opinions.united_states.state.md --backscrape-start=2022 --backscrape-end=2022 --verbosity 3

Query to check everything went OK

This assumes you are running the query immediately after the scrapers; otherwise, change the interval to a larger value.

select * from search_citation where cluster_id in (
    select id from search_opinioncluster where docket_id in (
        select id from search_docket
        where court_id = 'md' and date_created > (now() - interval '6 hours')
    )
);

You should see something like the following. The cluster_id should be small for newly scraped documents and large for "updated" citations; in this case there is a single large cluster_id, since we only downloaded one cluster whose hash will match.

 id  | volume | reporter | page | type | cluster_id 
-----+--------+----------+------+------+------------
 368 |    482 | Md.      | 47   |    2 |        141
 349 |    482 | Md.      | 469  |    2 |        122
 365 |    482 | Md.      | 15   |    2 |        138
 352 |    482 | Md.      | 342  |    2 |        125
 358 |    482 | Md.      | 137  |    2 |        131
 362 |    482 | Md.      | 81   |    2 |        135
 360 |    482 | Md.      | 82   |    2 |        133
 359 |    482 | Md.      | 138  |    2 |        132
 363 |    482 | Md.      | 79   |    2 |        136
 354 |    482 | Md.      | 341  |    2 |        127
 351 |    482 | Md.      | 343  |    2 |        124
 367 |    482 | Md.      | 9    |    2 |        140
 366 |    482 | Md.      | 12   |    2 |        139
 361 |    482 | Md.      | 139  |    2 |        134
 350 |    482 | Md.      | 395  |    2 |        123
 357 |    482 | Md.      | 602  |    2 |        130
 356 |    482 | Md.      | 198  |    2 |        129
 355 |    482 | Md.      | 223  |    2 |        128
 364 |    482 | Md.      | 48   |    2 |        137
 353 |    482 | Md.      | 272  |    2 |        126
 348 |    482 | Md.      | 159  |    2 |    9303314

@grossir grossir marked this pull request as ready for review August 15, 2024 22:19
@grossir grossir requested a review from mlissner August 15, 2024 22:20
grossir commented Aug 15, 2024

@mlissner can you review this?

When writing the tests I realized that we don't have a backscraper for scotus_slip, so I tested this using md instead. I will implement that backscraper shortly.


@mlissner mlissner left a comment


Looks good. A few thoughts for you, but this will be very nice to have.

cl/scrapers/utils.py (review thread resolved)
cl/scrapers/tests.py (review thread resolved)
Reword docstrings, catch exceptions and refactor code following code review

@mlissner mlissner left a comment


Changes so far look good. Just need that last refactor.

- create cl.scrapers.exceptions file to hold exceptions raised when ingesting a single case
- use the exceptions to bubble errors up to the main loop, avoiding returned break/continue flags
- refactor DupChecker to raise errors
- refactor get_binary_content to raise errors
- refactor cl_scrape_oral_arguments to the new paradigm
- cl_back_scrape_citations can now re-scrape a single case without re-downloading the binary content or manipulating the site object
- adapt DupChecker and ContentType tests to the changes
- refactor logger calls to use lazy formatting
@grossir grossir requested a review from mlissner August 20, 2024 18:15
grossir commented Aug 20, 2024

@mlissner can you review this again? I updated the PR with the requested changes


@mlissner mlissner left a comment


A few thoughts for you. Looks like a nice refactor though.

@mlissner

Very nice.

I've updated to HEAD and set for auto-merge. Thank you!

@grossir grossir dismissed mlissner’s stale review August 22, 2024 14:10

I am using the "dismiss review" button, since I understand this PR has been approved, but @mlissner didn't clear the "changes requested" status, so merging is still blocked.

@mlissner mlissner merged commit e70bdb7 into freelawproject:main Aug 22, 2024
9 checks passed

sentry-io bot commented Aug 22, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ SSLError: HTTPSConnectionPool(host='www.jud.ct.gov', port=443): Max retries exceeded with url: /external/su... cl.scrapers.utils in get_binary_content View Issue
  • ‼️ ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) cl.scrapers.utils in get_binary_content View Issue
  • ‼️ WriteTimeout cl.lib.microservice_utils in microservice View Issue
  • ‼️ WriteTimeout cl.lib.microservice_utils in microservice View Issue
  • ‼️ ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) cl.scrapers.utils in get_binary_content View Issue


grossir added a commit to grossir/courtlistener that referenced this pull request Sep 6, 2024
Related to preventing further duplicates, as seen in freelawproject#4376, due to changes introduced in freelawproject#4303

- Refactor tests for the DupChecker.press_on method: replaces fixtures, loops, and if clauses with explicit test objects and explicit press_on calls for each scenario