
feat(cl_back_scrape_citations): command to scrape citations #4303

Merged
merged 12 commits into from
Aug 22, 2024

Conversation

@grossir grossir commented Aug 13, 2024

  • Adds a new command to back scrape citations from courts that publish citations on the same website where we scrape their opinions
  • Adds tests for the new command
  • Refactors get_binary_content to be re-used by the new command and to reduce copy-pasted code
  • Creates a cl.scrapers.exceptions file to hold exceptions raised when ingesting a single case
  • Uses those exceptions to bubble errors up to the main loop, avoiding returned break/continue flags
  • Refactors DupChecker to raise errors
  • Refactors get_binary_content to raise errors
  • Refactors cl_scrape_oral_arguments to account for these changes
  • Adapts the DupChecker and ContentType tests to the changes
  • Refactors logger calls to use lazy formatting

Applies to a specific family of court sites described here:
freelawproject/juriscraper#858 (comment)
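The exception-based control flow described in the bullets above can be sketched roughly as follows. This is an illustrative sketch only: the class and function names (SingleCaseError, BadContentError, DuplicateError, ingest_case, scrape_site) are hypothetical placeholders, not CourtListener's actual identifiers.

```python
# Hypothetical sketch of bubbling per-case errors to the main scraper loop,
# instead of returning break/continue flags. All names here are illustrative.

class SingleCaseError(Exception):
    """Base class for errors raised while ingesting a single case."""

class BadContentError(SingleCaseError):
    """The downloaded binary content is missing or invalid."""

class DuplicateError(SingleCaseError):
    """The case is already in the database (content hash matched)."""

def ingest_case(case: dict, seen_hashes: set) -> str:
    """Ingest one case; raise instead of returning a status flag."""
    if not case.get("content"):
        raise BadContentError(f"empty content for {case['name']}")
    if case["hash"] in seen_hashes:
        raise DuplicateError(f"already ingested {case['name']}")
    seen_hashes.add(case["hash"])
    return case["name"]

def scrape_site(cases: list, dup_threshold: int = 2) -> list:
    """Main loop: exceptions bubble up here and map onto continue/break."""
    ingested, seen, consecutive_dups = [], set(), 0
    for case in cases:
        try:
            ingested.append(ingest_case(case, seen))
            consecutive_dups = 0
        except DuplicateError:
            # DupChecker-style behavior: too many duplicates in a row
            # means we have caught up, so abort the whole site scrape.
            consecutive_dups += 1
            if consecutive_dups >= dup_threshold:
                break
        except SingleCaseError:
            continue  # skip just this case, keep scraping the site
    return ingested
```

The advantage over flag returns is that intermediate helpers like get_binary_content and DupChecker no longer need to thread a status value back through every caller; the main loop alone decides whether an error means "skip this case" or "stop this site".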

grossir and others added 3 commits August 12, 2024 22:36
- Remove "method" argument
- Log errors instead of returning the error message
- Return the cleaned up content using site.cleanup_content
- Update tests
- Update opinions and oral arguments scraper caller to reflect changes
grossir commented Aug 15, 2024

We can test this by cloning into the local db a few md cases from 2022 for which we do not have citations, and then running the citation backscraper. We expect:

  1. that the cases that do exist get their citations ingested without duplicating the existing clusters
  2. that the cases that do not exist get backscraped into clusters together with their citations

Citations to pre-populate

Commands to run the test

manage.py clone_from_cl --type search.OpinionCluster --id 9355831 9303314

manage.py cl_back_scrape_citations --courts juriscraper.opinions.united_states.state.md --backscrape-start=2022 --backscrape-end=2022 --verbosity 3

Query to check everything went OK

This assumes you are running the query immediately after the scrapers; otherwise, change the interval to a larger value.

select * from search_citation where cluster_id in (
    select id from search_opinioncluster where docket_id in (
        select id from search_docket
        where court_id = 'md' and date_created > (now() - interval '6 hours')
    )
);

You should see something like the following. The cluster_id should be small for newly scraped documents and large for "updated" citations; in this case there is a single large cluster_id, since we only downloaded one cluster whose hash will match.

 id  | volume | reporter | page | type | cluster_id 
-----+--------+----------+------+------+------------
 368 |    482 | Md.      | 47   |    2 |        141
 349 |    482 | Md.      | 469  |    2 |        122
 365 |    482 | Md.      | 15   |    2 |        138
 352 |    482 | Md.      | 342  |    2 |        125
 358 |    482 | Md.      | 137  |    2 |        131
 362 |    482 | Md.      | 81   |    2 |        135
 360 |    482 | Md.      | 82   |    2 |        133
 359 |    482 | Md.      | 138  |    2 |        132
 363 |    482 | Md.      | 79   |    2 |        136
 354 |    482 | Md.      | 341  |    2 |        127
 351 |    482 | Md.      | 343  |    2 |        124
 367 |    482 | Md.      | 9    |    2 |        140
 366 |    482 | Md.      | 12   |    2 |        139
 361 |    482 | Md.      | 139  |    2 |        134
 350 |    482 | Md.      | 395  |    2 |        123
 357 |    482 | Md.      | 602  |    2 |        130
 356 |    482 | Md.      | 198  |    2 |        129
 355 |    482 | Md.      | 223  |    2 |        128
 364 |    482 | Md.      | 48   |    2 |        137
 353 |    482 | Md.      | 272  |    2 |        126
 348 |    482 | Md.      | 159  |    2 |    9303314

@grossir grossir marked this pull request as ready for review August 15, 2024 22:19
@grossir grossir requested a review from mlissner August 15, 2024 22:20
grossir commented Aug 15, 2024

@mlissner can you review this?

When writing the tests I realized that we don't have a backscraper for scotus_slip, so I tested this using md instead. I will implement that backscraper shortly.


@mlissner mlissner left a comment


Looks good. A few thoughts for you, but this will be very nice to have.

cl/scrapers/utils.py (review thread resolved)
cl/scrapers/tests.py (review thread resolved)
Reword docstrings, catch exceptions and refactor code following code review

@mlissner mlissner left a comment


Changes so far look good. Just need that last refactor.

- create cl.scrapers.exceptions file to hold exceptions raised when ingesting a single case
- use the exceptions to bubble errors up to the main loop, avoiding returned break/continue flags
- refactor DupChecker to raise errors
- refactor get_binary_content to raise errors
- refactor cl_scrape_oral_arguments to the new paradigm
- cl_back_scrape_citations can now re-scrape a single case without re-downloading the binary content or manipulating the site object
- adapt DupChecker and ContentType tests to the changes
- refactor logger calls to use lazy formatting
@grossir grossir requested a review from mlissner August 20, 2024 18:15
grossir commented Aug 20, 2024

@mlissner can you review this again? I updated the PR with the requested changes


@mlissner mlissner left a comment


A few thoughts for you. Looks like a nice refactor though.

@mlissner

Very nice.

I've updated to HEAD and set for auto-merge. Thank you!

@grossir grossir dismissed mlissner’s stale review August 22, 2024 14:10

I am using the "dismiss review" button, since I understand this PR has been approved, but @mlissner didn't clear the "changes requested" status, so merging is still blocked.

@mlissner mlissner merged commit e70bdb7 into freelawproject:main Aug 22, 2024
9 checks passed

sentry-io bot commented Aug 22, 2024

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

  • ‼️ SSLError: HTTPSConnectionPool(host='www.jud.ct.gov', port=443): Max retries exceeded with url: /external/su... cl.scrapers.utils in get_binary_content View Issue
  • ‼️ ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) cl.scrapers.utils in get_binary_content View Issue
  • ‼️ WriteTimeout cl.lib.microservice_utils in microservice View Issue
  • ‼️ WriteTimeout cl.lib.microservice_utils in microservice View Issue
  • ‼️ ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer')) cl.scrapers.utils in get_binary_content View Issue


grossir added a commit to grossir/courtlistener that referenced this pull request Sep 6, 2024
Related to preventing further duplicates, as seen in freelawproject#4376, due to changes introduced in freelawproject#4303

- Refactor tests for the DupChecker.press_on method: replaces fixtures, loops, and if clauses with explicit test objects and explicit press_on calls for each scenario