Generate crosswalk and import cap pdf #4442

jtmst · 2024-09-11T14:35:36Z

Import Harvard Case Law Access Project (CAP) PDFs to CourtListener

Issues Addressed:

Changes Implemented:

Created a management command to generate crosswalk files between CAP and CL data.
Developed a management command to import CAP PDFs to CL using the generated crosswalk.
Implemented S3 storage integration for PDF storage (configurable for local storage in development).
Added error handling and logging for better debugging and monitoring.
Added settings for the CAP R2 environment variables.
Added tests for both commands.

Testing Instructions:

Prerequisites:

Ensure you have the necessary environment variables set for R2 and S3 access.
For local testing, configure the storage to use the local file system instead of S3.
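
One way to make that switch explicit is to resolve the storage backend from an environment flag. The sketch below is illustrative only: the `DEVELOPMENT` variable name and the `pick_pdf_storage_backend` helper are assumptions, not names taken from this PR. The two dotted paths are the standard file system backend in Django and the S3 backend in django-storages.

```python
import os

# Dotted paths of the two storage backends.
LOCAL_BACKEND = "django.core.files.storage.FileSystemStorage"
S3_BACKEND = "storages.backends.s3boto3.S3Boto3Storage"


def pick_pdf_storage_backend(env=None):
    """Return the dotted path of the storage backend to use.

    Hypothetical helper: DEVELOPMENT set to 1/true/yes selects local
    storage; anything else (including unset) selects S3.
    """
    env = os.environ if env is None else env
    flag = env.get("DEVELOPMENT", "").strip().lower()
    return LOCAL_BACKEND if flag in {"1", "true", "yes"} else S3_BACKEND
```

Passing the environment as a mapping keeps the helper easy to exercise in tests without touching the real process environment.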

Steps:

Generate the crosswalk:

 docker exec -it cl-django python /opt/courtlistener/manage.py generate_cap_crosswalk

This command will create crosswalk files in cl/search/crosswalks/.

Import CAP PDFs:
```
 docker exec -it cl-django python /opt/courtlistener/manage.py import_harvard_pdfs
```
This command will use the generated crosswalk to fetch and store PDFs.

To test with specific parameters:

 docker exec -it cl-django python /opt/courtlistener/manage.py generate_cap_crosswalk --reporter "A.2d" --volume 100

Verification:

Check the cl/search/crosswalks/ directory for generated crosswalk files.
Verify that PDFs are stored either in S3 or local storage (based on configuration).
Examine the logs for any errors or warnings during the process.
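
For the first check, a small script can summarize what was generated. This is a sketch under an assumption the PR description does not confirm: that each crosswalk file is a `.json` file holding a list of entries. `summarize_crosswalks` is a hypothetical helper, not part of the PR.

```python
import json
from pathlib import Path


def summarize_crosswalks(directory):
    """Map each crosswalk file name to its number of entries.

    Assumes each file is JSON containing a list. Files that cannot be
    read or parsed, or that hold something other than a list, map to None.
    """
    summary = {}
    for path in sorted(Path(directory).glob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            summary[path.name] = None
            continue
        summary[path.name] = len(data) if isinstance(data, list) else None
    return summary
```

Calling `summarize_crosswalks("cl/search/crosswalks")` from the project root would then give a quick view of which files exist and roughly how many matches each contains.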

Screenshots:

CAP Crosswalk File (Sample Data):

Imported PDFs (Local Storage):


FRodriguez18

Looks good! Didn't approve it because I think John proposed some changes, but this looks good to me 👍

cl/search/tests/test_generate_cap_crosswalk.py

mlissner · 2024-09-17T22:53:18Z

@flooie I just put this on your backlog to prioritize. The background is that we want to have the Harvard PDFs imported into CL and we want to do so regularly. Grab me when you have a sec, and I can give the details before you review or assign to somebody else to review.

quevon24

In general, the functions need type hints, and their docstrings should be updated to follow the format used in CourtListener.

quevon24 · 2024-09-19T19:28:42Z

cl/search/management/commands/generate_cap_crosswalk.py

+
+    def find_matching_case(self, case_meta):
+        try:
+            citation = case_meta["citations"][0]["cite"].split()


It would be great if you could check the citation with get_citations() from eyecite and use the data from the returned object to filter the OpinionCluster objects. Here is an example: https://github.com/freelawproject/courtlistener/blob/main/cl/corpus_importer/management/commands/harvard_opinions.py#L636

jtmst · 2024-09-20

I've added this, but I'll note that it could affect performance when run against all the data. Run speed doesn't seem to be a huge concern at the moment, but it's something to keep in mind if the run time gets out of control in testing.

jtmst · 2024-09-20

Agreed that this will help accuracy, though. Thanks for the suggestion.
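
To illustrate why positional indexing on a bare `.split()` is fragile here: reporter abbreviations can contain spaces, so the token count varies between citations. The sketch below is a standard-library stand-in, not eyecite and not the code in this PR. `split_cite` is a hypothetical helper that takes the first token as the volume, the last as the page, and everything in between as the reporter.

```python
def split_cite(cite):
    """Split a 'VOLUME REPORTER PAGE' string into its three parts.

    Hypothetical helper. Returns None when the string has fewer than
    three tokens or does not start with a numeric volume.
    """
    tokens = cite.split()
    if len(tokens) < 3 or not tokens[0].isdigit():
        return None
    # First token is the volume, last is the page, the rest is the reporter.
    volume, *reporter, page = tokens
    return volume, " ".join(reporter), page


print(split_cite("100 A.2d 1"))
print(split_cite("100 F. Supp. 2d 1"))
```

A citation parser such as eyecite also handles reporter variations, which this sketch does not attempt.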

cl/search/management/commands/generate_cap_crosswalk.py

quevon24 · 2024-09-19T19:32:31Z

cl/search/management/commands/import_harvard_pdfs.py

+logger = logging.getLogger(__name__)
+
+
+class HarvardPDFStorage(S3Boto3Storage):


Maybe this could be in cl/lib/storage.py, what do you think @flooie?

flooie · 2024-09-23

Yeah, I think we should keep anything storage-related together.

cl/search/tests/test_generate_cap_crosswalk.py

cl/search/tests/test_import_harvard_pdfs.py


jtmst and others added 8 commits September 10, 2024 09:41

crosswalk and import script testing and wip

e345a3b

Testing work for import and crosswalk scripts

480ccf9

import logic, comments, and crosswalk test

bfbc594

comments for crosswalk script

53178c7

[pre-commit.ci] auto fixes from pre-commit.com hooks

d91fb44

for more information, see https://pre-commit.ci

Merge branch 'main' into generate-crosswalk-and-import-cap-pdf

05d5690

modify test env vars

3b6703e

sample crosswalk for test

bf462c3

FRodriguez18 reviewed Sep 12, 2024


Updated vars/tests, comments for new vars

2d1a6ff

jtmst requested review from FRodriguez18 and mlissner and removed request for FRodriguez18 September 16, 2024 16:34

jtmst marked this pull request as ready for review September 16, 2024 16:35

quevon24 reviewed Sep 19, 2024


jtmst and others added 3 commits September 20, 2024 10:07

Merge remote-tracking branch 'origin/main' into generate-crosswalk-and-import-cap-pdf

67eca14

PR Feedback, cleanup, simplify tests for errors

f9869a2

[pre-commit.ci] auto fixes from pre-commit.com hooks

12aa4e1


jtmst requested a review from quevon24 September 23, 2024 20:39

Merge branch 'main' into generate-crosswalk-and-import-cap-pdf

81d8b22
