Generate crosswalk and import cap pdf #4442

Open
wants to merge 13 commits into
base: main
from

jtmst
Collaborator

@jtmst jtmst commented Sep 11, 2024

Import Harvard Case Law Access Project (CAP) PDFs to CourtListener

Issues Addressed:

Changes Implemented:

  1. Created a management command to generate crosswalk files between CAP and CL data.
  2. Developed a management command to import CAP PDFs to CL using the generated crosswalk.
  3. Implemented S3 storage integration for PDF storage (configurable for local storage in development).
  4. Added error handling and logging for better debugging and monitoring.
  5. Added settings for the CAP R2 environment variables.
  6. Added tests for both commands.

Testing Instructions:

Prerequisites:

  • Ensure you have the necessary environment variables set for R2 and S3 access.
  • For local testing, configure the storage to use the local file system instead of S3.

Steps:

  1. Generate the crosswalk:

     docker exec -it cl-django python /opt/courtlistener/manage.py generate_cap_crosswalk
    

    This command will create crosswalk files in cl/search/crosswalks/.

  2. Import CAP PDFs:

     docker exec -it cl-django python /opt/courtlistener/manage.py import_harvard_pdfs
    

    This command will use the generated crosswalk to fetch and store PDFs.

  3. To test with specific parameters:

     docker exec -it cl-django python /opt/courtlistener/manage.py generate_cap_crosswalk --reporter "A.2d" --volume 100
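
For reference, the --reporter and --volume filters used in step 3 could be declared roughly as follows. This is a sketch with plain argparse, not the command's actual code: Django hands an argparse-based parser to a command's add_arguments(), so the declarations would look much the same there. The helper names build_parser and should_process, the string type for --volume, and the None defaults are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the two optional filters shown in step 3 (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="generate_cap_crosswalk")
    parser.add_argument(
        "--reporter",
        type=str,
        default=None,
        help="Only process this reporter, e.g. 'A.2d'.",
    )
    parser.add_argument(
        "--volume",
        type=str,  # assumption: kept as a string to compare with volume labels
        default=None,
        help="Only process this volume of the selected reporter.",
    )
    return parser


def should_process(reporter: str, volume: str, options: argparse.Namespace) -> bool:
    """Return True when a reporter/volume pair passes the optional filters."""
    if options.reporter is not None and reporter != options.reporter:
        return False
    if options.volume is not None and volume != options.volume:
        return False
    return True


if __name__ == "__main__":
    options = build_parser().parse_args(["--reporter", "A.2d", "--volume", "100"])
    print(should_process("A.2d", "100", options))
    print(should_process("A.2d", "101", options))
```

With no flags given, both filters are None and every reporter/volume pair is processed, which matches the unfiltered run in step 1.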
    
    

Verification:

  • Check the cl/search/crosswalks/ directory for generated crosswalk files.
  • Verify that PDFs are stored either in S3 or local storage (based on configuration).
  • Examine the logs for any errors or warnings during the process.
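
The first check can be scripted. Below is a minimal sketch that assumes it is run from the repository root, so that cl/search/crosswalks/ resolves. The helper name summarize_crosswalk_dir is hypothetical, and the script makes no assumption about the crosswalk file format: it only lists files and flags empty ones.

```python
from pathlib import Path


def summarize_crosswalk_dir(directory: str) -> list[tuple[str, int]]:
    """Return (file name, size in bytes) for each file in the directory.

    Returns an empty list when the directory does not exist, which usually
    means the crosswalk command has not been run yet.
    """
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(
        (path.name, path.stat().st_size)
        for path in root.iterdir()
        if path.is_file()
    )


if __name__ == "__main__":
    for name, size in summarize_crosswalk_dir("cl/search/crosswalks"):
        # An empty file suggests no matches were written for that reporter.
        flag = "  <-- empty" if size == 0 else ""
        print(f"{size:>10}  {name}{flag}")
```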

Screenshots:

CAP Crosswalk File (Sample Data):

Crosswalk File Sample

Imported PDFs (Local Storage):

Imported PDFs


@FRodriguez18 FRodriguez18 left a comment

Looks good! Didn't approve it because I think John proposed some changes, but this looks good to me 👍

cl/search/tests/test_generate_cap_crosswalk.py (outdated, resolved)
@jtmst jtmst requested review from FRodriguez18 and mlissner and removed request for FRodriguez18 September 16, 2024 16:34
@jtmst jtmst marked this pull request as ready for review September 16, 2024 16:35
@mlissner
Member

@flooie I just put this on your backlog to prioritize. The background is that we want to have the Harvard PDFs imported into CL and we want to do so regularly. Grab me when you have a sec, and I can give the details before you review or assign to somebody else to review.

Member

@quevon24 quevon24 left a comment

In general, the functions need type hints, and their docstrings should be updated to follow the format used in CourtListener.


    def find_matching_case(self, case_meta):
        try:
            citation = case_meta["citations"][0]["cite"].split()
Member

It would be great if you could parse the citation with get_citations() from eyecite and use the data from the returned object to filter the OpinionCluster objects. Here is an example: https://github.com/freelawproject/courtlistener/blob/main/cl/corpus_importer/management/commands/harvard_opinions.py#L636
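
To illustrate why structured fields are safer than str.split(), here is a rough, dependency-free sketch. It is not eyecite and not the ORM query: the regex is a simplified stand-in for get_citations(), the list of dicts stands in for filtering OpinionCluster objects, and the names ParsedCite, parse_cite and find_matching_clusters are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class ParsedCite(NamedTuple):
    volume: str
    reporter: str
    page: str


# Simplified "volume reporter page" pattern. This is only a stand-in for
# eyecite's get_citations(), which knows the real reporter abbreviations.
CITE_RE = re.compile(
    r"^\s*(?P<volume>\d+)\s+(?P<reporter>.+?)\s+(?P<page>\d+)\s*$"
)


def parse_cite(cite: str) -> Optional[ParsedCite]:
    """Split a citation string into volume, reporter and page, or None."""
    match = CITE_RE.match(cite)
    if match is None:
        return None
    return ParsedCite(match["volume"], match["reporter"], match["page"])


def find_matching_clusters(cite: str, clusters: list[dict]) -> list[dict]:
    """Keep the records whose volume, reporter and page all match the cite.

    The dicts stand in for database rows; a real implementation would
    filter the queryset on the same three fields.
    """
    parsed = parse_cite(cite)
    if parsed is None:
        return []
    return [
        cluster
        for cluster in clusters
        if (cluster["volume"], cluster["reporter"], cluster["page"]) == parsed
    ]


if __name__ == "__main__":
    # A multi-word reporter yields five tokens with a plain split() ...
    print("1 F. Supp. 2d 50".split())
    # ... but still parses into exactly three fields here.
    print(parse_cite("1 F. Supp. 2d 50"))
```

A reporter such as "F. Supp. 2d" contains spaces, so code that expects exactly three tokens from split() would mis-assign the fields, whereas the parsed object always exposes volume, reporter and page separately.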

Collaborator Author

I've added this, but I'll note that it could affect performance when run against all the data. Run speed doesn't seem to be a huge concern at the moment, but it's something to keep in mind if the run time gets out of control in testing.

Collaborator Author

Agreed that this will help accuracy, though. Thanks for the suggestion.

cl/search/management/commands/generate_cap_crosswalk.py (outdated, resolved)
logger = logging.getLogger(__name__)


class HarvardPDFStorage(S3Boto3Storage):
Member

Maybe this could live in cl/lib/storage.py. What do you think, @flooie?

Contributor

Yeah, I think we should keep anything storage-related together.

cl/search/tests/test_import_harvard_pdfs.py (outdated, resolved)
@jtmst jtmst requested a review from quevon24 September 23, 2024 20:39
5 participants