Generate crosswalk and import cap pdf #4442
Looks good! Didn't approve it because I think John proposed some changes, but this looks good to me 👍
@flooie I just put this on your backlog to prioritize. The background is that we want to have the Harvard PDFs imported into CL and we want to do so regularly. Grab me when you have a sec, and I can give the details before you review or assign to somebody else to review.
In general, the functions need type hints, and their docstrings should be updated to match the format used in courtlistener.
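One way to read this request is sketched below: a small helper with type hints and a reST-style docstring. The helper name `parse_cap_citation`, the `:param:`/`:return:` docstring layout, and the split-into-parts logic are assumptions made for illustration, not code from this PR or a confirmed courtlistener convention.

```python
from typing import Any, Optional


def parse_cap_citation(
    case_meta: dict[str, Any],
) -> Optional[tuple[str, str, str]]:
    """Split the first citation in CAP case metadata into its parts.

    :param case_meta: CAP case metadata containing a "citations" list
    :return: A (volume, reporter, page) tuple, or None if no usable
        citation is present
    """
    try:
        parts = case_meta["citations"][0]["cite"].split()
    except (KeyError, IndexError, TypeError, AttributeError):
        # Missing key, empty list, or a non-string citation value.
        return None
    if len(parts) < 3:
        return None
    # Assumption: the first token is the volume, the last is the page,
    # and everything in between is the reporter abbreviation.
    return parts[0], " ".join(parts[1:-1]), parts[-1]
```

The explicit return type makes the "no match" case visible to callers, which a bare `try` around the indexing would otherwise hide.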
def find_matching_case(self, case_meta):
    try:
        citation = case_meta["citations"][0]["cite"].split()
It would be great if you could parse the citation with get_citations() from eyecite and use the data from the returned object to filter the OpinionCluster objects. Here is an example: https://github.com/freelawproject/courtlistener/blob/main/cl/corpus_importer/management/commands/harvard_opinions.py#L636
I've added this, but note that it could affect performance when run against all the data. Run speed doesn't seem to be a huge concern at the moment, but it's something to keep in mind if the run time gets out of control in testing.
Agreed that this will help accuracy, though. Thanks for the suggestion.
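For readers following this thread, the suggestion can be sketched roughly as below. It assumes that a citation returned by eyecite's `get_citations()` exposes a `groups` dict with `"volume"`, `"reporter"`, and `"page"` keys and a `corrected_reporter()` method. The helper `citation_lookup` and the `citations__*` lookup names are hypothetical, and `FakeCitation` is only a stand-in so the snippet runs without eyecite or Django installed.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class FakeCitation:
    """Stand-in with the attributes this sketch assumes on an eyecite citation."""

    groups: dict[str, str] = field(default_factory=dict)

    def corrected_reporter(self) -> str:
        # The stand-in returns the reporter unchanged.
        return self.groups["reporter"]


def citation_lookup(cite: Any) -> dict[str, str]:
    """Build queryset filter kwargs from a parsed citation.

    :param cite: Object with a `groups` dict and `corrected_reporter()`
    :return: Keyword arguments for a queryset filter (lookup names assumed)
    """
    return {
        "citations__volume": cite.groups["volume"],
        "citations__reporter": cite.corrected_reporter(),
        "citations__page": cite.groups["page"],
    }


# With eyecite and the Django models available, usage might look like:
#   from eyecite import get_citations
#   cites = get_citations(case_meta["citations"][0]["cite"])
#   OpinionCluster.objects.filter(**citation_lookup(cites[0]))
example = FakeCitation(groups={"volume": "1", "reporter": "U.S.", "page": "1"})
print(citation_lookup(example))
```

Filtering on parsed fields rather than on a whitespace split avoids mismatches for reporters whose abbreviation contains spaces, at the cost of the extra parsing time noted above.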
logger = logging.getLogger(__name__)


class HarvardPDFStorage(S3Boto3Storage):
Maybe this could live in cl/lib/storage.py. What do you think, @flooie?
Yeah - I think we should move anything storage related together
Import Harvard Case Law Access Project (CAP) PDFs to CourtListener
Issues Addressed:
Changes Implemented:
Testing Instructions:
Prerequisites:
Steps:
Generate the crosswalk:
This command will create crosswalk files in cl/search/crosswalks/.
Import CAP PDFs:
This command will use the generated crosswalk to fetch and store PDFs.
To test with specific parameters:
Verification:
Check the cl/search/crosswalks/ directory for generated crosswalk files.
Screenshots:
CAP Crosswalk File (Sample Data):
Imported PDFs (Local Storage):