Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Court name issues #129

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

bbernicker
Copy link
Contributor

@bbernicker bbernicker commented Sep 22, 2022

This PR addresses #128. The previous code returned the court ID of the first court for which the courts-db cite string started with the court abbreviation eyecite detected in the parenthetical. There was a comment on that code explaining that it used startswith instead of requiring a match because the court abbreviations from eyecite were often missing final punctuation.

This PR looks for courts in courts-db for which the cite string, with or without its final character, matches. This solution is not perfect because the citation_string field in courts-db has many duplicates: for 1981 courts, there are only 1174 unique citation strings. On the other hand, allowing matches without regard to the last character does not seem to make the problem worse, as there are only two citation strings which are the same but for their last character.

Ultimately, ensuring that Eyecite can correctly determine the court which rendered an opinion from a citation will require cleaning up courts-db so that duplicate citation_strings can be disambiguated using other info (such as reporter) and so that alternate citation_strings are permitted for courts (see freelawproject/courts-db#53). We should also add a more robust system for determining the rendering court from the reporter (e.g. with vendor neutral citations) or from neighboring parallel cites (esp. any in the metadata.extras field).

Finally, I added some tests to check that Eyecite is handling court names reasonably well.

The comment above the buggy code indicates that it was trying to account for missing punctuation. I did this by allowing matches where the last character of the canonical form is different/missing.

I also created a test file to make sure Eyecite gets the right court names when fed some of the most common courts.
This commit updates the unit test which ensures that Eyecite is returning the correct court names.
@bbernicker
Copy link
Contributor Author

I ran the tests natively and they all passed. Most of the commit checks that are failing seem to be due to my linter allowing slightly wider lines than your linter (a problem I have noticed while commiting to other FLP projects, and that I should probably address), but I don't know what about this commit could be affecting fast_diff_match_patch and causing build 3.10 to fail.

@flooie
Copy link
Contributor

flooie commented Sep 23, 2022

I'm actively working on the tests here in courts-db as im putting in some medium sized changes and thinking about how it should work and could work better.

@mlissner
Copy link
Member

@flooie can you also opine about how this bug affects CourtListener itself?

@bbernicker
Copy link
Contributor Author

I'm actively working on the tests here in courts-db as im putting in some medium sized changes and thinking about how it should work and could work better.

If there is anything I can do to help, I'd be happy to jump on a call to talk though it.

@flooie
Copy link
Contributor

flooie commented Sep 26, 2022

@bbernicker are you in our slack channel yet?

@bbernicker
Copy link
Contributor Author

@bbernicker are you in our slack channel yet?

No not yet.

@mlissner
Copy link
Member

Just sent an invite, @bbernicker.

@bbernicker
Copy link
Contributor Author

Any updates on this PR @flooie?

@mattdahl
Copy link
Contributor

@bbernicker I was just reviewing this PR in light of #144 and the changes made to courts-db in #135 (comment). I think your proposed change is still appropriate and necessary, but it's a little complicated.

Right now, we use startswith() to check for court string equality. The idea is that we want to match edge cases with missing final periods, e.g., 2d. Cir in an parsed citation should match 2d. Cir. from courts-db.

This is problematic, as you point out, because something like Tex. will match any number of longer strings from courts-db, e.g., Tex. Rev., Tex. Rev. Trib., etc., in addition to the simple Tex. court that it should match solely. So you propose replacing startswith() with a check for exact equality first, and, in the alternative, equality minus the final character.

This solves the Tex.-style collision problem, but introduces a new one: What if there are legitimate substrings that should be matched, but that differ in more ways than just the last character? For example, we want Commonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992) to be linked to the pasuperct slug, but the citation string for that slug is Pa. Super. Ct.! Since the two strings differ by more than just a missing period (one is missing the full Ct. suffix), they will not be matched under your code change, though they would be under the existing greedy startswith() approach.

I still think getting rid of startswith() is the better option, but we may want to consider this new kind of error that doing so will introduce. One solution, I think, would be to enforce strict equality first, and then use startswith() only as a backup, but I'd have to think about how to implement that efficiently. In any case, the issue your PR is addressing is not at all moot (as I had erroneously thought before)!

@bbernicker
Copy link
Contributor Author

@bbernicker I was just reviewing this PR in light of #144 and the changes made to courts-db in #135 (comment). I think your proposed change is still appropriate and necessary, but it's a little complicated.

Right now, we use startswith() to check for court string equality. The idea is that we want to match edge cases with missing final periods, e.g., 2d. Cir in an parsed citation should match 2d. Cir. from courts-db.

This is problematic, as you point out, because something like Tex. will match any number of longer strings from courts-db, e.g., Tex. Rev., Tex. Rev. Trib., etc., in addition to the simple Tex. court that it should match solely. So you propose replacing startswith() with a check for exact equality first, and, in the alternative, equality minus the final character.

This solves the Tex.-style collision problem, but introduces a new one: What if there are legitimate substrings that should be matched, but that differ in more ways than just the last character? For example, we want Commonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992) to be linked to the pasuperct slug, but the citation string for that slug is Pa. Super. Ct.! Since the two strings differ by more than just a missing period (one is missing the full Ct. suffix), they will not be matched under your code change, though they would be under the existing greedy startswith() approach.

I still think getting rid of startswith() is the better option, but we may want to consider this new kind of error that doing so will introduce. One solution, I think, would be to enforce strict equality first, and then use startswith() only as a backup, but I'd have to think about how to implement that efficiently. In any case, the issue your PR is addressing is not at all moot (as I had erroneously thought before)!

Thanks Matt. This is really helpful. Let me give this problem some thought.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Pull Requests
Development

Successfully merging this pull request may close these issues.

4 participants