Court name issues #129

bbernicker · 2022-09-22T02:57:32Z

This PR addresses #128. The previous code returned the court ID of the first court for which the courts-db cite string started with the court abbreviation eyecite detected in the parenthetical. There was a comment on that code explaining that it used startswith instead of requiring a match because the court abbreviations from eyecite were often missing final punctuation.

This PR looks for courts in courts-db for which the cite string, with or without its final character, matches. This solution is not perfect because the citation_string field in courts-db has many duplicates: for 1981 courts, there are only 1174 unique citation strings. On the other hand, allowing matches without regard to the last character does not seem to make the problem worse, as there are only two citation strings which are the same but for their last character.

Ultimately, ensuring that Eyecite can correctly determine the court which rendered an opinion from a citation will require cleaning up courts-db so that duplicate citation_strings can be disambiguated using other info (such as reporter) and so that alternate citation_strings are permitted for courts (see freelawproject/courts-db#53). We should also add a more robust system for determining the rendering court from the reporter (e.g. with vendor neutral citations) or from neighboring parallel cites (esp. any in the metadata.extras field).

Finally, I added some tests to check that Eyecite is handling court names reasonably well.

The comment above the buggy code indicates that it was trying to account for missing punctuation. I did this by allowing matches where the last character of the canonical form is different/missing. I also created a test file to make sure Eyecite gets the right court names when fed some of the most common courts.

This commit updates the unit test which ensures that Eyecite is returning the correct court names.

bbernicker · 2022-09-22T03:08:38Z

I ran the tests natively and they all passed. Most of the commit checks that are failing seem to be due to my linter allowing slightly wider lines than your linter (a problem I have noticed while commiting to other FLP projects, and that I should probably address), but I don't know what about this commit could be affecting fast_diff_match_patch and causing build 3.10 to fail.

flooie · 2022-09-23T16:01:03Z

I'm actively working on the tests here in courts-db as im putting in some medium sized changes and thinking about how it should work and could work better.

mlissner · 2022-09-23T16:16:59Z

@flooie can you also opine about how this bug affects CourtListener itself?

bbernicker · 2022-09-23T18:41:11Z

I'm actively working on the tests here in courts-db as im putting in some medium sized changes and thinking about how it should work and could work better.

If there is anything I can do to help, I'd be happy to jump on a call to talk though it.

flooie · 2022-09-26T15:05:42Z

@bbernicker are you in our slack channel yet?

bbernicker · 2022-09-26T15:15:59Z

@bbernicker are you in our slack channel yet?

No not yet.

mlissner · 2022-09-26T17:36:23Z

Just sent an invite, @bbernicker.

bbernicker · 2022-11-26T19:21:42Z

Any updates on this PR @flooie?

…reelawproject#129).

mattdahl · 2023-09-22T20:10:12Z

@bbernicker I was just reviewing this PR in light of #144 and the changes made to courts-db in #135 (comment). I think your proposed change is still appropriate and necessary, but it's a little complicated.

Right now, we use startswith() to check for court string equality. The idea is that we want to match edge cases with missing final periods, e.g., 2d. Cir in an parsed citation should match 2d. Cir. from courts-db.

This is problematic, as you point out, because something like Tex. will match any number of longer strings from courts-db, e.g., Tex. Rev., Tex. Rev. Trib., etc., in addition to the simple Tex. court that it should match solely. So you propose replacing startswith() with a check for exact equality first, and, in the alternative, equality minus the final character.

This solves the Tex.-style collision problem, but introduces a new one: What if there are legitimate substrings that should be matched, but that differ in more ways than just the last character? For example, we want Commonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992) to be linked to the pasuperct slug, but the citation string for that slug is Pa. Super. Ct.! Since the two strings differ by more than just a missing period (one is missing the full Ct. suffix), they will not be matched under your code change, though they would be under the existing greedy startswith() approach.

I still think getting rid of startswith() is the better option, but we may want to consider this new kind of error that doing so will introduce. One solution, I think, would be to enforce strict equality first, and then use startswith() only as a backup, but I'd have to think about how to implement that efficiently. In any case, the issue your PR is addressing is not at all moot (as I had erroneously thought before)!

bbernicker · 2023-09-23T22:27:35Z

@bbernicker I was just reviewing this PR in light of #144 and the changes made to courts-db in #135 (comment). I think your proposed change is still appropriate and necessary, but it's a little complicated.

Right now, we use startswith() to check for court string equality. The idea is that we want to match edge cases with missing final periods, e.g., 2d. Cir in an parsed citation should match 2d. Cir. from courts-db.

This is problematic, as you point out, because something like Tex. will match any number of longer strings from courts-db, e.g., Tex. Rev., Tex. Rev. Trib., etc., in addition to the simple Tex. court that it should match solely. So you propose replacing startswith() with a check for exact equality first, and, in the alternative, equality minus the final character.

This solves the Tex.-style collision problem, but introduces a new one: What if there are legitimate substrings that should be matched, but that differ in more ways than just the last character? For example, we want Commonwealth v. Bauer, 604 A.2d 1098 (Pa. Super. 1992) to be linked to the pasuperct slug, but the citation string for that slug is Pa. Super. Ct.! Since the two strings differ by more than just a missing period (one is missing the full Ct. suffix), they will not be matched under your code change, though they would be under the existing greedy startswith() approach.

I still think getting rid of startswith() is the better option, but we may want to consider this new kind of error that doing so will introduce. One solution, I think, would be to enforce strict equality first, and then use startswith() only as a backup, but I'd have to think about how to implement that efficiently. In any case, the issue your PR is addressing is not at all moot (as I had erroneously thought before)!

Thanks Matt. This is really helpful. Let me give this problem some thought.

bbernicker added 2 commits September 21, 2022 21:42

Update test_CourtsTest.py

e17e8cf

This commit updates the unit test which ensures that Eyecite is returning the correct court names.

Remove unnecessary dependencies from new test file

064272c

Merge branch 'freelawproject:main' into court-name-issues

a3298d4

mattdahl mentioned this pull request Jan 18, 2023

Getting full citation span #135

Open

mattdahl mentioned this pull request Feb 23, 2023

Fix court string matching with whitespace #144

Open

mattdahl added a commit to mattdahl/eyecite that referenced this pull request Sep 22, 2023

fix(find): Refactors overly greedy startswith() court slug check (cf. f…

2d0f7cb

…reelawproject#129).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Court name issues #129

Court name issues #129

bbernicker commented Sep 22, 2022 •

edited

Loading

bbernicker commented Sep 22, 2022

flooie commented Sep 23, 2022

mlissner commented Sep 23, 2022

bbernicker commented Sep 23, 2022

flooie commented Sep 26, 2022

bbernicker commented Sep 26, 2022

mlissner commented Sep 26, 2022

bbernicker commented Nov 26, 2022

mattdahl commented Sep 22, 2023

bbernicker commented Sep 23, 2023

Court name issues #129

Are you sure you want to change the base?

Court name issues #129

Conversation

bbernicker commented Sep 22, 2022 • edited Loading

bbernicker commented Sep 22, 2022

flooie commented Sep 23, 2022

mlissner commented Sep 23, 2022

bbernicker commented Sep 23, 2022

flooie commented Sep 26, 2022

bbernicker commented Sep 26, 2022

mlissner commented Sep 26, 2022

bbernicker commented Nov 26, 2022

mattdahl commented Sep 22, 2023

bbernicker commented Sep 23, 2023

bbernicker commented Sep 22, 2022 •

edited

Loading