Add pipeline to add NER annotations from ParlaMint to ES index #1681

Merged: 21 commits into develop, Nov 21, 2024

Conversation

@BeritJanssen (Contributor) commented Oct 24, 2024

Closes CentreForDigitalHumanities/TextMiNER#7

Originally, I thought I'd tackle this from the TextMiNER repo, but on reflection it seemed easier and more practical to add it to the indexing logic in I-Analyzer.

@Meesch: this is the branch I mentioned that has methods for parsing an annotated dataset from ParlaMint.

@BeritJanssen requested a review from @Meesch on November 7, 2024, 15:21
@lukavdplas (Contributor) left a comment

The implementation looks fine, kudos for rigorous testing. But some of the design choices seem a bit odd to me, see comments.

Comment on lines 135 to 143
def validate_custom_slug(slug: str):
    """
    Reject names which contain characters other than letters, numbers, underscores, hyphens or colons.
    """
    slug_re = re.compile(r"^[\w:-]+$")
    if not slug_re.match(slug):
        raise ValidationError(
            f"{slug} is not valid: it should consist of no other characters than letters, numbers, underscores, hyphens or colons"
        )
@lukavdplas (Contributor) commented Nov 15, 2024

Since this allows colons, it's not actually testing whether the string is a slug (at least, not by any definition of slug that I've heard of), so the function name is misleading.
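
For what it's worth, a quick check against the regex in the snippet above illustrates the point (the example inputs are hypothetical, purely to show the behaviour):

import re

slug_re = re.compile(r"^[\w:-]+$")
print(bool(slug_re.match("speech:ner")))  # True: colons are accepted, which a typical slug pattern would reject
print(bool(slug_re.match("speech ner")))  # False: whitespace is rejected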

Comment on lines 148 to 150
Checks whether colons appear in the field name; raises a ValidationError if the field does not meet one of the following requirements:
- starts with the `ner:` prefix and is a keyword field
- ends with the `:ner` suffix and is an annotated_text field

Why is the division done like this?

It makes sense that it's useful to distinguish between the keyword and annotated text versions of a field, but doing so by using a suffix for one and a prefix for the other is weirdly opaque. You can pick any prefix/suffix you want here, so why not choose something that actually describes (or at least hints at) which is which?
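
For context, a minimal sketch of the convention the docstring describes, not the PR's actual code: the function and parameter names are hypothetical, and it assumes the ValidationError in question is Django's.

from django.core.exceptions import ValidationError

def validate_ner_field_name(name: str, es_mapping_type: str):
    # hypothetical illustration: a `ner:` prefix marks keyword fields,
    # a `:ner` suffix marks annotated_text fields
    if name.startswith('ner:') and es_mapping_type != 'keyword':
        raise ValidationError(
            f'{name}: fields with the `ner:` prefix must be keyword fields'
        )
    if name.endswith(':ner') and es_mapping_type != 'annotated_text':
        raise ValidationError(
            f'{name}: fields with the `:ner` suffix must be annotated_text fields'
        )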

@@ -269,9 +269,10 @@ def parliament_corpora_settings(settings):
"date": "2017-01-31",
"chamber": "Tweede Kamer",
"debate_title": "Report of the meeting of the Dutch Lower House, Meeting 46, Session 23 (2017-01-31)",
"debate_id": "ParlaMint-NL_2017-01-31-tweedekamer-23",
"debate_id": "ParlaMint-NL_2017-01-31-tweedekamer-23.ana",

I think the .ana is used in files that contain annotations, but it's not really part of the debate ID. Since the corpus is already in use, it would also be good to preserve existing field values if possible.

So rather than update the test here, can the .ana be removed from the field value during extraction?
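
A small sketch of what that could look like, with a hypothetical helper that is not part of this PR, stripping the suffix at extraction time instead of changing the expected test value:

def clean_debate_id(raw_id: str) -> str:
    # drop the '.ana' marker that annotated ParlaMint files carry in their identifiers
    return raw_id[:-len('.ana')] if raw_id.endswith('.ana') else raw_id

clean_debate_id('ParlaMint-NL_2017-01-31-tweedekamer-23.ana')
# -> 'ParlaMint-NL_2017-01-31-tweedekamer-23'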

}
for year in range(start.year, end.year):
    for xml_file in glob("{}/{}/*.xml".format(self.data_directory, year)):
        metadata["ner"] = extract_named_entities(xml_file)

This is surprising. I was under the impression that the NER keyword fields were intended to be used for filtering. As a user, I would expect that an NER filter would filter on entities mentioned in the speech, not all entities mentioned in the debate that the speech takes place in.

@BeritJanssen (Author) replied:

I'll have to double-check, but I thought that is the case. It's just that the metadata of the whole file is collected at this point, so the file doesn't have to be reopened for every NER field separately.

@BeritJanssen (Author) added:

Yes, `extract_named_entities` saves the entities ordered by speech id, and the keyword field definitions extract only the entities relevant to each speech. I'll add docstrings to document that.
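
To make that concrete, a rough sketch with hypothetical data shapes (these are not the PR's actual extractors or field names): entities are collected once per debate file, keyed by speech id, and each speech document only reads its own entries.

# hypothetical shape of the per-debate NER metadata
debate_entities = {
    'speech-1': {'PER': ['Mark Rutte'], 'LOC': ['Den Haag']},
    'speech-2': {'PER': [], 'LOC': ['Groningen']},
}

def entities_for_speech(ner_metadata, speech_id, entity_type):
    # pick out only the entities belonging to a single speech
    return ner_metadata.get(speech_id, {}).get(entity_type, [])

entities_for_speech(debate_entities, 'speech-2', 'LOC')  # ['Groningen']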

@BeritJanssen merged commit 6f5787a into develop on Nov 21, 2024
2 checks passed
@BeritJanssen deleted the feature/parlamint-ner branch on November 21, 2024, 09:06
Successfully merging this pull request may close these issues.

Allow parsing named entities from a file