-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add pipeline to add NER annotations from ParlaMint to ES index #1681
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation looks fine, kudos for rigorous testing. But some of the design choices seem a bit odd to me, see comments.
def validate_custom_slug(slug: str): | ||
""" | ||
reject names which contain characters other than colons, hyphens, underscores or alphanumeric | ||
""" | ||
slug_re = re.compile(r"^[\w:-]+$") | ||
if not slug_re.match(slug): | ||
raise ValidationError( | ||
f'{value} cannot be used as a field name: the suffix `:ner` is reserved for annotated_text fields' | ||
f"{slug} is not valid: it should consist of no other characters than letters, numbers, underscores, hyphens or colons" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this allows colons, it's not actually testing whether the string is a slug (at least, from any definition of slug that I've heard), so the function name is misleading.
Checks if colons are in field name, will raise ValidationError if the field does not meet the following requirements: | ||
- starts with `ner:` prefix and is a keyword field | ||
- ends with `:ner` suffix and is an annotated_text field |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this division like this?
It makes sense that it's useful to distinguish between the keyword and annotated text versions of a field, but doing so by using a suffix for one and a prefix for the other is weirdly opaque. You can pick any prefix/suffix you want here, why not choose something that actually describes (or at least hints) which is which?
@@ -269,9 +269,10 @@ def parliament_corpora_settings(settings): | |||
"date": "2017-01-31", | |||
"chamber": "Tweede Kamer", | |||
"debate_title": "Report of the meeting of the Dutch Lower House, Meeting 46, Session 23 (2017-01-31)", | |||
"debate_id": "ParlaMint-NL_2017-01-31-tweedekamer-23", | |||
"debate_id": "ParlaMint-NL_2017-01-31-tweedekamer-23.ana", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the .ana
is used in files that contain annotations, but it's not really part of the debate ID. Since the corpus is already in use, it would also be good to preserve existing field values if possible.
So rather than update the test here, can the .ana
be removed from the field value during extraction?
} | ||
for year in range(start.year, end.year): | ||
for xml_file in glob("{}/{}/*.xml".format(self.data_directory, year)): | ||
metadata["ner"] = extract_named_entities(xml_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is surprising. I was under the impression that the NER keyword fields were intended to be used for filtering. As a user, I would expect that an NER filter would filter on entities mentioned in the speech, not all entities mentioned in the debate that the speech takes place in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll have to double-check, but I thought that this is the case. It's just that the metadata of the whole file is collected at this point, so it doesn't have to be reopened for every ner field separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the extract_named_entities
saves the entities ordered by speech id, and the keyword field definitions extract only the relevant entities for the speech. I'll add docstrings to document that.
Close CentreForDigitalHumanities/TextMiNER#7
Originally, I thought to tackle this from the TextMiNER repo, but when I thought about it, it seemed easier & more practical to add this to the indexing logic in I-Analyzer.
@Meesch : this is the branch I mentioned that has methods for parsing an annotated dataset from ParlaMINT.