Add pipeline to add NER annotations from ParlaMint to ES index #1681

BeritJanssen · 2024-10-24T19:17:14Z

Close CentreForDigitalHumanities/TextMiNER#7

Originally, I planned to tackle this from the TextMiNER repo, but on reflection it seemed easier and more practical to add this to the indexing logic in I-Analyzer.

@Meesch: this is the branch I mentioned that has methods for parsing an annotated dataset from ParlaMint.

lukavdplas

The implementation looks fine, kudos for rigorous testing. But some of the design choices seem a bit odd to me, see comments.

lukavdplas · 2024-11-15T14:17:54Z

backend/addcorpus/validation/creation.py

+def validate_custom_slug(slug: str):
+    """
+    reject names which contain characters other than colons, hyphens, underscores or alphanumeric
+    """
+    slug_re = re.compile(r"^[\w:-]+$")
+    if not slug_re.match(slug):
        raise ValidationError(
-            f'{value} cannot be used as a field name: the suffix `:ner` is reserved for annotated_text fields'
+            f"{slug} is not valid: it should consist of no other characters than letters, numbers, underscores, hyphens or colons"
        )


Since this allows colons, it's not actually testing whether the string is a slug (at least, from any definition of slug that I've heard), so the function name is misleading.
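To illustrate the naming concern, here is a minimal sketch that compares the pattern from the diff with the pattern Django uses for `validate_slug`. The function names `is_valid_field_name` and `is_slug` are hypothetical and not taken from the PR. The sketch uses `re.fullmatch` instead of `^...$` with `match`, because `$` also matches just before a trailing newline; whether that matters for this validator is an assumption.

```python
import re

# Character class from the diff: word characters, colons and hyphens.
FIELD_NAME_RE = re.compile(r"[\w:-]+")
# Pattern used by Django's validate_slug: ASCII letters, digits, hyphens, underscores.
SLUG_RE = re.compile(r"[-a-zA-Z0-9_]+")


def is_valid_field_name(name: str) -> bool:
    """Hypothetical helper: True if the whole string fits the field-name pattern."""
    return FIELD_NAME_RE.fullmatch(name) is not None


def is_slug(name: str) -> bool:
    """Hypothetical helper: True if the whole string is a slug in Django's sense."""
    return SLUG_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["speech", "speech:ner", "speech ner"]:
        print(candidate, is_valid_field_name(candidate), is_slug(candidate))
```

A name such as `speech:ner` passes the field-name check but is not a slug, which is why a name along the lines of "validate field name" describes the function more accurately.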

lukavdplas · 2024-11-15T14:25:16Z

backend/addcorpus/validation/creation.py

+    Checks if colons are in field name, will raise ValidationError if the field does not meet the following requirements:
+    - starts with `ner:` prefix and is a keyword field
+    - ends with `:ner` suffix and is an annotated_text field


Why is this division like this?

It makes sense that it's useful to distinguish between the keyword and annotated text versions of a field, but doing so by using a suffix for one and a prefix for the other is weirdly opaque. You can pick any prefix/suffix you want here, so why not choose something that actually describes (or at least hints at) which is which?
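One way to make the distinction self-describing is to use two suffixes, as the later commit "rename ner: fields to :ner-kw" suggests. The sketch below assumes `:ner` marks annotated_text fields and `:ner-kw` marks keyword fields; the function name `expected_mapping_type` and the use of `ValueError` are illustrative, not the PR's actual validator.

```python
from typing import Optional

# Assumed convention: both variants are marked by a suffix.
NER_TEXT_SUFFIX = ":ner"        # annotated_text version of a field
NER_KEYWORD_SUFFIX = ":ner-kw"  # keyword version of a field


def expected_mapping_type(field_name: str) -> Optional[str]:
    """Return the mapping type implied by a field name, or None if it has no colon.

    Raises ValueError when the name contains a colon but no reserved suffix.
    """
    if ":" not in field_name:
        return None
    if field_name.endswith(NER_KEYWORD_SUFFIX):
        return "keyword"
    if field_name.endswith(NER_TEXT_SUFFIX):
        return "annotated_text"
    raise ValueError(
        f"{field_name}: colons are only allowed in the suffixes "
        f"{NER_TEXT_SUFFIX} and {NER_KEYWORD_SUFFIX}"
    )


if __name__ == "__main__":
    print(expected_mapping_type("speech:ner"))
    print(expected_mapping_type("person:ner-kw"))
```

A validator could then compare the returned type with the field's actual Elasticsearch mapping and reject any mismatch.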

lukavdplas · 2024-11-15T14:35:41Z

backend/corpora/parliament/conftest.py

@@ -269,9 +269,10 @@ def parliament_corpora_settings(settings):
                "date": "2017-01-31",
                "chamber": "Tweede Kamer",
                "debate_title": "Report of the meeting of the Dutch Lower House, Meeting 46, Session 23 (2017-01-31)",
-                "debate_id": "ParlaMint-NL_2017-01-31-tweedekamer-23",
+                "debate_id": "ParlaMint-NL_2017-01-31-tweedekamer-23.ana",


I think the .ana is used in files that contain annotations, but it's not really part of the debate ID. Since the corpus is already in use, it would also be good to preserve existing field values if possible.

So rather than update the test here, can the .ana be removed from the field value during extraction?
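A minimal sketch of such a normalization step, assuming the identifier read from the file simply carries a trailing `.ana`. The function name `normalize_debate_id` is hypothetical; where this would hook into the extractor is not shown in the diff.

```python
ANNOTATION_SUFFIX = ".ana"  # marker of annotated ParlaMint files, per the review comment


def normalize_debate_id(raw_id: str) -> str:
    """Strip the annotation suffix so existing debate_id values stay unchanged."""
    if raw_id.endswith(ANNOTATION_SUFFIX):
        return raw_id[: -len(ANNOTATION_SUFFIX)]
    return raw_id


if __name__ == "__main__":
    print(normalize_debate_id("ParlaMint-NL_2017-01-31-tweedekamer-23.ana"))
```

Identifiers without the suffix pass through untouched, so the function is safe to apply to both flat and annotated sources.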

lukavdplas · 2024-11-15T14:38:27Z

backend/corpora/parliament/netherlands.py

+        }
+        for year in range(start.year, end.year):
+            for xml_file in glob("{}/{}/*.xml".format(self.data_directory, year)):
+                metadata["ner"] = extract_named_entities(xml_file)


This is surprising. I was under the impression that the NER keyword fields were intended to be used for filtering. As a user, I would expect that an NER filter would filter on entities mentioned in the speech, not all entities mentioned in the debate that the speech takes place in.

BeritJanssen · 2024-11-20

I'll have to double-check, but I think that is already the case. The metadata of the whole file is collected at this point only so that the file doesn't have to be reopened for every NER field separately.

BeritJanssen · 2024-11-21

Yes, extract_named_entities saves the entities ordered by speech id, and the keyword field definitions extract only the relevant entities for the speech. I'll add docstrings to document that.
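The pattern described here (parse the file once, group entities by speech id, then let each field look up only its own speech) could be sketched as follows. This is not the PR's `extract_named_entities`: the function names, the returned structure, and the simplified sample XML (no TEI namespace; `u`, `name` and `w` elements) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes the xml:id attribute under the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """
<body>
  <u xml:id="speech-1"><name type="PER"><w>Mark</w><w>Rutte</w></name><w>spoke</w></u>
  <u xml:id="speech-2"><name type="LOC"><w>Utrecht</w></name></u>
</body>
"""


def collect_entities_by_speech(xml_text):
    """Parse once and return {speech_id: {entity_type: [entity, ...]}}."""
    root = ET.fromstring(xml_text)
    entities = defaultdict(lambda: defaultdict(list))
    for speech in root.iter("u"):
        speech_id = speech.get(XML_ID)
        for name in speech.iter("name"):
            # Join the tokens of a multi-word entity into one string.
            text = " ".join(w.text for w in name.iter("w") if w.text)
            entities[speech_id][name.get("type")].append(text)
    return entities


def entities_for_speech(entities, speech_id, entity_type):
    """Look up the entities of one type for a single speech."""
    return entities.get(speech_id, {}).get(entity_type, [])


if __name__ == "__main__":
    per_file = collect_entities_by_speech(SAMPLE)
    print(entities_for_speech(per_file, "speech-1", "PER"))
```

With this layout, a filter on a keyword field only sees entities from the speech itself, even though the lookup table is built per file.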

BeritJanssen added 13 commits October 3, 2024 11:58

feat: starting ner pipeline when indexing from ParlaMINT source

859c83a

feat: split ParliamentNetherlands into subcorpora

6440ff0

replace flat xml with annotated xml

0b6195f

add ner_keyword_field mapping

ef9b158

add data_directory to subcorpora

b7a5664

feat: get speech_ner content from soup

9edb2d5

add date_range filter to sources

19f9ec0

add unit tests and fix formatting

81cd3c8

add ner keyword fields and shorten test xml

1e3f44f

add validators for annotated_text mapping

fe3833c

fix: turn field.name to CharField

dc9047b

fix unit tests

ccdccaf

Merge branch 'develop' into feature/parlamint-ner

f48e039

BeritJanssen requested a review from lukavdplas November 7, 2024 11:22

BeritJanssen added 3 commits November 7, 2024 12:27

fix double quote mayhem

6f26c90

improve quote consistency

8fcd2cc

fix validator test

40090c6

BeritJanssen requested a review from Meesch November 7, 2024 15:21

lukavdplas requested changes Nov 15, 2024

BeritJanssen added 5 commits November 21, 2024 09:00

rename validate_custom_slug

ebf7bf2

normalize debate_id

a176f94

add docstrings

92f77a5

rename ner: fields to :ner-kw

0a06b4e

Merge branch 'develop' into feature/parlamint-ner

7eba817

BeritJanssen merged commit 6f5787a into develop Nov 21, 2024
2 checks passed

BeritJanssen deleted the feature/parlamint-ner branch November 21, 2024 09:06
