ParlaMint 5 #1975

Meesch · 2026-01-05T17:42:55Z

At long last, it is here, all 29 ParlaMint corpora! This PR includes:

- A big parlamint-all corpus definition in parlamint_all.py that describes all the fields
- 29 separate child corpora in parlamint_subcorpora.py
- Three auxiliary files in parlamint_utils: one for constants, one for the extract logic, and one for the transform logic. This split keeps the sizes of these files somewhat under control.
- Some additional stopword lists from sources outside of nltk, for languages that nltk does not support

When reviewing, it is most useful to be able to test the index locally, so if you want the test data, I can send it over. The main files to review are parlamint_all.py and parlamint_subcorpora.py.

lukavdplas

Good work!

As we already discussed, I think it would be better if we used a shared alias for the ParliamentAll corpus, and there could be less repetition in the subcorpus definitions.

Other than that, this looks good. I added some minor comments here. The only real issue is the changes to corpora/parliament/utils/field_defaults.py, which would affect the P&P corpora as well.

lukavdplas · 2026-01-08T11:56:25Z

backend/addcorpus/es_settings.py

+    if os.path.exists(nltk_path):
+        with open(nltk_path) as infile:
+            words = [line.strip() for line in infile.readlines()]
+            return words
+    elif os.path.exists(supplementary_path):
+        with open(supplementary_path) as infile:
            words = [line.strip() for line in infile.readlines()]
            return words


To avoid repetition:

if os.path.exists(nltk_path):
    return _read_stopwords_file(nltk_path)
elif os.path.exists(supplementary_path):
    return _read_stopwords_file(supplementary_path)
# else: ...

def _read_stopwords_file(path: str) -> List[str]:
    with open(path) as infile:
        return [line.strip() for line in infile.readlines()]
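For reference, a self-contained sketch of this suggestion. The wrapper name `load_stopwords` and the empty-list fallback are assumptions for illustration only; the comment above leaves the `else` branch open, and the real function in es_settings.py may look different.

```python
import os
from typing import List


def _read_stopwords_file(path: str) -> List[str]:
    """Read one stopword per line, stripping surrounding whitespace."""
    with open(path) as infile:
        return [line.strip() for line in infile]


def load_stopwords(nltk_path: str, supplementary_path: str) -> List[str]:
    # Hypothetical wrapper: prefer the nltk list, fall back to the supplementary one.
    if os.path.exists(nltk_path):
        return _read_stopwords_file(nltk_path)
    elif os.path.exists(supplementary_path):
        return _read_stopwords_file(supplementary_path)
    # Assumption: the review leaves this branch unspecified; an empty list is a placeholder.
    return []
```

Both branches now share one reader, so a later change (for example, to the file encoding) only has to be made in one place.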

lukavdplas · 2026-01-08T11:58:54Z

backend/addcorpus/stopword_data/supplementary_data/icelandic

@@ -0,0 +1,58 @@
+


I think this empty line is not intentional?

lukavdplas · 2026-01-08T12:02:54Z

backend/corpora/parliament/clarin_parlamint/description/parlamint_all.md

+
+Overcoming the obstacles of multilinguality and diversity of data formats, the project created interoperable and comparable corpora that facilitate transnational comparisons and enhance the understanding of parliamentary discourse and its societal impact locally and globally. 
+
+The corpora are available in open access and are a valuable source of information for researchers in a broad range of SSH disciplines, such as political and social sciences, media and communication studies, history and language studies, and are also relevant to policy makers.


Suggested change

-The corpora are available in open access and are a valuable source of information for researchers in a broad range of SSH disciplines, such as political and social sciences, media and communication studies, history and language studies, and are also relevant to policy makers.

+The corpora are available in open access and are a valuable source of information for researchers in a broad range of social sciences and humanities disciplines, such as political science, media and communication studies, history, and language studies, and are also relevant to policy makers.

lukavdplas · 2026-01-08T12:05:59Z

backend/corpora/parliament/clarin_parlamint/description/parlamint_all.md

+
+The ParlaMint project is now being further developed in the OSCARS project [ParlaCAP](https://clarinsi.github.io/parlacap/), which will provide a robust dataset for tracking political agenda-setting across European parliaments. 
+
+The latest versions of the corpora are available under the CC BY license:


Suggested change

-The latest versions of the corpora are available under the CC BY license:

+The latest versions of the corpora are available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/):

lukavdplas · 2026-01-08T12:09:18Z

backend/corpora/parliament/clarin_parlamint/images/parlamint_AT.jpg

I like the pictures :) Also, good solution to make them look different from the P&P corpora.

Please add a licence file with the source for these. (This is mostly for compliance reasons, but adding the source is also convenient for developers.)

lukavdplas · 2026-01-09T12:21:28Z

backend/addcorpus/stopword_data/supplementary_data/README.md

👍 👍 👍

lukavdplas · 2026-01-09T12:28:51Z

backend/corpora/parliament/clarin_parlamint/images/parlamint.png

The circular logo would look a bit weird in the corpus overview, I think. This version would look better:

lukavdplas · 2026-01-09T12:44:30Z

backend/corpora/parliament/parliament.py

Why edit the Parliament class?

lukavdplas · 2026-01-09T12:52:19Z

backend/corpora/parliament/utils/field_defaults.py

The changes to field_defaults here may be a problem because they will also affect the P&P corpora. It makes sense to import field constructors from here, but maybe you can add an argument to the function to get different values? (e.g. corpus='parlamint'/corpus='peopleparliament')
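A minimal sketch of what such an argument could look like. `FieldSketch` is a stand-in for `FieldDefinition`, and the `party` constructor, the `searchable` flag, and the set of accepted `corpus` values are all assumptions for illustration, not the actual contents of field_defaults.py.

```python
from dataclasses import dataclass


@dataclass
class FieldSketch:
    """Stand-in for FieldDefinition, reduced to what the example needs."""
    name: str
    searchable: bool


def party(corpus: str = 'peopleparliament') -> FieldSketch:
    # Hypothetical constructor: the default keeps the existing P&P behaviour,
    # and only callers that pass corpus='parlamint' get the changed value.
    if corpus not in ('peopleparliament', 'parlamint'):
        raise ValueError(f'unknown corpus family: {corpus}')
    return FieldSketch(name='party', searchable=(corpus != 'parlamint'))
```

Because the default value reproduces the old behaviour, existing P&P corpus definitions that call `party()` without arguments would not need to change.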

lukavdplas · 2026-01-09T12:56:51Z

backend/corpora/parliament/utils/parlamint.py

+from addcorpus.python_corpora.filters import MultipleChoiceFilter
+from addcorpus.python_corpora.corpus import FieldDefinition
+
+"""


Tiny Python comment: if you add a module docstring, it should go at the top of the file, before the imports. Then it will be recognised by help() / editor tooltips / etc.
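To illustrate the difference, here is a small check using the standard `ast` module. The two source strings are made-up examples, not the contents of parlamint.py.

```python
import ast

# Made-up module sources: the same string literal, placed before or after the import.
DOCSTRING_FIRST = '"""Utilities for ParlaMint corpora."""\nimport os\n'
DOCSTRING_AFTER_IMPORTS = 'import os\n"""Utilities for ParlaMint corpora."""\n'


def module_docstring(source: str):
    """Return the module docstring of the given source, or None if it has none."""
    return ast.get_docstring(ast.parse(source))


print(module_docstring(DOCSTRING_FIRST))          # Utilities for ParlaMint corpora.
print(module_docstring(DOCSTRING_AFTER_IMPORTS))  # None
```

In the second case the string is just an expression statement that is evaluated and discarded, so the module has no docstring.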

jgonggrijp

Was just passing by, took a peek out of curiosity, and noticed two things that I have questions about.

Congratulations on the large amount of work done!

jgonggrijp · 2026-01-20T14:52:26Z

backend/corpora/parliament/clarin_parlamint/parlamint_all.py

+    @property
+    def fields(self):
+        return self._fields
+
+    @fields.setter
+    def fields(self, value):
+        self._fields = value


What is the added value of these accessors over just letting self.fields be a regular attribute?
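For comparison, a sketch with two hypothetical classes (neither is taken from the PR). From the caller's side they behave the same, since the getter and setter only pass the value through.

```python
class WithAccessors:
    """Pass-through property, as in the hunk above."""

    def __init__(self, fields):
        self._fields = fields

    @property
    def fields(self):
        return self._fields

    @fields.setter
    def fields(self, value):
        self._fields = value


class WithAttribute:
    """The same interface with a regular attribute."""

    def __init__(self, fields):
        self.fields = fields


for cls in (WithAccessors, WithAttribute):
    corpus = cls(['speech'])
    corpus.fields = corpus.fields + ['date']
    print(cls.__name__, corpus.fields)
```

In general, such accessors only add something when they do extra work (validation, lazy construction) or when they have to override a property inherited from a parent class.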

jgonggrijp · 2026-01-20T14:53:21Z

backend/corpora/parliament/clarin_parlamint/parlamint_subcorpora.py

+        self.speech = FieldDefinition(
+            name='speech',
+            display_name='Speech',
+            description='The transcribed speech in the original language',
+            es_mapping = main_content_mapping(
+                token_counts=True,
+                stopword_analysis=True,
+                stemming_analysis=True,
+                language=self.languages[0],
+            ),
+            results_overview=True,
+            search_field_core=True,
+            display_type='text_content',
+            visualizations=['wordcloud', 'ngram'],
+            csv_core=True,
+            language=self.languages[0],
+        )
+        self.speech.extractor = speech_extractor()
+        self.fields = [self.speech] + [field for field in self.fields if field.name != 'speech']


Why not just override the speech field with regular inheritance?
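One way this could look, as a sketch under assumptions: `FieldSketch` stands in for `FieldDefinition`, and the class names, the `speech()`/`date()` methods, and the language codes are hypothetical. It also assumes the parent corpus builds its field list through overridable methods, which may not match how parlamint_all.py is structured.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FieldSketch:
    """Stand-in for FieldDefinition, reduced to what the example needs."""
    name: str
    language: str


class ParlamintAllSketch:
    languages: List[str] = ['en']

    def speech(self) -> FieldSketch:
        return FieldSketch(name='speech', language='mixed')

    def date(self) -> FieldSketch:
        return FieldSketch(name='date', language='')

    @property
    def fields(self) -> List[FieldSketch]:
        # Fields are built through methods, so a subclass can override one of them.
        return [self.speech(), self.date()]


class ParlamintTurkiyeSketch(ParlamintAllSketch):
    languages = ['tr']

    def speech(self) -> FieldSketch:
        # Only the speech field differs; the rest of the list is inherited.
        return FieldSketch(name='speech', language=self.languages[0])
```

With this shape, the subclass does not have to filter the old speech field out of the inherited list and re-insert its own.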

Meesch and others added 30 commits October 8, 2024 18:01

improve docs (typos and broken link)

4f37a79

update launch json to include loadcorpora

76776ca

preliminary corpus definition turkiye

81463b2

add organisational metadata to parlamint turkiye

ec5c678

add more speaker metadata to parlamint turkiye

ffced2b

add role info for parlamint turkiye

e3fe322

add speaker_constituency to parlamint-turkiye

83554ea

cleanup

0e931e4

add description page for parlamint-turkiye

4169706

cleanup

1c7844d

additional documentation

8c6afb7

Merge branch 'develop' into feature/parlamint-turkey

4d8fad8

update format function to read TEI-XML for Parlamint

fd85664

Merge branch 'develop' into feature/parlamint-turkey

65b6c69

Merge branch 'feature/parlamint-turkey' of https://github.com/CentreForDigitalHumanities/I-analyzer into feature/parlamint-turkey

ec19fc5

change behavior parliamentary corpora - include all languages specified

79704d6

add translated_speech to parlamint corpora

5787f35

update launch.json to delete the index when debugging

2f8c91a

inital implementation of NER Parlamint

2af429d

Merge branch 'develop' into feature/parlamint-turkey

bff7a56

fix merge dependencies in parlamint

a40d616

update field default parliament to not make the keyword fields searchable

a083f35

create preliminary parlamint corpus for all corpora combined

b0434d6

add additional extractor for parlamint date field

9b5acf6

add filter for political leaning

c763457

include non-MP option for parliamentary role

6c26bbb

preliminary parlamint corpus for all countries

f255a73

use translated speech as the main content field for the full parlamint corpus

877bd94

activate all countries for the massive parlamint corpus

b9e10d7

add country to the visualisations

6a9e2aa

Meesch added 22 commits November 17, 2025 14:03

fix capital typo

695fcc8

increase option count for country to include all countries

e801d3f

add preliminary language constants for each parlamint country

a5cc664

Merge branch 'develop' into feature/parlamint-v5

b6de6fc

start dividing subcorpora per country

2517201

implement supplementary stopword lists

44c18de

include stopword lists for several unsupported languages

c9fa0bd

include corpora for each parlamint country

6382303

make speech and speech_translated new FieldDefinitions instead of defaults

8fa208a

change parlamint-all index name

b725e95

fix: wrong variable name

5e3356d

workaround for translated_speech for the UK

0f9d173

include chamber field in parlamint

6e992c2

add government field to parlamint

8f54436

include workaround for translated_speech for the UK

5cf88e4

include names for each parliament

8a0fa70

documentation

ee0aa43

include ministerial role in parlamint corpora

2d05295

improve logic

8c8786f

import cleanup

include markdown description for parlamint

1132e20

docs

harmonise legacy and recent parlamint logic into two files

fc96786

restore legacy parlamint utils file for p&p finland corpus

4e729ca

lukavdplas self-requested a review January 8, 2026 12:47

lukavdplas reviewed Jan 9, 2026

jgonggrijp reviewed Jan 20, 2026

3 participants