Skip to content

Conversation

@Meesch
Copy link
Contributor

@Meesch Meesch commented Jan 5, 2026

At long last, it is here, all 29 ParlaMint corpora! This PR includes:

  • A big parlamint-all corpus definition that describes all the fields in parlamint_all.py
  • 29 separate child-corpora in parlamint_subcorpora.py
  • In the parlamint_utils there are three auxiliary files: one for constants, one for the extract logic, and one for the transform logic. This is all done to keep the sizes of these files somewhat under control.
  • Some additional stopword lists were added from sources outside of nltk for languages that nltk does not support

When reviewing, it is most useful to be able to test the index locally, so if you want the test data, I can send those over. The main files to review are parlamint_all.py and parlamint_subcorpora.py.

Meesch and others added 30 commits October 8, 2024 18:01
@lukavdplas lukavdplas self-requested a review January 8, 2026 12:47
Copy link
Contributor

@lukavdplas lukavdplas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work!

As we already discussed, I think it would be better if we used a shared alias for the ParliamentAll corpus, and there could be less repetition in the subcorpus definitions.

Other than that, this looks good. I added some minor comments here. The only real issue is the changes to corpora/parliament/utils/field_defaults.py, which would affect the P&P corpora as well.

Comment on lines +52 to 59
if os.path.exists(nltk_path):
with open(nltk_path) as infile:
words = [line.strip() for line in infile.readlines()]
return words
elif os.path.exists(supplementary_path):
with open(supplementary_path) as infile:
words = [line.strip() for line in infile.readlines()]
return words
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To avoid repetition:

    if os.path.exists(nltk_path):
        return _read_stopwords_file(nltk_path)
    elif os.path.exists(supplementary_path):
        return _read_stopwords_file(supplementary_path)
    # else: ...

def _read_stopwords_file(path: str) -> List[str]:
    with open(path) as infile:
        return [line.strip() for line in infile.readlines()]

@@ -0,0 +1,58 @@

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this empty line is not intentional?


Overcoming the obstacles of multilinguality and diversity of data formats, the project created interoperable and comparable corpora that facilitate transnational comparisons and enhance the understanding of parliamentary discourse and its societal impact locally and globally.

The corpora are available in open access and are a valuable source of information for researchers in a broad range of SSH disciplines, such as political and social sciences, media and communication studies, history and language studies, and are also relevant to policy makers.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The corpora are available in open access and are a valuable source of information for researchers in a broad range of SSH disciplines, such as political and social sciences, media and communication studies, history and language studies, and are also relevant to policy makers.
The corpora are available in open access and are a valuable source of information for researchers in a broad range of social sciences and humanities disciplines, such as political science, media and communication studies, history, and language studies, and are also relevant to policy makers.


The ParlaMint project is now being further developed in the OSCARS project [ParlaCAP](https://clarinsi.github.io/parlacap/), which will provide a robust dataset for tracking political agenda-setting across European parliaments.

The latest versions of the corpora are available under the CC BY license:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The latest versions of the corpora are available under the CC BY license:
The latest versions of the corpora are available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the pictures :) Also, good solution to make them look different from the P&P corpora.

Please add a licence file with the source for these. (This is mostly for compliance reasons, but adding the source is also convenient for developers.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 👍 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The circular logo would look a bit weird in the corpus overview, I think. This version would look better:

parlamint logo

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why edit the Parliament class?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes to field_defaults here may be a problem because they will also affect the P&P corpora. It makes sense to import field constructors from here, but maybe you can add an argument to the function to get different values? (e.g. corpus='parlamint'/corpus='peopleparliament')

from addcorpus.python_corpora.filters import MultipleChoiceFilter
from addcorpus.python_corpora.corpus import FieldDefinition

"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tiny Python comment: if you add a module docstring, it should go at the top of the file, before the imports. Then it will be recognised by help() / editor tooltips / etc.

Copy link
Contributor

@jgonggrijp jgonggrijp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was just passing by, took a peek out of curiosity, and noticed two things that I have questions about.

Congratulations on the large amount of work done!

Comment on lines +297 to +303
@property
def fields(self):
return self._fields

@fields.setter
def fields(self, value):
self._fields = value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the added value of these accessors over just letting self.fields be a regular attribute?

Comment on lines +58 to +76
self.speech = FieldDefinition(
name='speech',
display_name='Speech',
description='The transcribed speech in the original language',
es_mapping = main_content_mapping(
token_counts=True,
stopword_analysis=True,
stemming_analysis=True,
language=self.languages[0],
),
results_overview=True,
search_field_core=True,
display_type='text_content',
visualizations=['wordcloud', 'ngram'],
csv_core=True,
language=self.languages[0],
)
self.speech.extractor = speech_extractor()
self.fields = [self.speech] + [field for field in self.fields if field.name != 'speech']
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just override the speech field with regular inheritance?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants