-
Notifications
You must be signed in to change notification settings - Fork 3
ParlaMint 5 #1975
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
ParlaMint 5 #1975
Conversation
…orDigitalHumanities/I-analyzer into feature/parlamint-turkey
import cleanup
lukavdplas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good work!
As we already discussed, I think it would be better if we used a shared alias for the ParliamentAll corpus, and there could be less repetition in the subcorpus definitions.
Other than that, this looks good. I added some minor comments here. The only real issue is the changes to corpora/parliament/utils/field_defaults.py, which would affect the P&P corpora as well.
| if os.path.exists(nltk_path): | ||
| with open(nltk_path) as infile: | ||
| words = [line.strip() for line in infile.readlines()] | ||
| return words | ||
| elif os.path.exists(supplementary_path): | ||
| with open(supplementary_path) as infile: | ||
| words = [line.strip() for line in infile.readlines()] | ||
| return words |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid repetition:
if os.path.exists(nltk_path):
return _read_stopwords_file(nltk_path)
elif os.path.exists(supplementary_path):
return _read_stopwords_file(supplementary_path)
# else: ...
def _read_stopwords_file(path: str) -> List[str]:
with open(path) as infile:
return [line.strip() for line in infile.readlines()]| @@ -0,0 +1,58 @@ | |||
|
|
|||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this empty line is not intentional?
|
|
||
| Overcoming the obstacles of multilinguality and diversity of data formats, the project created interoperable and comparable corpora that facilitate transnational comparisons and enhance the understanding of parliamentary discourse and its societal impact locally and globally. | ||
|
|
||
| The corpora are available in open access and are a valuable source of information for researchers in a broad range of SSH disciplines, such as political and social sciences, media and communication studies, history and language studies, and are also relevant to policy makers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| The corpora are available in open access and are a valuable source of information for researchers in a broad range of SSH disciplines, such as political and social sciences, media and communication studies, history and language studies, and are also relevant to policy makers. | |
| The corpora are available in open access and are a valuable source of information for researchers in a broad range of social sciences and humanities disciplines, such as political science, media and communication studies, history, and language studies, and are also relevant to policy makers. |
|
|
||
| The ParlaMint project is now being further developed in the OSCARS project [ParlaCAP](https://clarinsi.github.io/parlacap/), which will provide a robust dataset for tracking political agenda-setting across European parliaments. | ||
|
|
||
| The latest versions of the corpora are available under the CC BY license: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| The latest versions of the corpora are available under the CC BY license: | |
| The latest versions of the corpora are available under the [CC BY 4.0 license](https://creativecommons.org/licenses/by/4.0/): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the pictures :) Also, good solution to make them look different from the P&P corpora.
Please add a licence file with the source for these. (This is mostly for compliance reasons, but adding the source is also convenient for developers.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 👍 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why edit the Parliament class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes to field_defaults here may be a problem because they will also affect the P&P corpora. It makes sense to import field constructors from here, but maybe you can add an argument to the function to get different values? (e.g. corpus='parlamint'/corpus='peopleparliament')
| from addcorpus.python_corpora.filters import MultipleChoiceFilter | ||
| from addcorpus.python_corpora.corpus import FieldDefinition | ||
|
|
||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tiny Python comment: if you add a module docstring, it should go at the top of the file, before the imports. Then it will be recognised by help() / editor tooltips / etc.
jgonggrijp
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was just passing by, took a peek out of curiosity, and noticed two things that I have questions about.
Congratulations on the large amount of work done!
| @property | ||
| def fields(self): | ||
| return self._fields | ||
|
|
||
| @fields.setter | ||
| def fields(self, value): | ||
| self._fields = value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the added value of these accessors over just letting self.fields be a regular attribute?
| self.speech = FieldDefinition( | ||
| name='speech', | ||
| display_name='Speech', | ||
| description='The transcribed speech in the original language', | ||
| es_mapping = main_content_mapping( | ||
| token_counts=True, | ||
| stopword_analysis=True, | ||
| stemming_analysis=True, | ||
| language=self.languages[0], | ||
| ), | ||
| results_overview=True, | ||
| search_field_core=True, | ||
| display_type='text_content', | ||
| visualizations=['wordcloud', 'ngram'], | ||
| csv_core=True, | ||
| language=self.languages[0], | ||
| ) | ||
| self.speech.extractor = speech_extractor() | ||
| self.fields = [self.speech] + [field for field in self.fields if field.name != 'speech'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just override the speech field with regular inheritance?

At long last, it is here, all 29 ParlaMint corpora! This PR includes:
parlamint_all.pyparlamint_subcorpora.pyparlamint_utilsthere are three auxiliary files: one for constants, one for the extract logic, and one for the transform logic. This is all done to keep the sizes of these files somewhat under control.When reviewing, it is most useful to be able to test the index locally, so if you want the test data, I can send those over. The main files to review are
parlamint_all.pyandparlamint_subcorpora.py.