52 commits
4f37a79
improve docs (typos and broken link)
Meesch Oct 8, 2024
76776ca
update launch json to include loadcorpora
Meesch Oct 29, 2024
81463b2
preliminary corpus definition turkiye
Meesch Oct 29, 2024
ec5c678
add organisational metadata to parlamint turkiye
Meesch Oct 31, 2024
ffced2b
add more speaker metadata to parlamint turkiye
Meesch Oct 31, 2024
e3fe322
add role info for parlamint turkiye
Meesch Oct 31, 2024
83554ea
add speaker_constituency to parlamint-turkiye
Meesch Nov 6, 2024
0e931e4
cleanup
Meesch Nov 6, 2024
4169706
add description page for parlamint-turkiye
Meesch Nov 14, 2024
1c7844d
cleanup
Meesch Nov 14, 2024
8c6afb7
additional documentation
Meesch Jan 8, 2025
4d8fad8
Merge branch 'develop' into feature/parlamint-turkey
Meesch Apr 18, 2025
fd85664
update format function to read TEI-XML for Parlamint
Meesch Jul 1, 2025
65b6c69
Merge branch 'develop' into feature/parlamint-turkey
Meesch Jul 1, 2025
ec19fc5
Merge branch 'feature/parlamint-turkey' of https://github.com/CentreF…
Meesch Jul 2, 2025
79704d6
change behavior parliamentary corpora - include all languages specified
Meesch Jul 8, 2025
5787f35
add translated_speech to parlamint corpora
Meesch Jul 9, 2025
2f8c91a
update launch.json to delete the index when debugging
Meesch Jul 9, 2025
2af429d
inital implementation of NER Parlamint
Meesch Aug 7, 2025
bff7a56
Merge branch 'develop' into feature/parlamint-turkey
Meesch Nov 6, 2025
a40d616
fix merge dependencies in parlamint
Meesch Nov 6, 2025
a083f35
update field default parliament to not make the keyword fields search…
Meesch Nov 7, 2025
b0434d6
create preliminary parlamint corpus for all corpora combined
Meesch Nov 16, 2025
9b5acf6
add additional extractor for parlamint date field
Meesch Nov 16, 2025
c763457
add filter for political leaning
Meesch Nov 16, 2025
6c26bbb
include non-MP option for parliamentary role
Meesch Nov 16, 2025
f255a73
preliminary parlamint corpus for all countries
Meesch Nov 17, 2025
877bd94
use translated speech as the main content field for the full parlamin…
Meesch Nov 17, 2025
b9e10d7
activate all countries for the massive parlamint corpus
Meesch Nov 17, 2025
6a9e2aa
add country to the visualisations
Meesch Nov 17, 2025
695fcc8
fix capital typo
Meesch Nov 17, 2025
e801d3f
increase option count for country to include all countries
Meesch Nov 19, 2025
a5cc664
add preliminary language constants for each parlamint country
Meesch Dec 17, 2025
b6de6fc
Merge branch 'develop' into feature/parlamint-v5
Meesch Dec 17, 2025
2517201
start dividing subcorpora per country
Meesch Dec 17, 2025
44c18de
implement supplementary stopword lists
Meesch Dec 18, 2025
c9fa0bd
include stopword lists for several unsupported languages
Meesch Dec 30, 2025
6382303
include corpora for each parlamint country
Meesch Dec 30, 2025
8fa208a
make speech and speech_translated new FieldDefinitions instead of def…
Meesch Jan 2, 2026
b725e95
change parlamint-all index name
Meesch Jan 2, 2026
5e3356d
fix: wrong variable name
Meesch Jan 2, 2026
0f9d173
workaround for translated_speech for the UK
Meesch Jan 2, 2026
6e992c2
include chamber field in parlamint
Meesch Jan 2, 2026
8f54436
add government field to parlamint
Meesch Jan 2, 2026
5cf88e4
include workaround for translated_speech for the UK
Meesch Jan 5, 2026
8a0fa70
include names for each parliament
Meesch Jan 5, 2026
ee0aa43
documentation
Meesch Jan 5, 2026
2d05295
include ministerial role in parlamint corpora
Meesch Jan 5, 2026
8c8786f
improve logic
Meesch Jan 5, 2026
1132e20
include markdown description for parlamint
Meesch Jan 5, 2026
fc96786
harmonise legacy and recent parlamint logic into two files
Meesch Jan 5, 2026
4e729ca
restore legacy parlamint utils file for p&p finland corpus
Meesch Jan 6, 2026
17 changes: 13 additions & 4 deletions .vscode/launch.json
@@ -6,7 +6,7 @@
     "configurations": [
         {
             "name": "django: runserver",
-            "type": "python",
+            "type": "debugpy",
             "request": "launch",
             "program": "${workspaceFolder}/backend/manage.py",
             "args": ["runserver"],
@@ -15,7 +15,7 @@
         },
         {
             "name": "django: shell",
-            "type": "python",
+            "type": "debugpy",
             "request": "launch",
             "program": "${workspaceFolder}/backend/manage.py",
             "args": ["shell"],
@@ -24,10 +24,19 @@
         },
         {
             "name": "django: index",
-            "type": "python",
+            "type": "debugpy",
             "request": "launch",
             "program": "${workspaceFolder}/backend/manage.py",
-            "args": ["index", "${input:corpusName}"],
+            "args": ["index", "${input:corpusName}", "-d"],
+            "django": true,
+            "justMyCode": true
+        },
+        {
+            "name": "django: loadcorpora",
+            "type": "debugpy",
+            "request": "launch",
+            "program": "${workspaceFolder}/backend/manage.py",
+            "args": ["loadcorpora"],
             "django": true,
             "justMyCode": true
         },
44 changes: 28 additions & 16 deletions backend/addcorpus/es_settings.py
@@ -23,32 +23,42 @@ def get_language_key(language_code):
 
     return Language.make(standardize_tag(language_code)).display_name().lower()
 
-def _stopwords_directory() -> str:
-    stopwords_dir = os.path.join(settings.NLTK_DATA_PATH, 'corpora', 'stopwords')
-    if not os.path.exists(stopwords_dir):
+def _nltk_stopwords_directory() -> str:
+    nltk_stopwords_dir = os.path.join(settings.NLTK_DATA_PATH, 'corpora', 'stopwords')
+    if not os.path.exists(nltk_stopwords_dir):
         nltk.download('stopwords', settings.NLTK_DATA_PATH)
-    return stopwords_dir
+    return nltk_stopwords_dir
 
-def _stopwords_path(language_code: str):
-    dir = _stopwords_directory()
+def _nltk_stopwords_path(language_code: str):
+    dir = _nltk_stopwords_directory()
     language = get_language_key(language_code)
     return os.path.join(dir, language)
 
+def _supplementary_path(language_code: str):
+    dir = os.path.join(settings.BASE_DIR, 'addcorpus', 'stopword_data', 'supplementary_data')
+    language = get_language_key(language_code)
+    return os.path.join(dir, language)
+
 def stopwords_available(language_code: str) -> bool:
     if not language_code:
         return False
-    path = _stopwords_path(language_code)
-    return os.path.exists(path)
-
-def get_nltk_stopwords(language_code):
-    path = _stopwords_path(language_code)
-
-    if os.path.exists(path):
-        with open(path) as infile:
+    nltk_path = _nltk_stopwords_path(language_code)
+    supplementary_path = _supplementary_path(language_code)
+    return True if (os.path.exists(nltk_path) or os.path.exists(supplementary_path)) else False
+
+def get_stopwords(language_code):
+    nltk_path = _nltk_stopwords_path(language_code)
+    supplementary_path = _supplementary_path(language_code)
+    if os.path.exists(nltk_path):
+        with open(nltk_path) as infile:
             words = [line.strip() for line in infile.readlines()]
             return words
+    elif os.path.exists(supplementary_path):
+        with open(supplementary_path) as infile:
+            words = [line.strip() for line in infile.readlines()]
+            return words
Comment on lines +52 to 59
Contributor

To avoid repetition:

    if os.path.exists(nltk_path):
        return _read_stopwords_file(nltk_path)
    elif os.path.exists(supplementary_path):
        return _read_stopwords_file(supplementary_path)
    # else: ...

def _read_stopwords_file(path: str) -> List[str]:
    with open(path) as infile:
        return [line.strip() for line in infile.readlines()]

     else:
-        raise NotImplementedError('language {} has no nltk stopwords list'.format(language_code))
+        raise NotImplementedError('language {} has no stopwords list'.format(language_code))
 
 def add_language_string(name, language):
     return '{}_{}'.format(name, language) if language else name
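
The reviewer's suggestion in this hunk could be sketched as follows. This is a minimal, self-contained version, not the PR's code: `get_stopwords_from_paths` is a hypothetical name, the two paths are passed in explicitly (the PR resolves them through `_nltk_stopwords_path` and `_supplementary_path`), and the explicit `encoding='utf-8'` is an assumption.

```python
import os
from typing import List


def _read_stopwords_file(path: str) -> List[str]:
    """Read one stopword per line, stripping surrounding whitespace."""
    # Assumption: the list files are UTF-8 encoded.
    with open(path, encoding='utf-8') as infile:
        return [line.strip() for line in infile]


def get_stopwords_from_paths(
    language_code: str, nltk_path: str, supplementary_path: str
) -> List[str]:
    """Hypothetical helper: prefer the NLTK list, else the supplementary one."""
    for path in (nltk_path, supplementary_path):
        if os.path.exists(path):
            return _read_stopwords_file(path)
    raise NotImplementedError(
        'language {} has no stopwords list'.format(language_code))
```

Looping over the candidate paths keeps the lookup order (NLTK first, supplementary second) in one place, so adding a third source would not need another `elif` branch.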
@@ -87,6 +97,8 @@ def es_settings(languages=[], stopword_analysis=False, stemming_analysis=False):
 
         if stopword_analysis or stemming_analysis:
             if not set_stopword_filter(settings, add_language_string(stopword_filter_name, language), language):
+                warnings.warn('You specified `stopword_analysis=True`, but \
+                    there are no stopwords available for this language')
                 continue # skip languages for which we do not have a stopword list
 
             if stopword_analysis:
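
Two details of the warning added in this hunk may be worth a second look. A backslash continuation inside a string literal keeps the next line's leading whitespace in the message, and the branch is also reached when only `stemming_analysis` is set. A minimal sketch of an alternative, assuming a hypothetical helper `warn_missing_stopwords` that is not part of the PR:

```python
import warnings


def warn_missing_stopwords(language: str) -> None:
    """Hypothetical helper: warn that analysis is skipped for `language`."""
    # Adjacent string literals are concatenated at compile time, so the
    # message carries no stray indentation from the source layout.
    warnings.warn(
        'Stopword or stemming analysis was requested, but no stopword '
        'list is available for language {!r}; skipping it.'.format(language)
    )
```

Naming the language in the message makes the warning easier to act on when a corpus declares several languages.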
@@ -119,7 +131,7 @@ def number_filter():
 
 def make_stopword_filter(language):
     try:
-        stopwords = get_nltk_stopwords(language)
+        stopwords = get_stopwords(language)
         return {
             "type": "stop",
             'stopwords': stopwords
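
The hunk above is cut off before the `except` branch, so how the PR handles a missing list is not visible here. As an illustration only, the filter construction could look like this, assuming a hypothetical `make_stop_filter` that takes the loader as a parameter and returns `None` when no list exists:

```python
from typing import Callable, Dict, List, Optional


def make_stop_filter(
    language: str, load: Callable[[str], List[str]]
) -> Optional[Dict[str, object]]:
    """Hypothetical variant of make_stopword_filter with an injected loader."""
    try:
        stopwords = load(language)
    except NotImplementedError:
        # Assumption: a missing list means "no filter"; the PR's own
        # except branch is not shown in this diff.
        return None
    # Same dictionary shape as in the hunk: a `stop` filter with a word list.
    return {'type': 'stop', 'stopwords': stopwords}
```

Injecting the loader keeps the sketch testable without NLTK data or Django settings.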
12 changes: 12 additions & 0 deletions backend/addcorpus/stopword_data/supplementary_data/README.md
Contributor

👍 👍 👍

@@ -0,0 +1,12 @@
## Supplementary Data Sources
Source 1: For Bulgarian, Czech, Croatian, Galician, Latvian, and Ukrainian, stopword lists were downloaded from this [Github repository](https://github.com/negapedia/nltk/tree/master/corpora/stopwords), by [Marco Chilese](https://github.com/MarcoChilese). The stopword lists are a combination of nltk stopwords (where available) and stopwords from [ranks.nl](https://www.ranks.nl/stopwords/). They were downloaded on 2025-12-18.

Source 2: For Bosnian stopwords, the following publication was used: Sead Jahić, & Jernej Vičič. (2023). Lists of stopwords, polarity shifters and AnAwords of Bosnian language [Data set]. Zenodo. https://doi.org/10.5281/zenodo.10373141

Source 3: For Estonian stopwords, the following Github repository was used: https://github.com/stopwords-iso/stopwords-et?tab=readme-ov-file.

Source 4: For Icelandic stopwords, the following Github repository was used: https://github.com/ViktorMS/stoppord/blob/master/stoppord.csv

Source 5: For Serbian stopwords, the following Github repository was used: https://github.com/Xangis/extra-stopwords/blob/master/serbian

Source 6: For Slovenian stopwords, the following Github repository was used: https://github.com/stopwords-iso/stopwords-sl/blob/master/raw/gh-stopwords-json-sl.txt
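
The lookup code in `es_settings.py` suggests that each supplementary list is a plain-text file in this directory, named after the lowercased language name. A rough sketch of that convention, assuming a small hard-coded mapping in place of `get_language_key` (which the PR derives from the `langcodes` display name) and two hypothetical helper names:

```python
import os
from typing import Dict

# Assumption: a tiny stand-in for get_language_key; only a few of the
# languages mentioned in this README are listed.
LANGUAGE_KEYS: Dict[str, str] = {
    'bg': 'bulgarian',
    'bs': 'bosnian',
    'et': 'estonian',
    'is': 'icelandic',
}


def supplementary_file(data_dir: str, language_code: str) -> str:
    """Hypothetical helper: path of the supplementary list for a language."""
    return os.path.join(data_dir, LANGUAGE_KEYS[language_code])


def has_supplementary_list(data_dir: str, language_code: str) -> bool:
    """True when a list file exists for a known language code."""
    if language_code not in LANGUAGE_KEYS:
        return False
    return os.path.exists(supplementary_file(data_dir, language_code))
```

Under this convention, adding a language would mean dropping a file with the matching name into the directory and recording its source in this README.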