@lukavdplas lukavdplas commented Jan 22, 2026

Update to the EU parliamentary corpus. I've indexed the new data so this is ready to be released.

This update uses a different dataset for the older data (pre-2024). Compared to the current dataset (Talk of Europe), this dataset (EUPDCorp) covers a longer period, is easier to parse, and includes translations consistently.

For the post-2024 data, which is extracted from the EU's Open Data API, there are several fixes and improvements:

  • Speech text is now the full text, instead of just the first paragraph.
  • The reader makes far fewer requests to the API, which speeds up processing, as individual requests can take a long time. (The full job now takes hours instead of days.) It no longer cold-calls debate IDs to check whether there was a debate on a given day, and it requests speeches in batches of 50 instead of one at a time.
  • The original-language text is now extracted as well.
  • More speaker metadata is extracted, including birth year, gender, and national party.
  • Various fixes for missing keys, null values, and other things that trip up extraction.
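
To illustrate the batching described above, here is a minimal sketch of requesting speeches in groups of 50. This is not the code from this branch: `batched`, `fetch_speeches`, and the `fetch_batch` callable are hypothetical names, and the actual request to the Open Data API is left to the caller, since its endpoints and parameters are not shown here.

```python
from typing import Callable, Dict, Iterator, List, Sequence

BATCH_SIZE = 50  # batch size mentioned in the description above


def batched(ids: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids`, each at most `size` items long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_speeches(
    ids: Sequence[str],
    fetch_batch: Callable[[List[str]], List[Dict]],
    size: int = BATCH_SIZE,
) -> List[Dict]:
    """Collect speeches for all `ids`, calling `fetch_batch` once per batch.

    `fetch_batch` is a hypothetical callable that performs one API request
    for a list of speech IDs and returns the parsed speech records.
    """
    speeches: List[Dict] = []
    for batch in batched(ids, size):
        speeches.extend(fetch_batch(batch))
    return speeches
```

With 120 speech IDs, this makes three calls to `fetch_batch` (50, 50, and 20 IDs) rather than 120 separate requests.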

This branch includes some features that should be moved to ianalyzer_readers, namely:

Issues:

@lukavdplas lukavdplas changed the title from "Feature/eudpcorp" to "Update European Parliament corpus" Jan 22, 2026
@lukavdplas lukavdplas added the "corpus" label (changes to corpus definitions or new corpora) Jan 22, 2026
@lukavdplas lukavdplas self-assigned this Jan 28, 2026
@lukavdplas lukavdplas marked this pull request as ready for review January 29, 2026 12:00
@JeltevanBoheemen JeltevanBoheemen merged commit 89d5df3 into develop Feb 2, 2026
3 checks passed
@JeltevanBoheemen JeltevanBoheemen deleted the feature/eudpcorp branch February 2, 2026 10:50
Development

Successfully merging this pull request may close these issues.

  • European Parliament: incorrect links
  • New European Parliament dataset
