This is a complete pipeline to create a comparable/parallel corpus made of the European Parliament's proceedings, enriched with speakers' metadata.

The pipeline has been tested on macOS Sierra and should also work on other UNIX-like systems. Python 3 is required for almost every script, and some scripts need additional Python modules and/or external tools. Check the specific requirements for each script.
Related projects:
- LinkedEP/The talk of Europe: Plenary debates of the European Parliament as Linked Open Data.
- CoStEP: Corrected & Structured Europarl Corpus
- Koehn's EuroParl: European Parliament Proceedings Parallel Corpus
- ECPC: European Comparable and Parallel Corpora
- DCEP: Digital Corpus of the European Parliament
- `localization`, resources for adaptation of the scripts to different languages.
- `LICENSE`, MIT license.
- `README.md`, this file.
- `add_metadata.py`, script to add MEPs' metadata (CSV) to the proceedings (XML).
- `add_sentences.py`, script to split text into sentences with NLTK's Punkt tokenizer.
- `compile.sh`, script to run the whole pipeline to compile the EuroParl corpus.
- `dates.txt`, one date per line in format YYYY-MM-DD.
- `get_meps.py`, script to scrape MEPs' information.
- `get_proceedings.py`, script to scrape the proceedings of the European Parliament.
- `langid_filter.py`, script to filter out paragraphs whose actual language is not the expected one (i.e. the language of the proceedings version).
- `meps_ie.py`, script to extract MEPs' metadata from HTML to CSV.
- `proceedings_txt.py`, script to extract text from HTML proceedings.
- `proceedings_xml.py`, script to model text and metadata from HTML proceedings as XML.
- `translationese_filter.py`, script to classify utterances as originals or translations, and even by native speaker.
- `treetagger.py`, script to tokenize, lemmatize and PoS-tag with TreeTagger, producing well-formed XML.
You can find the complete pipeline to compile the EuroParl corpus in `compile.sh`:
- Download proceedings in HTML with `get_proceedings.py`.
- Download MEPs' metadata in HTML with `get_meps.py`.
- Extract MEPs' information into a CSV file with `meps_ie.py`.
- Model proceedings as XML with `proceedings_xml.py`.
- Filter out text units not in the expected language (e.g. Bulgarian text in the English version) with `langid_filter.py`.
- Add MEPs' metadata to proceedings with `add_metadata.py`.
- Add sentence boundaries (if needed) with `add_sentences.py`.
- Annotate token, lemma and PoS with TreeTagger with `treetagger.py`.
- Separate originals from translations, and optionally filter by native speakers, with `translationese_filter.py`.
It adds MEPs' metadata to interventions in the proceedings. Be aware that not all speakers addressing the European Parliament are MEPs: there are also members of other European institutions, representatives of national institutions, guests, etc. For those there is currently no metadata beyond the information extracted from the proceedings themselves.
It reads 3 files containing the metadata:

- `meps.csv`, which contains basic information about the MEP: ID, name, nationality, birth date, birth place, death date, death place.
- `national_parties.csv`, which contains the political affiliation of the MEP in his/her country: ID, start date, end date, and name of the party.
- `political_groups.csv`, which contains the political affiliation at the European Parliament: ID, Member State, start date, end date, name of the group, role within the group.
For each proceedings file in XML it retrieves all interventions whose speaker is an MEP, then adds the relevant speaker's metadata to the intervention. By relevant we mean the information valid on the day of the session (sketched below).
- Python 3
- lxml
- pandas
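A minimal sketch of the date-validity lookup with pandas; the column names are illustrative, the actual headers of `national_parties.csv` may differ:

```python
# Minimal sketch of the date-validity lookup; column names are illustrative.
import pandas as pd

parties = pd.read_csv("national_parties.csv", parse_dates=["start_date", "end_date"])

def affiliation_at(mep_id, session_date):
    """Return the national party affiliation(s) valid for a MEP on a session date."""
    rows = parties[parties["id"] == mep_id]
    valid = (rows["start_date"] <= session_date) & (
        rows["end_date"].isna() | (session_date <= rows["end_date"])
    )
    return rows[valid]

# e.g. affiliation_at(33569, pd.Timestamp("2009-05-05"))
```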
It splits the text contained in a given XML element into sentences, using the NLTK Punkt tokenizer.

For each XML file, it extracts all elements containing text. Each unit is passed to the tokenizer, which returns a list of sentences; these are converted into subelements of the element that contained the text (sketched below).
- Python 3
- lxml
- nltk and the Punkt Tokenizer Models for nltk.tokenize.punkt, follow Installing NLTK Data instructions.
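A minimal sketch of the splitting step, assuming the Punkt models are installed (`nltk.download('punkt')`); the element names are illustrative:

```python
# Minimal sketch: replace the text of a <p> element with one <s> subelement
# per sentence. Real-world elements with tails or children need more care.
from lxml import etree
from nltk.tokenize import sent_tokenize

def add_sentences(p, language="english"):
    sentences = sent_tokenize(p.text or "", language=language)
    p.text = None
    for sentence in sentences:
        s = etree.SubElement(p, "s")
        s.text = sentence
```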
It runs the whole pipeline in one shot.
It is a shell script running a list of commands sequentially. If no arguments are provided, it runs the full pipeline for all the supported languages. One can provide the `-l` language argument to run the pipeline only for a particular language, and the `-p` pattern argument to restrict the processing to a year, month, day...
All the requirements listed for the individual scripts, and a bash shell.
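For example (the exact pattern syntax is an assumption; check the script's usage message):

```bash
# run the full pipeline for all supported languages
bash compile.sh

# run only the English pipeline, restricted to 2009 (assumed pattern syntax)
bash compile.sh -l EN -p "2009*"
```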
It downloads the MEPs' information available on the European Parliament's website in HTML format.

First, it gets a list of all MEPs (past and present). For each item in this list, it generates a URL and downloads the page containing the basic information and the history record of the speaker.
- Python 3
- lxml
- requests
It downloads all the proceedings in a particular language version within a range of dates.

If a file with dates is given, it generates a URL for each date and downloads the proceedings in HTML format. If no file with dates is provided, it generates all possible dates within a range and tries to download only those URLs returning a successful response.
- Python 3
- requests
Sometimes interventions remain untranslated, and thus their text appears in the original language. To avoid this noise, `langid_filter.py` identifies the most probable language of each text unit (namely paragraphs) and removes those paragraphs which are not in the expected language (e.g. Bulgarian fragments found in the English version).

All paragraphs (or other units containing the text to be analyzed) are retrieved. Each unit is analyzed with two language identifiers available for Python: `langdetect` and `langid`. A series of heuristics is then used to exploit the output of the two identifiers (sketched below):

- If the expected language and the language identified by both tools are the same, the text is in the language of the version at stake.
- If both tools agree on a language different from the expected one, the text is in a different language and is thus removed.
- For cases without perfect agreement, a few rules are formalized which work fairly well.
- Python 3
- lxml
- langdetect
- langid
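A minimal sketch of the agreement check, assuming `langdetect` and `langid` are installed; the function is illustrative, not the script's actual API:

```python
# Minimal sketch of the agreement heuristic; the fallback rule for
# disagreements is illustrative, the real script formalizes more rules.
import langid
from langdetect import detect

def keep_paragraph(text, expected_lang):
    """Return True if the paragraph is probably in the expected language."""
    guess_a = detect(text)                  # e.g. 'en'
    guess_b, _score = langid.classify(text)
    if guess_a == guess_b:
        return guess_a == expected_lang     # both agree: trust them
    # no perfect agreement: keep if either identifier matches expectations
    return expected_lang in (guess_a, guess_b)
```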
It extracts MEPs' information from semi-structured HTML and yields the information in tabular format.

It reads each HTML instance and, using XPath and regular expressions, finds the relevant information, which is finally serialized as three CSV files (sketched below):

- `meps.csv`
- `national_parties.csv`
- `political_groups.csv`
- Python 3
- lxml
- pandas
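A minimal sketch of the extraction, assuming lxml and pandas; the XPath and the columns are illustrative, the real page layout dictates the actual expressions:

```python
# Minimal sketch of HTML-to-CSV extraction; XPath and columns are illustrative.
import re
import pandas as pd
from lxml import html

def parse_mep(path):
    tree = html.parse(path)
    name = (tree.findtext(".//title") or "").strip()
    match = re.search(r"(\d+)", path)      # MEP id encoded in the file name
    return {"id": match.group(1) if match else None, "name": name}

rows = [parse_mep(p) for p in ["33569_history.html"]]  # illustrative input
pd.DataFrame(rows).to_csv("meps.csv", index=False)
```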
It extracts basic metadata about the parliamentary session, the structure of the text, information about the speakers and the source language of the utterances, and the actual text of the proceedings.
- text: a session of parliamentary debates; it contains one or more sections.
  - id: YYYYMMDD.XX, date and language
  - lang: 2-letter ISO code of the language version
  - date: YYYY-MM-DD
  - place
  - edition
- section: an agenda item
  - id: ID as in the HTML, for reference
  - title: CDATA, text of the heading/headline
- intervention
  - id: ID as in the HTML, for reference
  - speaker_id: if MEP, unique code
  - name
  - is_mep: True or False
  - mode: spoken, written
  - role: function of the speaker at the moment of speaking
- p: paragraphs
  - sl: source language
  - PCDATA: actual text
- a: annotations
  - text: CDATA
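A schematic example of the resulting XML (all values invented; the exact element/attribute placement may differ from the script's actual output):

```xml
<!-- schematic example, invented values -->
<text id="20090505.EN" lang="en" date="2009-05-05" place="Strasbourg" edition="...">
  <section id="itm-001" title="1. Opening of the session">
    <intervention id="int-001" speaker_id="33569" name="..." is_mep="True"
                  mode="spoken" role="...">
      <p sl="en">Actual text of the utterance...</p>
    </intervention>
  </section>
</text>
```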
It reads each HTML file and, using XPath and regular expressions, maintains the structure of the debates and extracts metatextual information about the SL, the speaker, etc.
- Python 3
- lxml
- regex
- dateparser
It extracts all the text in the HTML proceedings.

It parses the HTML, extracts only the text, and cleans up the output a bit (sketched below).
- Python 3
- lxml
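A minimal sketch of plain-text extraction with lxml; the cleanup here is only whitespace normalization, the real script does a bit more:

```python
# Minimal sketch: parse HTML, keep only the text, normalize whitespace.
from lxml import html

def extract_text(path):
    tree = html.parse(path)
    text = " ".join(tree.getroot().itertext())
    return " ".join(text.split())
```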
It filters out interventions to get:
- only originals
- all translations
- the translations of a particular SL
It reads proceedings in XML and outputs XML with only the relevant paragraphs and their corresponding ancestors. If filtering by native speakers (defined here as speakers holding the nationality of a country where the SL is an official language), XML enriched with MEPs' metadata is required.
It retrieves all interventions and keeps only (sketched below):

- originals: if `sl` == `lang`
- translations: if `sl` != `lang` and `sl` != unknown
- translations from language xx: if `sl` == xx
- Python 3
- lxml
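A minimal sketch of the classification rules above; the attribute handling is illustrative:

```python
# Minimal sketch of the sl/lang test on a paragraph element.
def classify(p, lang):
    """Return 'original', 'translation' or 'unknown' for a <p> element."""
    sl = (p.get("sl") or "unknown").lower()
    if sl == lang:
        return "original"
    if sl != "unknown":
        return "translation"  # optionally keep only those with sl == 'xx'
    return "unknown"
```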
It tokenizes, lemmatizes, PoS-tags and splits a text into sentences.

It reads an XML file and annotates the text contained in a given element, taking care to produce well-formed XML as output (sketched below).
- nltk (if tagging sentences is required)
- treetaggerwrapper
- TreeTagger
- language parameters
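A minimal sketch of the annotation step with `treetaggerwrapper`, assuming TreeTagger and its language parameter files are installed (the `TAGDIR` environment variable may need to be set):

```python
# Minimal sketch: tag a text and print token, PoS and lemma triples.
import treetaggerwrapper

tagger = treetaggerwrapper.TreeTagger(TAGLANG="en")
tags = treetaggerwrapper.make_tags(tagger.tag_text("This is a sentence."))
for tag in tags:
    print(tag.word, tag.pos, tag.lemma)
```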
We use the script `get_proceedings.py` to:

- generate a range of dates, or read them from a file,
- for each date:
  - generate a URL,
  - request it,
  - if it exists, download the document,
  - else, proceed with the next date.

A sketch of this loop follows the example URL below.
This is the typical URL for the proceedings of a given day (here, 5 May 2009): http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT+CRE+20090505+ITEMS+DOC+XML+V0//EN&language=EN
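A minimal sketch of the loop, assuming `requests`; the URL template is taken from the example above, the file naming is illustrative:

```python
# Minimal sketch of URL generation and download; date is YYYYMMDD.
import requests

TEMPLATE = ("http://www.europarl.europa.eu/sides/getDoc.do?pubRef=-//EP//TEXT"
            "+CRE+{date}+ITEMS+DOC+XML+V0//{lang}&language={lang}")

def download_day(date, lang="EN"):
    response = requests.get(TEMPLATE.format(date=date, lang=lang))
    if response.ok:
        with open(f"{date}.{lang}.html", "wb") as out:
            out.write(response.content)
        return True
    return False  # no sitting that day: proceed with the next date
```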
```bash
# get proceedings for English with defaults
python get_proceedings.py -o /path/to/output/dir -l EN

# get proceedings for Spanish using a list of dates
python get_proceedings.py -o /path/to/output/dir -l ES -d dates.txt

# get proceedings for German using a range of dates between two values
python get_proceedings.py -o /path/to/output/dir -l DE -s 2000-01-01 -e 2004-07-01
```
The European Parliament website maintains a database with all Members of the European Parliament. We use the script `get_meps.py` to download the metadata of all MEPs as HTML (a sketch of the first two steps follows the usage example below):

- The script retrieves an XML file containing a list of all MEPs' full names and unique IDs: http://www.europarl.europa.eu/meps/en/xml.html?query=full&filter=all&leg=0;
- for each MEP, it generates a URL to the actual HTML file containing the metadata: http://www.europarl.europa.eu/meps/en/33569/33569_history.html;
- requests it;
- downloads it and proceeds with the next.
```bash
python get_meps.py -o /path/to/output/dir
```
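A minimal sketch of the first two steps, assuming `requests` and lxml; the `<id>` element name in the list XML is an assumption:

```python
# Minimal sketch: list all MEP ids, then build per-MEP history URLs.
import requests
from lxml import etree

LIST_URL = "http://www.europarl.europa.eu/meps/en/xml.html?query=full&filter=all&leg=0"

root = etree.fromstring(requests.get(LIST_URL).content)
for mep_id in root.xpath("//id/text()"):   # assumed element name
    url = f"http://www.europarl.europa.eu/meps/en/{mep_id}/{mep_id}_history.html"
    # requests.get(url) ... and save to disk
```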
- This is also a valid URL pattern for a MEP: http://www.europarl.europa.eu/meps/en/33569/SYED_KAMALL_history.html
- `query=full`: all available data. http://www.europarl.europa.eu/meps/en/xml.html?query=full
- `filter=all`: all MEPs; alternatively, one can choose an alphabet letter (A, B, ...) to get only the speakers whose family name starts with that letter. http://www.europarl.europa.eu/meps/en/xml.html?query=full&filter=all or http://www.europarl.europa.eu/meps/en/xml.html?query=full&filter=C
- `leg=0`: all legislatures; an integer selects a past legislature; if no value is provided, only the current legislature. http://www.europarl.europa.eu/meps/en/xml.html?query=full&filter=all&leg=0
- http://www.europarl.europa.eu/meps/en/xml.html?query=full&filter=all yields basic metadata for the current legislature.
See also: http://docs.python-guide.org/en/latest/scenarios/scrape/
```bash
python proceedings_txt.py -i /path/to/html -o /path/to/output

python proceedings_xml.py -i /path/to/html -o /path/to/xml -l EN

python langid_filter.py -i /path/to/xml -o /path/to/xml

python meps_ie.py -i /path/to/metadata/dir -o /path/to/output/dir

python add_metadata.py -m /path/to/meps.csv -n /path/to/national_parties.csv -g /path/to/political_groups.csv -x /path/to/source/xml/dir -p "*.xml" -o /path/to/output/xml/dir
```
Filtering text by translationese criteria: originals, translations, and restriction to native speakers:
```bash
# English originals in English proceedings
python translationese_filter.py -i /path/to/english/proceedings -o /path/to/output/dir -l en

# all translations in English proceedings
python translationese_filter.py -i /path/to/english/proceedings -o /path/to/output/dir -l all

# translations from English in Spanish proceedings
python translationese_filter.py -i /path/to/spanish/proceedings -o /path/to/output/dir -l en

# English originals in English proceedings by native speakers
python translationese_filter.py -i /path/to/english/proceedings -o /path/to/output/dir -l en -n
```