This repository contains the scripts to build the database and datasets of the European Court of Human Rights OpenData (ECHR-OD) project. The repository serves several purposes:
- Reproducibility: anyone can rebuild the entire database from scratch.
- Extensibility: any new version of the database must be created from an updated version of these scripts.
- Revision: all cases are processed automatically. There are many corner cases, and this repository allows anyone to inspect the intermediate files to check whether the results are correct and to locate the root cause of parsing errors.

- Official website: ECHR-OD project
- Original paper: paper, code, supplementary material
- Creation process: https://github.com/echr-od/ECHR-OD_process
- Explorer sources: https://github.com/echr-od/ECHR-OD_explorer
- Mailing list: https://groups.google.com/g/echr-od
  Receive important announcements about the project (at most a couple of emails per year).
- Discussion:
  Get help or discuss the project. Matrix is a decentralized, secure, open-source real-time communication system.

If you are using the project, please consider citing:
@article{ECHRDB,
title = {On Integrating and Classifying Legal Text Documents},
author = {Quemy, A. and Wrembel, R.},
year = 2020,
journal = {International Conference on Database and Expert Systems Applications (DEXA)}
}
There are two distinct types of versions:
- Semantic versioning (e.g. 2.0.1) indicates the version of the process. It relates only to the code and the type of data available.
  - A major revision indicates a change in the type of data available.
  - Minor revisions and patches concern bug fixes and improvements.
- Date of release (e.g. 2020-11-01) indicates when a build was generated.
The database is meant to be updated every month with new cases. New releases are built upon an image created from the latest sources. Therefore, the date of release is technically enough to identify the semantic version. However, semantic versioning helps the maintainers and contributors with the development.
Recreating the database requires Docker.
To build the environment image:
docker build -f Dockerfile -t echr_build .
As long as dependencies are not changed, there is no need to rebuild the image.
Once the image is built, the container help can be displayed with:
docker run -ti --mount src=$(pwd),dst=/tmp/echr_process/,type=bind echr_build -h
In particular, to build the database:
docker run -ti --mount src=$(pwd),dst=/tmp/echr_process/,type=bind echr_build build
The entry point of the Extract-Transform-Load (ETL) process is build.py.
The different ETL steps can be found in the subfolder echr/steps.
The main build script loads a workflow made of steps and executes each of them.
Workflows are YAML files and can be found in the folder workflows.
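The load-and-execute loop described above can be sketched as follows. This is a minimal illustration and not the code of build.py: the workflow layout (a steps list whose entries have a name and optional args), the REGISTRY dictionary, and the step names fetch and parse are all assumptions made for the example. The workflow is written as an already-parsed dictionary, as a YAML parser would return it.

```python
def run_workflow(workflow, registry):
    """Run every step of `workflow` in order; return the names of the steps run."""
    executed = []
    for step in workflow["steps"]:
        name = step["name"]
        if name not in registry:
            raise KeyError(f"Unknown step: {name}")
        # Call the function registered for this step with its arguments.
        registry[name](**step.get("args", {}))
        executed.append(name)
    return executed


# Hypothetical steps that only record that they were called.
calls = []


def fetch(max_documents=-1):
    calls.append(("fetch", max_documents))


def parse():
    calls.append(("parse", None))


# Hypothetical mapping from step names to the functions implementing them.
REGISTRY = {"fetch": fetch, "parse": parse}

# Hypothetical workflow, as it might look once the YAML file is parsed.
WORKFLOW = {
    "steps": [
        {"name": "fetch", "args": {"max_documents": 100}},
        {"name": "parse"},
    ]
}

if __name__ == "__main__":
    print(run_workflow(WORKFLOW, REGISTRY))
```

Keeping the steps in a registry keyed by name means a workflow file only needs to list names and arguments, which is one plausible way to let several workflows share the same steps.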
The workflows provided with the project are:
- Local (local.yml): full ETL build locally,
- Release (release.yml): full ETL including deployment to the server,
- Database (database.yml): build the database only (no NLP model, no datasets),
- Datasets (datasets.yml): build the datasets only (does not generate the database),
- NLP Model (NLP_model.yml): build only the NLP model,
- Runner (runner.yml): execute a workflow on an external runner.
We have the following relations:
Datasets = NLP Model + datasets generation step
Local = Database + Datasets
Release = Local + deployment step
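These relations can be read as list concatenation over groups of steps. The sketch below only restates them in Python; the labels database, nlp_model, datasets_generation and deployment are placeholders for groups of steps, not actual step names from the workflow files.

```python
# Placeholder labels standing for groups of steps (not real step names).
DATABASE = ["database"]
NLP_MODEL = ["nlp_model"]

# Datasets = NLP Model + datasets generation step
DATASETS = NLP_MODEL + ["datasets_generation"]

# Local = Database + Datasets
LOCAL = DATABASE + DATASETS

# Release = Local + deployment step
RELEASE = LOCAL + ["deployment"]

if __name__ == "__main__":
    print(RELEASE)
```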
This separation was made because generating the NLP model takes up 95% of the whole Release workflow time and a tremendous amount of RAM (>16 GB).
Workflows may define variables using uppercase names starting with $ (e.g. $MAX_DOCUMENTS).
The variables are replaced during the build process using the following order of priority:
- Environment variable
- CLI parameter
- Configuration file, under build.env
- Global variable defined in build.py
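The priority order above can be sketched as a lookup that returns the first source defining the variable. This is an illustration under stated assumptions, not the implementation in build.py: the function names resolve_variable and substitute, the regular expression used to recognise variables, and the representation of each source as a plain dictionary are all hypothetical.

```python
import re

# Assumed variable syntax: "$" followed by an uppercase name.
VARIABLE_PATTERN = re.compile(r"\$([A-Z][A-Z0-9_]*)")


def resolve_variable(name, environ, cli_params, config_env, global_defaults):
    """Return the value of `name` from the first source that defines it.

    Sources are checked in priority order: environment variables,
    CLI parameters, the build.env section of the configuration file,
    then global defaults.
    """
    for source in (environ, cli_params, config_env, global_defaults):
        if name in source:
            return source[name]
    raise KeyError(f"Undefined workflow variable: ${name}")


def substitute(text, environ, cli_params, config_env, global_defaults):
    """Replace every $VARIABLE found in `text` with its resolved value."""
    def replace(match):
        value = resolve_variable(
            match.group(1), environ, cli_params, config_env, global_defaults
        )
        return str(value)

    return VARIABLE_PATTERN.sub(replace, text)


if __name__ == "__main__":
    # The environment value wins over the CLI value.
    print(substitute(
        "--max_documents $MAX_DOCUMENTS",
        {"MAX_DOCUMENTS": "100"},
        {"MAX_DOCUMENTS": "50"},
        {},
        {"MAX_DOCUMENTS": "-1"},
    ))
```

In a real run the first dictionary would typically be os.environ and the last one the defaults defined in the build script.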
The general configuration file is config.yml and contains three parts:
- logging: related to logging files
- steps: configuration for each step on top of the workflow
- build: specific build configuration, in particular the section env contains the variables available to the whole workflow
There are two log files:
- The build log file: build/<build>/logs/build.html and build/<build>/logs/build.txt
- The process log file, mostly used for debugging: logs/build.log
To run the tests:
docker run -ti --mount src=$(pwd),dst=/tmp/echr_process/,type=bind echr_build test
- version 2.0.0: Changelogs
- version 1.0.2: Changelogs
- version 1.0.1: Changelogs
- Alexandre Quemy, alexandre.quemy@gmail.com
- Paweł Mróz, pawel.j.mroz@ibm.com
- Natalia Łopuszyńska, nlopuszynska@gmail.com